Methodological issues for spontaneous speech corpora compilation
The case of C-ORAL-BRASIL
Spontaneous speech corpus compilation has grown steadily over the past 20 years. This growth is due mainly to technological advances that allow highly accurate recording in vivo, new insights from empirically based linguistic theory, concern for the documentation of threatened languages, and the high relevance of findings to speech recognition applications. This paper discusses methodologies associated with spontaneous speech corpus compilation that shed light on aspects relevant to understanding linguistic phenomena specific to spoken language. The compilation of C-ORAL-BRASIL I, a corpus of informal spontaneous Brazilian Portuguese speech, is used, among other examples, as the basis for the discussion.