Methodological issues for spontaneous speech corpora compilation
The case of C-ORAL-BRASIL
Spontaneous speech corpus compilation has grown steadily over the past 20 years. This growth is due mainly to technological advances that allow highly accurate in vivo recording, new insights from empirically based linguistic theory, concern for the documentation of threatened languages, and the high relevance of findings to speech recognition applications. This paper discusses methodologies associated with spontaneous speech corpus compilation that shed light on specific aspects relevant to understanding linguistic phenomena in spoken language. The compilation process of C-ORAL-BRASIL I, an informal spontaneous speech corpus of Brazilian Portuguese, is used, among other examples, as the basis for the discussion.