The grammatical annotation of speech corpora: Techniques and perspectives

Bick, Eckhard

doi:10.1075/scl.61.04bic

Part of

Spoken Corpora and Linguistic Studies
Edited by Tommaso Raso and Heliana Mello
[Studies in Corpus Linguistics 61] 2014
pp. 105–128


This chapter discusses the grammatical annotation of speech corpora (C-ORAL-Brasil, NURC) on the one hand and of speech-like text (e-mail, chat, TV news, parliamentary discussions) on the other, drawing on Portuguese data for the former and English data for the latter. We identify and compare linguistic orality markers (“speechlikeness”) across genres, and argue that broad-coverage Constraint Grammar parsers such as PALAVRAS and EngGram can be adapted to these features and used across the text-speech divide. Special topics include emoticons, phonetic variation and syntactic features. For ordinary speech corpora we propose a system of two-level annotation, in which overlaps, retractions and phonetic variation are maintained as meta-tagging, while an orthographically normalized textual layer receives conventional annotation. In the absence of punctuation, syntactic segmentation can be achieved by exploiting prosodic breaks as delimiters in parsing rules. With the exception of chat data, our modified “oral” CG parsers perform reasonably close to their written-language counterparts, even for true transcribed speech, achieving accuracy rates (F-scores) above 98% for PoS tags and 93–95% for syntactic function.
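The two-level idea and the use of prosodic breaks as delimiters can be illustrated with a minimal Python sketch. It is not the chapter's implementation and is unrelated to the actual PALAVRAS or EngGram rule sets. The break symbols (`//` for a terminal break, `/` for a non-terminal one), the `NORMALIZE` table, the `Token` class and the meta-tag names `<phon>` and `<break>` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed break markers; the transcription scheme actually used may differ.
TERMINAL_BREAK = "//"
NON_TERMINAL_BREAK = "/"

# Hypothetical map from transcribed variants to normalized orthography.
NORMALIZE = {"tá": "está", "pra": "para", "cê": "você"}


@dataclass
class Token:
    surface: str     # form as transcribed
    normalized: str  # form that a conventional parser would see
    meta: list = field(default_factory=list)  # meta-tags kept apart from the parsed layer


def annotate(transcript: str) -> list:
    """Split a transcript into utterances at terminal breaks.

    Non-terminal breaks do not become tokens; they are stored as a
    meta-tag on the preceding token so later rules could still use them.
    """
    utterances, current = [], []
    for item in transcript.split():
        if item == TERMINAL_BREAK:
            if current:
                utterances.append(current)
                current = []
        elif item == NON_TERMINAL_BREAK:
            if current:
                current[-1].meta.append("<break>")
        else:
            key = item.lower()
            token = Token(item, NORMALIZE.get(key, item))
            if key in NORMALIZE:
                token.meta.append("<phon>")  # mark phonetic variation
            current.append(token)
    if current:
        utterances.append(current)
    return utterances


if __name__ == "__main__":
    for utterance in annotate("cê tá bem // eu vou / pra casa //"):
        print([(t.surface, t.normalized, t.meta) for t in utterance])
```

In this sketch the normalized layer is what a conventional tagger would receive, while the surface forms and meta-tags remain available for searching the corpus for orality markers.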

Published online: 14 November 2014


