The grammatical annotation of speech corpora
Techniques and perspectives
This chapter discusses the grammatical annotation of speech corpora on the one hand (C-ORAL-Brasil, NURC) and speech-like text on the other (e-mail, chat, TV news, parliamentary discussions), drawing on Portuguese data for the former and English data for the latter. We try to identify and compare linguistic orality markers (“speechlikeness”) in different genres, and argue that broad-coverage Constraint Grammar parsers such as PALAVRAS and EngGram can be adapted to these features and used across the text-speech divide. Special topics include emoticons, phonetic variation and syntactic features. For ordinary speech corpora we propose a system of two-level annotation, in which overlaps, retractions and phonetic variation are maintained as meta-tagging, while an orthographically normalized textual layer receives conventional annotation. In the absence of punctuation, syntactic segmentation can be achieved by exploiting prosodic breaks as delimiters in parsing rules. With the exception of chat data, our modified “oral” CG parsers perform reasonably close to their written-language counterparts, even for true transcribed speech, achieving accuracy rates (F-scores) above 98% for PoS tags and 93–95% for syntactic function.
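The two-level idea and the use of prosodic breaks as delimiters can be illustrated with a minimal Python sketch. It is not the chapter's implementation and does not use PALAVRAS or any Constraint Grammar tool. The transcription notation ("//" for a terminal break, "/" for a non-terminal break, "[/]" for a retraction, "<...>" for overlapping speech), the toy variant lexicon, the tag names, and the decision to leave retracted tokens out of the parser input are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed toy lexicon: phonetic variant -> normalized orthography.
VARIANTS = {"tá": "está", "pra": "para", "cê": "você"}


@dataclass
class Token:
    surface: str                               # form as transcribed
    normalized: str                            # form handed to the parser
    meta: list = field(default_factory=list)   # meta-tags kept alongside


def two_level(transcript):
    """Split a transcript at terminal breaks and build two-level tokens."""
    segments = []
    # Assumed notation: "//" marks a terminal prosodic break (segment delimiter).
    for chunk in transcript.split("//"):
        tokens = []
        in_overlap = False
        for raw in chunk.split():
            if raw == "/":
                # Non-terminal break: stored as a meta-tag on the previous token.
                if tokens:
                    tokens[-1].meta.append("<break>")
                continue
            if raw == "[/]":
                # Retraction marker: the previous token was abandoned.
                if tokens:
                    tokens[-1].meta.append("<retract>")
                continue
            meta = []
            if raw.startswith("<"):
                in_overlap = True
            word = raw.strip("<>")
            if in_overlap:
                meta.append("<overlap>")
            if raw.endswith(">"):
                in_overlap = False
            norm = VARIANTS.get(word.lower(), word)
            if norm != word:
                meta.append("<variant>")
            tokens.append(Token(word, norm, meta))
        if tokens:
            segments.append(tokens)
    return segments


def parser_input(segment):
    """Normalized text layer for one segment, skipping retracted tokens."""
    return " ".join(t.normalized for t in segment if "<retract>" not in t.meta)


if __name__ == "__main__":
    for seg in two_level("cê tá / <pra casa> // eu [/] eu vou //"):
        print(parser_input(seg), [(t.surface, t.meta) for t in seg])
```

Each segment can then be passed to a parser as if it were a punctuated sentence, while the surface forms and meta-tags stay available for later inspection.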
References
Bick, Eckhard. 1998. Tagging speech data. Constraint grammar analysis of spoken Portuguese. In Proceedings of the 17th Scandinavian Conference of Linguistics. Odense: Odense University.

Bick, Eckhard. 2000. The Parsing System PALAVRAS. Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus: Aarhus University Press.

Bick, Eckhard & Módolo, Marcelo. 2005. Letters and editorials: A grammatically annotated corpus of 19th century Brazilian Portuguese. In Romance Corpus Linguistics, II: Corpora and Historical Linguistics (Proceedings of the 2nd Freiburg Workshop on Romance Corpus Linguistics, Sept. 2003), Claus Pusch, Johannes Kabatek & Wolfgang Raible (eds), 271–280. Tübingen: Gunter Narr.

Bick, Eckhard. 2009. Introducing probabilistic information in constraint grammar parsing. In Proceedings of Corpus Linguistics 2009, Liverpool, UK. [URL]

Brill, Eric. 1992. A simple rule-based part of speech tagger. In Proceedings of the Workshop on Speech and Natural Language, HLT ’91, 112–116. Morristown NJ: ACL.

de Castilho, Ataliba (ed.). 1993. Gramática do Português Falado, Vol. 3. Campinas: Editora da Unicamp.

DeLiema, David, Steen, Francis & Turner, Mark. 2012. Language, gesture and audiovisual communication: A massive online database for researching multimodal constructions. Lecture, 11th Conceptual Structure, Discourse and Language Conference, Vancouver, May 17–20.

Johannessen, Janne Bondi, Priestley, Joel, Hagen, Kristin, Åfarli, Tor Anders & Vangsnes, Øystein Alexander. 2009. The Nordic Dialect Corpus – An advanced research tool. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009) [NEALT Proceedings Series 4], Kristiina Jokinen & Eckhard Bick (eds). Odense: University of Odense.

Karlsson, Fred, Voutilainen, Atro, Heikkilä, Juha & Anttila, Arto. 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Berlin: Mouton de Gruyter.

Klimt, Brian & Yang, Yiming. 2004. Introducing the Enron Corpus. In First Conference on Email and Anti-Spam (CEAS), Mountain View, CA. [URL] (29 May 2010).

Luz, Saturnino, Masoodian, Masood, Rogers, Bill & Deering, Chris. 2008. Interface design strategies for computer-assisted speech transcription. In Proceedings of the 20th Australasian Conference on Computer-Human Interaction, Cairns, Australia, 203–210. New York NY: ACM.

Maamouri, Mohamed, Bies, Ann, Kulick, Seth, Zaghouani, Wajdi, Graff, Dave & Ciul, Mike. 2010. From speech to trees: Applying treebank annotation to Arabic broadcast news. In Proceedings of LREC 2010, Valletta, Malta.

Moreno, Antonio & Guirao, José M. 2003. Tagging a spontaneous speech corpus of Spanish. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, Borovets, Bulgaria, 292–296.

Müürisep, Kaili & Uibo, Heli. 2006. Shallow parsing of spoken Estonian using constraint grammar. In Proceedings of NODALIDA-2005 – Special Session on Treebanking [Copenhagen Studies in Language 33], Peter Juel Henriksen & Peter Rossen Skadhauge (eds).

Panunzi, Alessandro, Picchi, Eugenio & Moneglia, Massimo. 2004. Using PiTagger for lemmatization and PoS tagging of a spontaneous speech corpus: C-Oral-Rom Italian. In Proceedings of the 4th LREC Conference, Vol. 2, Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa & Raquel Silva (eds), 563–566. Paris: ELRA.

Raso, Tommaso & Mello, Heliana. 2010. The C-ORAL BRASIL corpus. In Bootstrapping Information from Corpora in a Cross-Linguistic Perspective, Massimo Moneglia & Alessandro Panunzi (eds). Florence: Università degli Studi di Firenze, Biblioteca Digitale.

Raso, Tommaso & Mello, Heliana. 2012. C-ORAL-BRASIL I: Corpus de referência da fala informal brasileira. Belo Horizonte: Editora UFMG.

Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing 1994, 44–49. Manchester: University of Manchester.
