Article published in: Compilation, transcription, markup and annotation of spoken corpora
Edited by John M. Kirk and Gisle Andersen
[International Journal of Corpus Linguistics 21:3] 2016
pp. 372–395
Accounting for ELF
Categorising the unconventional in POS-tagging the VOICE corpus
This paper reports on some issues encountered when using various ‘external points of reference’ in the development of POS-tagging guidelines for the Vienna-Oxford International Corpus of English (VOICE). VOICE is a corpus of spoken English as a Lingua Franca (ELF) containing naturally occurring, plurilingual data. As in all kinds of natural language use, speakers recorded in VOICE exploit available linguistic resources, which often results in non-codified language use and language which is difficult to classify unambiguously. However, detailed tagging solutions for such phenomena are rarely reported. We discuss the usefulness and limitations of external points of reference for POS-tagging VOICE and address methodological as well as practical issues, especially the handling of non-codified language use and different types of ambiguities. We suggest that the solutions found, and the theoretical approach adopted, could be relevant for the tagging of other spoken corpora.
Keywords: POS-tagging standards, spoken corpora, English as a lingua franca (ELF), ambiguities, non-codified language use
Published online: 29 September 2016