Accounting for ELF: Categorising the unconventional in POS-tagging the VOICE corpus

Osimk-Teasdale, Ruth; Dorn, Nora

doi:10.1075/ijcl.21.3.04osi

Article published in:

Compilation, transcription, markup and annotation of spoken corpora
Edited by John M. Kirk and Gisle Andersen
[International Journal of Corpus Linguistics 21:3] 2016
pp. 372–395

Ruth Osimk-Teasdale | University of Vienna

Nora Dorn

This paper reports on some issues encountered when using various ‘external points of reference’ in the development of POS-tagging guidelines for the Vienna-Oxford International Corpus of English (VOICE). VOICE is a corpus of spoken English as a Lingua Franca (ELF) containing naturally occurring, plurilingual data. As in all kinds of natural language use, speakers recorded in VOICE exploit available linguistic resources, often resulting in non-codified language use and language which is difficult to classify unambiguously. However, detailed tagging solutions for such phenomena are rarely reported. We discuss the usefulness and limitations of external points of reference for POS-tagging VOICE and address methodological as well as practical issues, especially the handling of non-codified language use and different types of ambiguity. We suggest that the solutions found, and the theoretical approach adopted, could be relevant for the tagging of other spoken corpora.

Keywords: POS-tagging standards, spoken corpora, English as a lingua franca (ELF), ambiguities, non-codified language use

Published online: 29 September 2016

https://doi.org/10.1075/ijcl.21.3.04osi
