Chapter 4
PoS-tagging a Spanish oral learner corpus
Criteria, procedure, and a sample analysis
This chapter explains the methodology followed to part-of-speech (PoS) tag the Spanish oral learner corpus CORELE (Corpus Oral de Español como Lengua Extranjera; Campillos Llanos 2014). The data consist of forty interviews with lower-intermediate learners from more than nine mother tongue (L1) backgrounds, plus four interviews with native speakers, who serve as a control group. The annotation was performed with the GRAMPAL tagger (Moreno & Guirao 2006). The learner corpus amounts to 52,759 lexical units (LUs) and the native corpus to 8,643 LUs. The corpus interface is available online and allows the user to explore learners’ interlanguage by searching the data by word form, lemma, L1, and/or proficiency level. I present a sample study of learners’ production of articles following the Contrastive Interlanguage Analysis approach (Granger 1996).
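As a rough illustration of the kind of query and comparison described above, the following Python sketch filters PoS-tagged tokens by word form, lemma, L1, and/or proficiency level, and normalises a raw count per 1,000 LUs so that the learner and native subcorpora can be compared despite their different sizes. Only the two corpus sizes come from the chapter. The `Token` fields, the tag labels, the level labels, and the toy tokens are hypothetical and do not reproduce the GRAMPAL tagset, the CORELE data, or the actual interface.

```python
from dataclasses import dataclass

# Corpus sizes in lexical units (LUs), as reported in the abstract.
LEARNER_LUS = 52_759
NATIVE_LUS = 8_643


@dataclass(frozen=True)
class Token:
    """One tagged lexical unit (hypothetical record layout)."""
    form: str
    lemma: str
    pos: str    # illustrative tag label, not the GRAMPAL tagset
    l1: str     # speaker's mother tongue
    level: str  # illustrative proficiency label


def search(tokens, *, form=None, lemma=None, l1=None, level=None):
    """Return the tokens that match every criterion that was supplied."""
    criteria = {"form": form, "lemma": lemma, "l1": l1, "level": level}
    active = {key: value for key, value in criteria.items() if value is not None}
    return [
        tok for tok in tokens
        if all(getattr(tok, key) == value for key, value in active.items())
    ]


def per_thousand(count, corpus_size):
    """Normalise a raw count to a frequency per 1,000 LUs."""
    return 1000 * count / corpus_size


# Toy tokens, invented for illustration only.
toy = [
    Token("la", "el", "ART", "Chinese", "B1"),
    Token("casa", "casa", "N", "Chinese", "B1"),
    Token("los", "el", "ART", "French", "A2"),
    Token("un", "un", "ART", "French", "A2"),
]

definite_articles = search(toy, lemma="el")
chinese_definite = search(toy, lemma="el", l1="Chinese")
```

In a contrastive comparison of this kind, the count of article tokens retrieved from each subcorpus would be passed to `per_thousand` with `LEARNER_LUS` or `NATIVE_LUS` respectively, so that overuse or underuse is judged on relative rather than raw frequencies.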
Article outline
- 1. Introduction
- 2. A brief overview of previous work
- 2.1 Part-of-Speech tagging learner corpora
- 2.2 Studies on articles in learner Spanish
- 3. Methodology
- 3.1 Corpus data
- 3.2 Part-of-Speech (PoS) tagging
- 3.3 Count of lexical units
- 3.4 The corpus interface
- 4. A sample analysis of learners’ production of Spanish articles
- 5. Discussion
- 6. Conclusions
- Acknowledgments
- Notes
- References
References (84)
Aarts, J. & Granger, S. 1998. Tag sequences in learner corpora: A key to interlanguage grammar and discourse. In Learner English on Computer, S. Granger (ed.), 132–141. London: Addison Wesley Longman.
Bickerton, D. 1981. Roots of Language. Ann Arbor MI: Karoma Press.
Bley-Vroman, R. 1983. The comparative fallacy in interlanguage studies: The case of systematicity. Language Learning 33: 1–17.
Brucart, J.M. 2012. La adquisición del artículo: Flujo informativo y cohesión discursiva. Presentation held at the XI Encuentro de Profesores de ELE, Barcelona, 21 December 2012. <[URL]> (14 April 2015).
Campillos Llanos, L. 2012a. Designing a search interface for a Spanish learner spoken corpus: the end-user’s evaluation. In Proc. of LREC 2012, 23–25 May 2012, Istanbul (Turkey), N. Calzolari, K. Choukri, T. Declerck, M. Uğur Doğan, B. Maegaard, J. Mariani, J. Odijk & S. Piperidis (eds), 241–248. Paris: ELRA.
Campillos Llanos, L. 2012b. La expresión oral en español como lengua extranjera: interlengua y análisis de errores basado en corpus. Unpublished PhD dissertation, Universidad Autónoma de Madrid.
Campillos Llanos, L. 2014. A Spanish learner oral corpus for computer aided error analysis. Corpora 9 (2): 207–238.
Corder, P. 1971. Idiosyncratic dialects and error analysis. International Review of Applied Linguistics in Language Teaching 9 (2): 147–160.
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge: CUP.
Dagneaux, E., Denness, S. & Granger, S. 1998. Computer-aided error analysis. System 26 (2): 163–174.
Díaz-Negrillo, A., Meurers, D., Valera, S. & Wunsch, H. 2010. Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum 36 (1–2): 139–154. Special Issue on Corpus Linguistics for Teaching and Learning. In Honour of John Sinclair, M. Moreno Jaén & C. Pérez Basanta (eds).
Díaz-Negrillo, A. & Thompson, P. 2013. Learner corpora: Looking towards the future. In Automatic Treatment and Analysis of Learner Corpus Data [Studies in Corpus Linguistics 59], A. Díaz-Negrillo, N. Ballier & P. Thompson (eds), 9–30. Amsterdam: John Benjamins.
Dickinson, M. & Ragheb, M. 2009. Dependency annotation for learner corpora. In Proc. of the 8th International Workshop on Treebanks and Linguistic Theories (TLT-8), M. Passarotti, A. Przepiórkowski, S. Raynaud & F. Van Eynde (eds), 59–70.
Dickinson, M. & Ragheb, M. 2011. Dependency annotation of coordination for learner language. In Proc. of the International Conference on Dependency Linguistics (Depling 2011), 5–7 September 2011, Barcelona (Spain), 135–144.
Díez-Bedmar, M.B. & Papp, S. 2008. The use of the English article system by Chinese and Spanish learners. In Linking up Contrastive and Learner Corpus Research, G. Gilquin, S. Papp & M.B. Díez-Bedmar (eds), 147–175. Amsterdam: Rodopi.
Dryer, M.S. 2013a. Definite articles. In The World Atlas of Language Structures Online, M.S. Dryer & M. Haspelmath (eds). Leipzig: Max Planck Institute for Evolutionary Anthropology. <[URL]>
Dryer, M.S. 2013b. Indefinite articles. In The World Atlas of Language Structures Online, M.S. Dryer & M. Haspelmath (eds). Leipzig: Max Planck Institute for Evolutionary Anthropology. <[URL]>
Fernández, S. 1990. Análisis de errores e interlengua en el aprendizaje del español como lengua extranjera. PhD dissertation, Universidad Complutense. Published as: Interlengua y análisis de errores en el aprendizaje del español como lengua extranjera. (1997). Madrid: Edelsa.
Fitzpatrick, E. & Seegmiller, M.S. 2004. The Montclair Electronic Language Database Project. In Applied Corpus Linguistics: A Multidimensional Perspective, U. Connor & T.A. Upton (eds), 223–237. Amsterdam: Rodopi.
Gaillat, T. 2013. This and that in native and learner English: From typology of use to tagset characterisation. In Twenty Years of Learner Corpus Research: Looking Back, Moving Ahead [Corpora and Language in Use 1], S. Granger, G. Gilquin & F. Meunier (eds), 167–177. Louvain-la-Neuve: Presses universitaires de Louvain.
Gaillat, T., Sébillot, P. & Ballier, N. 2014. Automated classification of unexpected uses of this and that in a learner corpus of English. In Recent Advances in Corpus Linguistics: Developing and Exploiting Corpora, L. Vandelanotte, K. Davidse, C. Gentens & D. Kimps (eds), 309–324. Amsterdam: Rodopi.
Godenzzi, J.C. 1995. The Spanish language in contact with Quechua and Aymara: The use of the article. In Spanish in Four Continents: Studies in Language Contact and Bilingualism, C. Silva-Corvalán (ed.), 101–116. Washington DC: Georgetown University Press.
Goitia, L. 2007. Un estudio del uso del artículo definido por parte de estudiantes estadounidenses de español como lengua extranjera mediante un inventario de frases correctas e incorrectas. Interlingüística 17: 409–418.
Goldsmith, J. 2007. Probability for linguists. Mathématiques et Sciences Humaines. Mathematics and Social Sciences 180: 73–98.
Granger, S. 1996. From CA to CIA and back. An integrated approach to computerized bilingual and learner corpora. In Languages in Contrast, K. Aijmer, B. Altenberg & M. Johansson (eds), 37–51. Lund: Lund University Press.
Granger, S. & Rayson, P. 1998. Automatic profiling of learner texts. In Learner English on Computer, S. Granger (ed.), 119–131. London: Addison Wesley Longman.
Granger, S. 2004. Computer learner corpus research: Current status and future prospects. In Applied Corpus Linguistics. A Multidimensional Perspective, U. Connor & T.A. Upton (eds), 123–145. Amsterdam: Rodopi.
Granger, S., Kraif, O., Ponton, C., Antoniadis, G. & Zampa, V. 2007. Integrating learner corpora and natural language processing: A crucial step towards reconciling technological sophistication and pedagogical effectiveness. ReCALL 19: 252–268.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. 2009. The International Corpus of Learner English, Version 2. Handbook and CD-ROM. Louvain-la-Neuve: Presses universitaires de Louvain.
Hasselgård, H. & Johansson, S. 2011. Learner corpora and contrastive interlanguage analysis. In A Taste for Corpora. In Honour of Sylviane Granger [Studies in Corpus Linguistics 45], F. Meunier, S. De Cock, G. Gilquin & M. Paquot (eds), 33–61. Amsterdam: John Benjamins.
Hawkins, J.A. 1978. Definiteness and Indefiniteness: A Study in Reference and Grammaticality Prediction. London: Croom Helm.
Hawkins, J.A. 1991. On (in)definite articles: Implicatures and (un)grammaticality prediction. Journal of Linguistics 27: 405–442.
Hirschmann, H., Lüdeling, A., Rehbein, I., Reznicek, M. & Zeldes, A. 2013. Underuse of syntactic categories in Falko. A case study on modification. In Twenty Years of Learner Corpus Research: Looking Back, Moving Ahead, S. Granger, G. Gilquin & F. Meunier (eds), 223–234. Louvain-la-Neuve: Presses universitaires de Louvain.
Huebner, T. 1983. A Longitudinal Analysis of the Acquisition of English. Ann Arbor MI: Karoma.
Ionin, T. 2003. Article Semantics in Second Language Acquisition. PhD dissertation, MIT.
Jarvis, S. 2000. Methodological rigor in the study of transfer: Identifying L1 influence in the interlanguage lexicon. Language Learning 50 (2): 245–309.
Jie, S. 2012. El artículo en la enseñanza de ELE. Estudiantes de origen chino. PhD dissertation, Universidad de Barcelona.
Krivanek, J. & Meurers, D. 2011. Comparing rule-based and data-driven dependency parsing of learner language. In Proc. of the International Conference on Dependency Linguistics (Depling 2011), 5–7 September 2011, Barcelona (Spain), 310–317.
Laca, B. 1999. Presencia y ausencia de determinante. In Gramática descriptiva de la lengua Española, Vol. 1, I. Bosque & V. Demonte (eds), 891–928. Madrid: Espasa Calpe, S.A.
Leonetti, M. 1999. El artículo. In Gramática descriptiva de la lengua Española, Vol. 1, I. Bosque & V. Demonte (eds), 787–890. Madrid: Espasa Calpe, S.A.
Li, C.N. & Thompson, S.A. 1990. Chinese. In The World’s Major Languages, Chapter 41, B. Comrie (ed.), 811–833. Oxford: OUP.
Lin, T-J. 2005. La adquisición y el uso del artículo por alumnos chinos. PhD dissertation, Universidad de Alcalá.
Lu, H.-C. 1997. El uso del artículo en español: Errores e implicaciones pedagógicas. In Actas del VIII Congreso Internacional de ASELE, K. Alonso, F.M. Fernández & M.G. Bürmann (eds), 519–525. Alcalá de Henares: Universidad de Alcalá.
Lu, H.-C. & Hsueh, L.L. 2012. Estudio del uso del artículo a partir de un corpus paralelo de aprendices, CPATEI. Revista de Lingüística y Lenguas Aplicadas 7: 193–202.
Lüdeling, A., Zeldes, A., Reznicek, M., Rehbein, I. & Hirschmann, H. 2010. Syntactic misuse, overuse and underuse: A study of a parsed learner corpus and its target hypothesis. In Proc. of the 9th International Workshop on Treebanks and Linguistic Theories, 3–4 December 2010, University of Tartu (Estonia), M. Dickinson, K. Müürisep & M. Passarotti (eds), 1–4.
MacWhinney, B. 2000. The CHILDES Project: Tools for Analyzing Talk, 3rd edn. Mahwah NJ: Lawrence Erlbaum Associates.
McEnery, T., Xiao, R. & Tono, Y. 2006. Corpus-based Language Studies. An Advanced Resource Book. London: Routledge.
Mendikoetxea, A. 2013. Corpus-based research in second language Spanish. In The Handbook of Spanish Second Language Acquisition, K.L. Geeslin (ed.), 11–29. Hoboken NJ: Wiley-Blackwell.
Meurers, D. 2015. Learner corpora and natural language processing. In The Cambridge Handbook of Learner Corpus Research, S. Granger, G. Gilquin & F. Meunier (eds). Cambridge: CUP.
Milton, J. & Tsang, E. 1993. A corpus-based study of logical connectors in EFL students’ writing: Directions for future research. In Studies in Lexis. Proc. of a Seminar on Lexis Organized by the Language Centre of the HKUST, 6–7 July 1992, Hong Kong, R. Pemberton & E. Tsang (eds), 215–246. Hong Kong: Language Centre, HKUST.
Mitchell, R., Domínguez, L., Arche, M.J., Myles, F. & Marsden, E. 2008. SPLLOC: A new database for Spanish second language acquisition research. In EUROSLA Yearbook 8, L. Roberts, F. Myles & A. David (eds), 287–304. Amsterdam: John Benjamins.
de Mönnink, I. 2000. Parsing a learner corpus? In Corpus Linguistics and Linguistic Theory, C. Mair & M. Hundt (eds), 81–90. Amsterdam: Rodopi.
Morimoto, Y. 2011. El artículo en español. Madrid: Castalia.
Myles, F. 2005. Interlanguage corpora and second language acquisition research. Second Language Research 21 (4): 373–391.
Ott, N. & Ziai, R. 2010. Evaluating dependency parsing performance on German learner language. In Proc. of the 9th International Workshop on Treebanks and Linguistic Theories, 3–4 December 2010, University of Tartu (Estonia), M. Dickinson, K. Müürisep & M. Passarotti (eds), 175–186.
Pęzik, P. 2012. Towards the PELCRA Learner English Corpus. In Corpus Data across Languages and Disciplines [Lodz Studies in Language 28], P. Pęzik (ed.), 33–42. Frankfurt: Peter Lang.
Ragheb, M. & Dickinson, M. 2011. Avoiding the comparative fallacy in the annotation of learner corpora. In Selected Proceedings of the 2010 Second Language Research Forum, G. Granena, J. Koeth, S. Lee-Ellis, A. Lukyanchenko, G. Prieto Botana & E. Rhoades (eds), 114–124. Somerville MA: Cascadilla Proceedings Project.
Ramírez-Mayberry, M. 1998. The acquisition of the Spanish definite articles by English-speaking learners of Spanish. Texas Papers on Foreign Language Education 3 (5): 1–57.
Rastelli, S. 2006. ISA 0.9 – Written Italian of Americans: Syntactic and semantic tagging of verbs in a learner corpus. Studi Italiani di Linguistica Teorica e Applicata (SILTA) 1: 73–100.
Rastelli, S. 2009. Learner corpora without error tagging. Linguistik Online 38 (2). <[URL]>
Reznicek, M., Lüdeling, A. & Hirschmann, H. 2013. Competing target hypotheses in the Falko corpus. In Automatic Treatment and Analysis of Learner Corpus Data [Studies in Corpus Linguistics 59], A. Díaz-Negrillo, N. Ballier & P. Thompson (eds), 101–123. Amsterdam: John Benjamins.
van Rooy, B. & Schäfer, L. 2002. The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies 20 (4): 325–335.
Rosen, A., Hana, J., Štindlová, B., Škodová, S. & Feldman, A. 2014. Evaluating and automating the annotation of a learner corpus. Language Resources and Evaluation 48: 65–92.
Rosén, V. & De Smedt, K. 2010. Syntactic annotation of learner corpora. In Systematisk, variert, men ikke tilfeldig, H. Johansen, A. Golden, J.E. Hagen & A.-K. Helland (eds), 120–132. Oslo: Novus forlag.
Said-Mohand, A. 2007. La adquisición del artículo definido: Evidencia oral y escrita. RedELE 10: 1–15.
Santos, I. 1991. La enseñanza de segundas lenguas. Análisis de errores en la expresión escrita de estudiantes de español cuya lengua nativa es el serbo-croata. PhD dissertation, Universidad Complutense, Madrid.
Scott, M. 2012. WordSmith Tools. Liverpool: Lexical Analysis Software.
Seco, M., Andrés, O. & Ramos, G. 1999. Diccionario del español actual. Madrid: Aguilar.
Snape, N. 2009. Exploring Mandarin Chinese speakers’ L2 article use. In Representational Deficits in SLA: Studies in Honor of Roger Hawkins [Language Acquisition and Language Disorders 47], N. Snape, Y-K. I. Leung & M. S. Smith (eds), 27–52. Amsterdam: John Benjamins.
Tarrés, I. 2002. El uso del artículo por estudiantes polacos de ELE. MA dissertation, Universidad de Barcelona. <[URL]>
Tenfjord, K., Meurer, P. & Hofland, K. 2006. The ASK corpus: A language learner corpus of Norwegian as a second language. In Proc. of the 5th International Language Resources and Evaluation Conference, 22–28 May, Genova (Italy), 1821–1824.
Thouësny, S. 2011. Increasing the reliability of a part-of-speech tagging tool for use with learner language. In Proc. from Pre-conference (AALL’09) Workshop on Automatic Analysis of Learner Language, Arizona State University, Tempe, AZ.
Tono, Y. 2000. A corpus-based analysis of interlanguage development: analysing POS tag sequences of EFL learner corpora. In Practical Applications in Language Corpora, B. Lewandowska-Tomaszczyk & P.J. Melia (eds), 323–343. Frankfurt: Peter Lang.
Tono, Y. 2002. The Role of Learner Corpora in SLA Research and Foreign Language Teaching: The Multiple Comparison Approach. PhD dissertation, University of Lancaster.
Valverde, M.P. & Ohtani, A. 2014. Annotating article errors in Spanish learner texts: design and evaluation of an annotation scheme. In Proc. of the 28th Pacific Asia Conference on Language, Information and Computation (PACLIC), 12–14 December 2014, Phuket (Thailand), W. Aroonmanakun, T. Supnithi & P. Boonkwan (eds), 234–243.
Vázquez, G. 1991. Análisis de errores y aprendizaje de español/lengua extranjera [Studia Romanica et Linguistica 25]. Frankfurt: Peter Lang.
Zeldes, A., Ritz, J., Lüdeling, A. & Chiarcos, C. 2009. ANNIS: A search tool for multi-layer annotated corpora. In Proc. of the 5th International Corpus Linguistics Conference 2009, 20–23 July 2009, Liverpool (United Kingdom), M. Mahlberg, V. González-Díaz & C. Smith (eds). Liverpool: University of Liverpool.
Cited by (3)
Bonilla, Johnatan E. 2024. Spoken Spanish PoS tagging: gold standard dataset. Language Resources and Evaluation.
Minnillo, Sophia, Claudia Sánchez-Gutiérrez, Ana Ruiz-Alonso-Bartol, Emily Morgan & Carmen González Gómez
Spina, Stefania, Irene Fioravanti, Luciana Forti & Fabio Zanda. 2024. The CELI corpus: Design and linguistic annotation of a new online learner corpus. Second Language Research 40 (2): 457 ff.
This list is based on CrossRef data as of 25 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers; any errors therein should be reported to them.