Article published in: Transfer Effects in Multilingual Language Development
Edited by Hagen Peukert
[Hamburg Studies on Linguistic Diversity 4] 2015
pp. 297–321
Automated L1 identification in English learner essays and its implications for language transfer
This article focuses on automatic text classification that aims to identify the first language (L1) background of learners of English. A particular question arising in the context of automated L1 identification is whether the features that are informative for a machine learning algorithm relate to L1-specific transfer phenomena. To explore this issue, we discuss the results of a study carried out in the wake of a Native Language Identification Task. The task is based on the TOEFL11 corpus (cf. Blanchard et al. 2013), a sample of 12,100 essays written by participants in the TOEFL® test from 11 different language backgrounds (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish). We present our results in automatic L1 detection on the TOEFL11 corpus and discuss them in light of the transfer features that turned out to be particularly informative for automatic detection of L1 German and L1 Italian.
Keywords: Italian, German, English learner essays, automated L1 identification, transfer
Published online: 29 April 2015
Aharodnik, K., Chang, M., Feldman, A. & Hana, J.
2013 Automatic identification of learners’ language background based on their writing in Czech. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), Nagoya, October 2013, 1428-1436.
2011 Automatically Detecting Authors’ Native Language. MA thesis, Naval Postgraduate School, Monterey CA.
Baroni, M. & Bernardini, S.
Bestgen, Y., Granger, S. & Thewissen, J.
2012 Error pattern and automatic L1 identification. In Jarvis & Crossley (eds), Approaching Language Transfer through Text Classification, 127-153.
Blanchard, D., Tetreault, J., Higgins, D., Cahill, A. & Chodorow, M.
2013 TOEFL11: A Corpus of Non-Native English. Princeton NJ: Educational Testing Service.
Brooke, J. & Hirst, G.
2013 Using other learner corpora in the 2013 NLI Shared Task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 188-196. Atlanta GA: ACL.
Crossley, S.A. & McNamara, D.S.
2012 Detecting the first language of second language writers using automated indices of cohesion, lexical sophistication, syntactic complexity and conceptual knowledge. In Jarvis & Crossley (eds), Approaching Language Transfer through Text Classification, 106-126.
2013a Definite articles. In The World Atlas of Language Structures Online, M.S. Dryer & M. Haspelmath (eds). Leipzig: Max Planck Institute for Evolutionary Anthropology. http://wals.info/chapter/37 (19 November 2013).
2013b Order of adposition and noun phrase. In The World Atlas of Language Structures Online, M.S. Dryer & M. Haspelmath (eds). Leipzig: Max Planck Institute for Evolutionary Anthropology. http://wals.info/chapter/85 (19 November 2013).
Estival, D., Gaustad, T., Pham, S.B., Radford, W. & Hutchinson, B.
2007 Author profiling for English emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007), Melbourne, Australia, 31-39.
Gebre, B.G., Zampieri, M., Wittenburg, P. & Heskes, T.
2013 Improving native language identification with TF-IDF weighting. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 216-223. Atlanta GA: ACL.
Golcher, F. & Reznicek, M.
2011 Stylometry and the interplay of topic and L1 in the different annotation layers in the Falko corpus. In Proceedings of Quantitative Investigations in Theoretical Linguistics 4, A. Zeldes & A. Lüdeling (eds), 29-34. Berlin: Humboldt University.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M.
2009 The International Corpus of Learner English. Handbook and CD-ROM, Version 2. Louvain-la-Neuve: Presses Universitaires de Louvain.
Jarvis, S. & Pavlenko, A.
2008 Crosslinguistic Influence in Language and Cognition. New York NY: Routledge.
2012 The detection-based approach: An overview. In Jarvis & Crossley (eds), Approaching Language Transfer through Text Classification, 1-33.
Jarvis, S. & Crossley, S.A.
(eds) 2012 Approaching Language Transfer through Text Classification. Bristol: Multilingual Matters.
Jarvis, S. & Paquot, M.
2012 Exploring the role of n-grams in L1 identification. In Jarvis & Crossley (eds), Approaching Language Transfer through Text Classification, 71-105.
Jarvis, S., Castañeda-Jiménez, G. & Nielsen, R.
2012 Detecting L2 writers’ L1 on the basis of their lexical styles. In Jarvis & Crossley (eds), Approaching Language Transfer through Text Classification, 34-70.
Jarvis, S., Bestgen, Y. & Pepper, S.
2013 Maximizing classification accuracy in Native Language Identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 111-118. Atlanta GA: ACL.
Koppel, M., Schler, J. & Zigdon, K.
Mayfield Tomokiyo, L. & Jones, R.
2001 You’re not from ‘round here, are you? Naïve Bayes detection of non-native utterance text. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL ’01). Cambridge MA: ACL.
2004 Discriminant Analysis and Statistical Pattern Recognition. Hoboken NJ: Wiley.
Pedregosa, F., Varoquaux, G., et al.
2011 Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12: 2825-2830.
2004 The many ways to search for a frog: Linguistic typology & the expression of motion events. In Relating Events in Narrative, Vol. 2: Typological and Contextual Perspectives, S. Strömqvist & L. Verhoeven (eds), 219-257. Mahwah NJ: Lawrence Erlbaum Associates.
2000 Toward a Cognitive Semantics, Vol. 2: Typology and Process in Concept Structuring. Cambridge MA: The MIT Press.
Tetreault, J., Blanchard, D. & Cahill, A.
2013 A report on the first Native Language Identification shared task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 48-57. Atlanta GA: ACL.
Tsur, O. & Rappoport, A.
Van Halteren, H.
Wong, S.-M.J. & Dras, M.
2009 Contrastive analysis and native language identification. In Proceedings of the Australasian Language Technology Association, 53-61. Cambridge MA: ACL.
2011 Exploiting parse structures for Native Language Identification. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 1600-1610. Edinburgh.
Wu, C.-Y., Lai, P.-H., Liu, Y. & Ng, V.
2013 Simple yet powerful Native Language Identification on TOEFL11. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 152-156. Atlanta GA: ACL.
Yannakoudakis, H., Briscoe, T. & Medlock, B.
2011 A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 180-189. Portland OR: ACL.