Automated L1 identification in English learner essays and its implications for language transfer

Stemle, Egon; Onysko, Alexander

doi:10.1075/hsld.4.13ste

Part of

Transfer Effects in Multilingual Language Development
Edited by Hagen Peukert
[Hamburg Studies on Linguistic Diversity 4] 2015
pp. 297–321

Egon Stemle | EURAC Bolzano, University of Klagenfurt*

Alexander Onysko | EURAC Bolzano, University of Klagenfurt*

This article focuses on automatic text classification that aims to identify the first language (L1) background of learners of English. A particular question arising in the context of automated L1 identification is whether any of the features that are informative for a machine learning algorithm relate to L1-specific transfer phenomena. To explore this issue further, we discuss the results of a study carried out in the wake of a Native Language Identification Task. The task is based on the TOEFL11 corpus (cf. Blanchard et al. 2013), which comprises a sample of 12,100 essays written by participants in the TOEFL® test from 11 different language backgrounds (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish). The article presents our results in automatic L1 detection on the TOEFL11 corpus and discusses them in light of relevant transfer features that turned out to be particularly informative for the automatic detection of L1 German and L1 Italian.

Keywords: automated L1 identification, English learner essays, German, Italian, transfer

Published online: 29 April 2015

References

Aharodnik, K., Chang, M., Feldman, A. & Hana, J. 2013. Automatic identification of learners’ language background based on their writing in Czech. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), Nagoya, October 2013, 1428-1436.

Ahn, C.S. 2011. Automatically Detecting Authors’ Native Language. MA thesis, Naval Postgraduate School, Monterey CA.

Baroni, M. & Bernardini, S. 2006. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing 21: 259-274.

Barr, G.K. 2003. Two styles in the New Testament epistles. Literary and Linguistic Computing 18: 235-248.

Bestgen, Y., Granger, S. & Thewissen, J. 2012. Error pattern and automatic L1 identification. In Jarvis & Crossley (eds), 127-153.

Blanchard, D., Tetreault, J., Higgins, D., Cahill, A. & Chodorow, M. 2013. TOEFL11: A Corpus of Non-Native English. Princeton NJ: Educational Testing Service.

Brooke, J. & Hirst, G. 2013. Using other learner corpora in the 2013 NLI Shared Task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 188-196. Atlanta GA: ACL.

Crossley, S.A. & McNamara, D.S. 2012. Detecting the first language of second language writers using automated indices of cohesion, lexical sophistication, syntactic complexity and conceptual knowledge. In Jarvis & Crossley (eds), 106-126.

Dryer, M.S. 2013a. Definite articles. In The World Atlas of Language Structures Online, M.S. Dryer & M. Haspelmath (eds). Leipzig: Max Planck Institute for Evolutionary Anthropology. <[URL]> (19 November 2013).

Dryer, M.S. 2013b. Order of adposition and noun phrase. In The World Atlas of Language Structures Online, M.S. Dryer & M. Haspelmath (eds). Leipzig: Max Planck Institute for Evolutionary Anthropology. <[URL]> (19 November 2013).

Estival, D., Gaustad, T., Pham, S.B., Radford, W. & Hutchinson, B. 2007. Author profiling for English emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007), Melbourne, Australia, 31-39.

Gebre, B.G., Zampieri, M., Wittenburg, P. & Heskes, T. 2013. Improving native language identification with TF-IDF weighting. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 216-223. Atlanta GA: ACL.

Golcher, F. & Reznicek, M. 2011. Stylometry and the interplay of topic and L1 in the different annotation layers in the Falko corpus. In Proceedings of Quantitative Investigations in Theoretical Linguistics 4, A. Zeldes & A. Lüdeling (eds), 29-34. Berlin: Humboldt University.

Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. 2009. The International Corpus of Learner English. Handbook and CD-ROM, Version 2. Louvain-la-Neuve: Presses Universitaires de Louvain.

Jarvis, S. & Pavlenko, A. 2008. Crosslinguistic Influence in Language and Cognition. New York NY: Routledge.

Jarvis, S. 2012. The detection-based approach: An overview. In Jarvis & Crossley (eds), 1-33.

Jarvis, S. & Crossley, S.A. (eds). 2012. Approaching Language Transfer through Text Classification. Bristol: Multilingual Matters.

Jarvis, S. & Paquot, M. 2012. Exploring the role of n-grams in L1 identification. In Jarvis & Crossley (eds), 71-105.

Jarvis, S., Castañeda-Jiménez, G. & Nielsen, R. 2012. Detecting L2 writers’ L1 on the basis of their lexical styles. In Jarvis & Crossley (eds), 34-70.

Jarvis, S., Bestgen, Y. & Pepper, S. 2013. Maximizing classification accuracy in Native Language Identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 111-118. Atlanta GA: ACL.

Koppel, M., Schler, J. & Zigdon, K. 2005. Determining an author’s native language by mining a text for errors. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 624-628. Chicago IL: Association for Computing Machinery.

Mayfield Tomokiyo, L. & Jones, R. 2001. You’re not from ’round here, are you? Naïve Bayes detection of non-native utterance text. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL ’01). Cambridge MA: ACL.

McLachlan, G.J. 2004. Discriminant Analysis and Statistical Pattern Recognition. Hoboken NJ: Wiley.

Pedregosa, F., Varoquaux, G., et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12: 2825-2830.

Slobin, D. 2004. The many ways to search for a frog: Linguistic typology and the expression of motion events. In Relating Events in Narrative, Vol. 2: Typological and Contextual Perspectives, S. Strömqvist & L. Verhoeven (eds), 219-257. Mahwah NJ: Lawrence Erlbaum Associates.

Talmy, L. 2000. Toward a Cognitive Semantics, Vol. 2: Typology and Process in Concept Structuring. Cambridge MA: The MIT Press.

Taylor, B.P. 1975. The use of overgeneralisation and transfer learning strategies by elementary and intermediate students of ESL. Language Learning 25: 73-107.

Tetreault, J., Blanchard, D. & Cahill, A. 2013. A report on the first Native Language Identification shared task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 48-57. Atlanta GA: ACL.

Thomason, S. 2001. Language Contact. Edinburgh: EUP.

Tsur, O. & Rappoport, A. 2007. Using classifier features for studying the effect of native language on the choice of written second language words. In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, 9-16. Prague: ACL.

Van Halteren, H. 2008. Source language markers in EUROPARL translations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 937-944. Manchester.

Wong, S.-M.J. & Dras, M. 2009. Contrastive analysis and native language identification. In Proceedings of the Australasian Language Technology Association, 53-61. Cambridge MA: ACL.

Wong, S.-M.J. & Dras, M. 2011. Exploiting parse structures for Native Language Identification. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 1600-1610. Edinburgh.

Wu, C.-Y., Lai, P.-H., Liu, Y. & Ng, V. 2013. Simple yet powerful Native Language Identification on TOEFL11. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 152-156. Atlanta GA: ACL.

Yannakoudakis, H., Briscoe, T. & Medlock, B. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 180-189. Portland OR: ACL.