Article published in: Transfer Effects in Multilingual Language Development
Edited by Hagen Peukert
[Hamburg Studies on Linguistic Diversity 4] 2015
pp. 297–321
Automated L1 identification in English learner essays and its implications for language transfer
This article focuses on automatic text classification that aims to identify the first language (L1) background of learners of English. A central question in automated L1 identification is whether the features that are informative for a machine learning algorithm relate to L1-specific transfer phenomena. To explore this issue, we discuss the results of a study carried out in the wake of a Native Language Identification Task. The task is based on the TOEFL11 corpus (cf. Blanchard et al. 2013), which comprises 12,100 essays written by TOEFL® test participants from 11 different language backgrounds (Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish). We report our results in automatic L1 detection on the TOEFL11 corpus and discuss them in light of the transfer features that turned out to be particularly informative for the automatic detection of L1 German and L1 Italian.
Keywords: Italian, German, English learner essays, automated L1 identification, transfer
Published online: 29 April 2015
Aharodnik, K., Chang, M., Feldman, A. & Hana, J.
2013 Automatic identification of learners’ language background based on their writing in Czech. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), Nagoya, October 2013. 1428-1436.
2011 Automatically Detecting Authors’ Native Language. MA thesis, Naval Postgraduate School, Monterey CA.
Baroni, M. & Bernardini, S.
Bestgen, Y., Granger, S. & Thewissen, J.
2012 Error pattern and automatic L1 identification. In Jarvis & Crossley (eds), Approaching Language Transfer through Text Classification, 127-153.
Blanchard, D., Tetreault, J., Higgins, D., Cahill, A. & Chodorow, M.
Brooke, J. & Hirst, G.
2013 Using other learner corpora in the 2013 NLI Shared Task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 188-196. Atlanta GA: Association for Computational Linguistics.
Crossley, S.A. & McNamara, D.S.
2013a Definite articles. In The World Atlas of Language Structures Online, M.S. Dryer & M. Haspelmath (eds). Leipzig: Max Planck Institute for Evolutionary Anthropology. http://wals.info/chapter/37 (19 November 2013).
2013b Order of adposition and noun phrase. In The World Atlas of Language Structures Online, M.S. Dryer & M. Haspelmath (eds). Leipzig: Max Planck Institute for Evolutionary Anthropology. http://wals.info/chapter/85 (19 November 2013).
Estival, D., Gaustad, T., Pham, S.B., Radford, W. & Hutchinson, B.
2007 Author profiling for English emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007), Melbourne, Australia, 31-39.
Gebre, B.G., Zampieri, M., Wittenburg, P. & Heskes, T.
2013 Improving native language identification with TF-IDF weighting. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 216-223. Atlanta GA: ACL.
Golcher, F. & Reznicek, M.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M.
Jarvis, S. & Pavlenko, A.
Jarvis, S. & Crossley, S.A.
Jarvis, S. & Paquot, M.
Jarvis, S., Castañeda-Jiménez, G. & Nielsen, R.
Jarvis, S., Bestgen, Y. & Pepper, S.
2013 Maximizing classification accuracy in Native Language Identification. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 111-118. Atlanta GA: ACL.
Koppel, M., Schler, J. & Zigdon, K.
Mayfield Tomokiyo, L. & Jones, R.
2001 You’re not from ’round here, are you? Naïve Bayes detection of non-native utterance text. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL ’01). Cambridge MA: ACL.
Pedregosa, F., Varoquaux, G., et al.
Tetreault, J., Blanchard, D. & Cahill, A.
2013 A report on the first Native Language Identification shared task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 48-57. Atlanta GA: ACL.
Tsur, O. & Rappoport, A.
Van Halteren, H.
Wong, S.-M.J. & Dras, M.
2009 Contrastive analysis and native language identification. In Proceedings of the Australasian Language Technology Association , 53-61. Cambridge MA: ACL.
2011 Exploiting parse structures for Native Language Identification. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 1600-1610. Edinburgh.
Wu, C.-Y., Lai, P.-H., Liu, Y. & Ng, V.
2013 Simple yet powerful Native Language Identification on TOEFL11. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 152-156. Atlanta GA: ACL.
Yannakoudakis, H., Briscoe, T. & Medlock, B.
2011 A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 180-189. Portland OR: ACL.