Exploring big educational learner corpora for SLA research
Perspectives on relative clauses
We consider the opportunities presented by big educational learner corpora for Second Language Acquisition (SLA) research. In particular, we focus on the EF Cambridge Open Language Database (EFCAMDAT), an open-access database of student writings submitted to Englishtown, the online school of EF Education First. EFCAMDAT stands out for its size (33 million words, 85 thousand learners) and its range of 128 writing tasks covering all CEFR levels, with data from learners of various nationalities. We discuss methodological issues arising from the analysis of big data resources generated in educational contexts and argue that Natural Language Processing (NLP) is essential for the automated processing of such datasets. As a case study, we follow the developmental trajectory of relative clauses, a construction that requires deeper syntactic analysis. We consider specific issues that can affect the developmental trajectory, including task effects, formulaic language and national language effects.
Keywords: relative clauses, natural language processing for learner language, formulaic sequences, big data, educational learner corpus
Published online: 23 March 2015