Exploring big educational learner corpora for SLA research: Perspectives on relative clauses

Alexopoulou, Theodora; Geertzen, Jeroen; Korhonen, Anna; Meurers, Detmar

doi:10.1075/ijlcr.1.1.04ale

Article published in:

International Journal of Learner Corpus Research
Vol. 1:1 (2015), pp. 96–129

Theodora Alexopoulou | Department of Theoretical and Applied Linguistics, University of Cambridge

Jeroen Geertzen

Anna Korhonen

Detmar Meurers | Department of Linguistics, University of Tübingen

We consider the opportunities presented by big educational learner corpora for Second Language Acquisition (SLA) research. In particular, we focus on the EF Cambridge Open Language Database (EFCAMDAT), an open access database of student writings submitted to Englishtown, the online school of EF Education First. EFCAMDAT stands out for its size (33 million words, 85 thousand learners) and its range of 128 writing tasks covering all CEFR levels, with data from learners of various nationalities. We discuss methodological issues that arise when analyzing big data resources generated in educational contexts and argue that Natural Language Processing (NLP) is essential for the automated processing of such datasets. As a case study, we follow the developmental trajectory of relative clauses, a construction that requires deeper syntactic analysis. We consider specific issues that can affect the developmental trajectory, including task effects, formulaic language and national language effects.

Keywords: relative clauses, natural language processing for learner language, formulaic sequences, big data, educational learner corpus

Published online: 23 March 2015
