The Core Metadata Schema for Learner Corpora (LC-meta): Collaborative efforts to advance data discoverability, metadata quality and study comparability in L2 research

Paquot, Magali; König, Alexander; Stemle, Egon W.; Frey, Jennifer-Carmen

doi:10.1075/ijlcr.24010.paq

Article in: International Journal of Learner Corpus Research: Online-First Articles

Magali Paquot | UCLouvain

Alexander König | CLARIN ERIC

Egon W. Stemle | Eurac Research

Jennifer-Carmen Frey | Eurac Research

Metadata is critical throughout the research process, from study design to corpus selection or compilation, interpretation of results, and cumulative research. To date, however, learner corpus research has not developed community standards or best practices for metadata collection and sharing. In this article, we present the results of a collaborative project that addresses this issue by developing a standardised metadata schema for learner corpora. We first describe the procedure used to design the schema, including the ways in which we continuously involved learner corpus researchers in the initiative. We then introduce the Core Metadata Schema for Learner Corpora (LC-meta, Version 2), which consists of a set of obligatory and optional variables that capture crucial information about L2 data (administrative details, corpus design, text-related variables, learner-related variables, annotations, annotators, and transcribers). Finally, we discuss future developments and emphasise the importance of continued maintenance and further refinement of the schema by the research community.

Keywords: Learner corpus, metadata, community standards, FAIR principles, data discoverability, corpus compilation, study comparability

Article outline

1. Introduction
2. Development of the Core Metadata Schema for Learner Corpora (LC-meta)
3. The Core Metadata Schema for Learner Corpora (LC-meta)
- 3.1 General structure of the schema: Eight interrelated components
- 3.2 Metadata elements: Design principles and key characteristics
4. Future developments
5. Conclusion
Open material badge
Acknowledgements
Notes
Author queries
References

This content is being prepared for publication; it may be subject to changes.
