Natural language processing for learner corpus research

Kyle, Kristopher

doi:10.1075/ijlcr.00019.int

Introduction published in:

Natural language processing for learner corpus research
Edited by Kristopher Kyle
[International Journal of Learner Corpus Research 7:1] 2021
pp. 1–16

Kristopher Kyle | University of Oregon | Yonsei University

Article outline

1. Introduction to NLP
- The role of training corpora in NLP
- Tokenization
- Lemmatization
- Part of speech annotation
- Constituency parse annotation
- Dependency relation annotation
2. Some specific challenges for calculating accuracy in LCR research
3. The present issue
Notes
References

This article is available free of charge.

Published online: 1 March 2021

