The University of Pittsburgh English Language Institute Corpus (PELIC)

Naismith, Ben; Han, Na-Rae; Juffs, Alan

doi:10.1075/ijlcr.21002.nai

Article published in:

International Journal of Learner Corpus Research
Vol. 8:1 (2022), pp. 121–138

Corpus reports

Ben Naismith | University of Pittsburgh

Na-Rae Han | University of Pittsburgh

Alan Juffs | University of Pittsburgh

This report introduces the University of Pittsburgh English Language Institute Corpus (PELIC; Juffs et al., 2020), a publicly available 4.2-million-word learner corpus of written texts. Collected over seven years in the University of Pittsburgh’s Intensive English Program, these texts were produced by more than 1,100 students with diverse linguistic backgrounds and proficiency levels. Unlike most learner corpora, which are cross-sectional, PELIC is longitudinal, offering greater opportunities for tracking development in a natural classroom setting. This potential is illustrated in an overview of the research conducted to date with these data. The report also describes PELIC’s creation and contents, including how the texts have been managed to facilitate natural language processing. Overall, the corpus contributes to the field of learner corpus research by adding to the pool of freely and publicly available learner corpora, supplemented by a useful set of Python tools and tutorials for accessing these data.

Keywords: ESL, IEP, longitudinal development, multi-L1 corpus, PELIC
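
The kind of per-learner tracking that a longitudinal design makes possible can be sketched in a few lines of Python. This is a minimal illustration only: the records, the learner IDs, and the helper names (`type_token_ratio`, `ttr_by_learner`) are invented for the example and do not reflect PELIC's actual file layout or the authors' tools. Type-token ratio stands in here as a simple lexical measure; it is sensitive to text length, so real analyses would likely use a more robust metric.

```python
from collections import defaultdict

# Hypothetical records: (learner_id, level, text). Invented for illustration;
# they are not PELIC data and do not mirror PELIC's real structure.
records = [
    ("learner_a", 3, "I like my class. I like my teacher."),
    ("learner_a", 4, "My class discusses several interesting topics every week."),
    ("learner_b", 3, "The city is big and the city is busy."),
]


def type_token_ratio(text):
    """Return distinct word forms divided by total word forms (0.0 if empty)."""
    # Naive whitespace tokenization with punctuation stripped and case folded.
    tokens = [w.strip(".,!?;:").lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ttr_by_learner(rows):
    """Map each learner to a level-ordered list of (level, TTR) pairs."""
    by_learner = defaultdict(list)
    for learner_id, level, text in rows:
        by_learner[learner_id].append((level, type_token_ratio(text)))
    return {learner: sorted(points) for learner, points in by_learner.items()}


trajectories = ttr_by_learner(records)
for learner_id, points in trajectories.items():
    print(learner_id, points)
```

Because each learner's texts are grouped and ordered by level, the same structure could hold any other per-text measure in place of type-token ratio.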

Article outline

1. Introduction
2. Corpus description
- 2.1 PELIC background, context, and design
- 2.2 Corpus size
- 2.3 Participants
- 2.4 Corpus summary
3. Data collection and processing
- 3.1 Data collection
- 3.2 Ethical and legal concerns
- 3.3 Data cleaning
- 3.4 Data processing
  - 3.4.1 Tokenization
  - 3.4.2 Part-of-speech tagging and lemmatization
4. Additional resources
- 4.1 Tutorials
  - 4.1.1 Corpus compilation
  - 4.1.2 Exploratory data analysis (EDA)
  - 4.1.3 Concordancing tutorial
- 4.2 Pitt ELI toolkit (PELITK)
  - 4.2.1 Concordancing package
  - 4.2.2 Lexical proficiency
- 4.3 PELIC spelling
5. Current PELIC research
6. Future developments
7. Conclusion
Acknowledgements
Notes
References

Published online: 8 March 2022

References (32)

Alexopoulou, T., Geertzen, J., Korhonen, A., & Meurers, D. (2015). Exploring big educational learner corpora for SLA research: Perspectives on relative clauses. International Journal of Learner Corpus Research, 1 (1), 96–129.

Atkinson, K. (2019). Spell Checking Oriented Word Lists (SCOWL) (Version 2019). [URL]

Biber, D., Reppen, R., Staples, S., & Egbert, J. (2020). Exploring the longitudinal development of grammatical complexity in the disciplinary writing of L2-English university students. International Journal of Learner Corpus Research, 6 (1), 38–71.

Bird, S., Loper, E., & Klein, E. (2009). Natural language processing with Python. O’Reilly Media.

Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., & Chodorow, M. (2014). ETS Corpus of Non-Native Written English LDC2014T06. Linguistic Data Consortium.

Callies, M. (2015). Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 35–56). Cambridge University Press.

Centre for English Corpus Linguistics. (2021a). Longitudinal Database of Learner English (LONGDALE). Université catholique de Louvain. [URL]

Centre for English Corpus Linguistics. (2021b). Learner corpora around the world. Université catholique de Louvain. [URL]

Davies, M. (2008–). The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. [URL]

Dunlap, S. (2012). Orthographic quality in English as a second language (Unpublished doctoral dissertation). University of Pittsburgh.

Etaiwi, W., & Naymat, G. (2017). The impact of applying different preprocessing steps on review spam detection. Procedia Computer Science, 113 (1), 273–279.

Gablasova, D., Brezina, V., & McEnery, T. (2017). Exploring learner language through corpora: Comparing and interpreting corpus frequency information. Language Learning, 67 (1), 130–154.

Garbe, W. (2020). SymSpell (Version 6.7). [URL]

Gilquin, G. (2015). From design to collection of learner corpora. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 9–34). Cambridge University Press.

Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M. (2020). The International Corpus of Learner English. Version 3. Presses universitaires de Louvain. [URL]

Honnibal, M. (2013). A good part-of-speech tagger in about 200 lines of Python. Explosion. [URL]

Juffs, A. (2020). Aspects of language development in an intensive English program. Routledge.

Juffs, A., & Han, N-R. (2019, March 12). Combining formal and usage-based theories with data science techniques in measuring the development of syntactic complexity in written production. Paper presented at the International Conference of the American Association of Applied Linguistics, Atlanta, GA.

Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set].

Leńko-Szymańska, A. (2019). Defining and assessing lexical proficiency. Routledge.

Marcus, M. P., Santorini, B., Marcinkiewicz, M. A., & Taylor, A. (1999). Treebank-3 LDC99T42 [Web Download]. Linguistic Data Consortium. [URL]

Meunier, F. (2016). Introduction to the LONGDALE Project. In E. Castello, K. Ackerley, & F. Coccetta (Eds.), Studies in learner corpus linguistics. Research and applications for foreign language teaching and assessment (pp. 123–126). Peter Lang.

Naismith, B., Han, N.-R., Juffs, A., Hill, B. L., & Zheng, D. (2018). Accurate measurement of lexical sophistication with reference to ESL learner data. In K. E. Boyer & M. Yudelson (Eds.), Proceedings of the 11th International Conference on Educational Data Mining (pp. 259–265).

Naismith, B., & Juffs, A. (2021). Finding the sweet spot: Learners’ productive knowledge of mid-frequency lexical items. Language Teaching Research.

Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge University Press.

Picoral, A., Staples, S., & Reppen, R. (2021). Automated annotation of learner English. International Journal of Learner Corpus Research, 7 (1), 17–52.

Rankin, T., & Schiftner, B. (2011). Marginal prepositions in learner English: Applying local corpus data. International Journal of Corpus Linguistics, 16 (3), 412–434.

Someya, Y. (1998). Someya Lemma List. [URL]

Tidball, F., & Treffers-Daller, J. (2008). Analysing lexical richness in French learner language: what frequency lists and teacher judgements can tell us about basic and advanced words. Journal of French Language Studies, 18 (3), 299–313.

van Rooy, B., & Schäfer, L. (2009). The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20 (4), 325–335.

Vercellotti, M. L. (2017). The development of complexity, accuracy and fluency in second language performance. Applied Linguistics, 38 (1), 90–111.

Vercellotti, M. L., Juffs, A., & Naismith, B. (2021). Multiword sequences in L2 English language learners’ speech: The relationship between trigrams and lexical variety across development. System, 98 (1).

Cited by (4)

Kyle, Kristopher & Masaki Eguchi

2024. Evaluating NLP models with written and spoken L2 samples. Research Methods in Applied Linguistics 3:2, pp. 100120 ff.

Martin, Katherine I.

2024. How a Phonics-Based Intervention, L1 Orthography, and Item Characteristics Impact Adult ESL Spelling Knowledge. Education Sciences 14:4, pp. 421 ff.

Xu, Wei

2023. 2023 IEEE 4th Annual Flagship India Council International Subsections Conference (INDISCON), pp. 1 ff.

Zhao, Hui, Kexin Jin, Jing Wang & Abid Yahya

2022. Automatic Recognition and Extraction of English Verb Types Based on Index Line Clustering. Mobile Information Systems 2022, pp. 1 ff.

This list is based on CrossRef data as of 4 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.