The University of Pittsburgh English Language Institute Corpus (PELIC)
This report introduces the University of Pittsburgh English Language Institute Corpus (PELIC; Juffs et al., 2020), a publicly available 4.2-million-word learner corpus of written texts. Collected over seven years in the University of Pittsburgh’s Intensive English Program, these texts were produced by more than 1,100 students with diverse linguistic backgrounds and proficiency levels. Unlike most learner corpora, which are cross-sectional, PELIC is longitudinal, offering greater opportunities for tracking development in a natural classroom setting. This potential is illustrated in an overview of the research conducted to date with these data. The report also provides a description of PELIC’s creation and contents, including how the texts have been processed to facilitate natural language processing. Overall, the corpus contributes to the field of learner corpus research by adding to the pool of freely and publicly available learner corpora, supplemented by a set of Python tools and tutorials for accessing these data.
Article outline
- 1. Introduction
- 2. Corpus description
  - 2.1 PELIC background, context, and design
  - 2.2 Corpus size
  - 2.3 Participants
  - 2.4 Corpus summary
- 3. Data collection and processing
  - 3.1 Data collection
  - 3.2 Ethical and legal concerns
  - 3.3 Data cleaning
  - 3.4 Data processing
    - 3.4.1 Tokenization
    - 3.4.2 Part-of-speech tagging and lemmatization
- 4. Additional resources
  - 4.1 Tutorials
    - 4.1.1 Corpus compilation
    - 4.1.2 Exploratory data analysis (EDA)
    - 4.1.3 Concordancing tutorial
  - 4.2 Pitt ELI toolkit (PELITK)
    - 4.2.1 Concordancing package
    - 4.2.2 Lexical proficiency
  - 4.3 PELIC spelling
- 5. Current PELIC research
- 6. Future developments
- 7. Conclusion
- Acknowledgements
- Notes
- References
https://doi.org/10.1075/ijlcr.21002.nai