Corpus reports
The University of Pittsburgh English Language Institute Corpus (PELIC)
This report introduces the University of Pittsburgh English Language Institute Corpus (PELIC;
Juffs et al., 2020), a publicly available 4.2-million-word learner corpus of
written texts. Collected over seven years in the University of Pittsburgh’s Intensive English Program, these texts were produced
by more than 1,100 students with diverse linguistic backgrounds and proficiency levels. Unlike most learner corpora, which are
cross-sectional, PELIC is longitudinal, offering greater opportunities for tracking development in a natural classroom setting.
This potential is illustrated in an overview of the research conducted to date with these data. The report also provides a
description of PELIC’s creation and contents, including how the texts have been managed to facilitate natural language processing.
Overall, the corpus contributes to the field of learner corpus research by adding to the pool of freely and publicly available
learner corpora, supplemented by a useful set of Python tools and tutorials for accessing these data.
Article outline
- 1. Introduction
- 2. Corpus description
- 2.1 PELIC background, context, and design
- 2.2 Corpus size
- 2.3 Participants
- 2.4 Corpus summary
- 3. Data collection and processing
- 3.1 Data collection
- 3.2 Ethical and legal concerns
- 3.3 Data cleaning
- 3.4 Data processing
- 3.4.1 Tokenization
- 3.4.2 Part-of-speech tagging and lemmatization
- 4. Additional resources
- 4.1 Tutorials
- 4.1.1 Corpus compilation
- 4.1.2 Exploratory data analysis (EDA)
- 4.1.3 Concordancing tutorial
- 4.2 Pitt ELI toolkit (PELITK)
- 4.2.1 Concordancing package
- 4.2.2 Lexical proficiency
- 4.3 PELIC spelling
- 5. Current PELIC research
- 6. Future developments
- 7. Conclusion
- Acknowledgements
- Notes
- References
References
Alexopoulou, T., Geertzen, J., Korhonen, A., & Meurers, D.
Atkinson, K. (2019). Spell Checking Oriented Word Lists (SCOWL) (Version 2019). [URL]
Biber, D., Reppen, R., Staples, S., & Egbert, J.
Bird, S., Loper, E., & Klein, E. (2009). Natural language processing with Python. O’Reilly Media.
Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., & Chodorow, M. (2014). ETS Corpus of Non-Native Written English LDC2014T06. Linguistic Data Consortium.
Callies, M. (2015). Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 35–56). Cambridge University Press.
Centre for English Corpus Linguistics (2021a). Longitudinal Database of Learner English (LONGDALE). Université catholique de Louvain. [URL]
Centre for English Corpus Linguistics (2021b). Learner corpora around the world. Université catholique de Louvain. [URL]
Davies, M. (2008–). The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. [URL]
Dunlap, S. (2012). Orthographic quality in English as a second language (Unpublished doctoral dissertation). University of Pittsburgh.
Etaiwi, W., & Naymat, G. (2017). The impact of applying different preprocessing steps on review spam detection. Procedia Computer Science, 113(1), 273–279.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Exploring learner language through corpora: Comparing and interpreting corpus frequency information. Language Learning, 67(1), 130–154.
Garbe, W. (2020). SymSpell (Version 6.7). [URL]
Gilquin, G. (2015). From design to collection of learner corpora. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 9–34). Cambridge University Press.
Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M. (2020). The International Corpus of Learner English. Version 3. Presses universitaires de Louvain. [URL]
Honnibal, M. (2013). A good part-of-speech tagger in about 200 lines of Python. Explosion. [URL]
Juffs, A. (2020). Aspects of language development in an intensive English program. Routledge.
Juffs, A., & Han, N-R. (2019, March 12). Combining formal and usage-based theories with data science techniques in measuring the development of syntactic complexity in written production. Paper presented at the International Conference of the American Association of Applied Linguistics, Atlanta, GA.
Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set].
Leńko-Szymańska, A. (2019). Defining and assessing lexical proficiency. Routledge.
Marcus, M. P., Santorini, B., Marcinkiewicz, M. A., & Taylor, A. (1999). Treebank-3 LDC99T42 [Web Download]. Linguistic Data Consortium. [URL]
Meunier, F. (2016). Introduction to the LONGDALE Project. In E. Castello, K. Ackerley, & F. Coccetta (Eds.), Studies in learner corpus linguistics. Research and applications for foreign language teaching and assessment (pp. 123–126). Peter Lang.
Naismith, B., Han, N.-R., Juffs, A., Hill, B. L., & Zheng, D. (2018). Accurate measurement of lexical sophistication with reference to ESL learner data. In K. E. Boyer & M. Yudelson (Eds.), Proceedings of the 11th International Conference on Educational Data Mining (pp. 259–265).
Naismith, B., & Juffs, A. (2021). Finding the sweet spot: Learners’ productive knowledge of mid-frequency lexical items. Language Teaching Research.
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge University Press.
Picoral, A., Staples, S., & Reppen, R.
Rankin, T., & Schiftner, B.
Someya, Y. (1998). Someya Lemma List. [URL]
Tidball, F., & Treffers-Daller, J. (2008). Analysing lexical richness in French learner language: What frequency lists and teacher judgements can tell us about basic and advanced words. Journal of French Language Studies, 18(3), 299–313.
van Rooy, B., & Schäfer, L. (2009). The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20(4), 325–335.
Vercellotti, M. L. (2017). The development of complexity, accuracy and fluency in second language performance. Applied Linguistics, 38(1), 90–111.
Vercellotti, M. L., Juffs, A., & Naismith, B. (2021). Multiword sequences in L2 English language learners’ speech: The relationship between trigrams and lexical variety across development. System, 98, 1.