Recent methodological advances have been used to create word lists based on large corpora. The present paper explores whether these corpora, and the associated lists, are unequivocally more representative. Corpus design considerations have usually focused on external representativeness (how well the corpus represents the target discourse domain) while disregarding internal representativeness (whether the corpus permits reliable descriptions of linguistic variation). This disregard may be especially problematic for studies of lexical variation, where it is difficult to achieve stable, reliable results from corpus analysis. The present paper illustrates these challenges through experiments based on analysis of a corpus representing a highly restricted discourse domain: university-level introductory psychology textbooks. The results indicate that corpus design and composition have a much greater influence on lexical variation than previously recognized, highlighting the need to evaluate internal representativeness in quantitative corpus-based research.