This paper discusses the degree to which the majority of the most widely used measures of dispersion in corpus linguistics
are not particularly valid: rather than measuring dispersion, they measure an amalgam of a lot of frequency and a
little dispersion. The paper demonstrates these issues on the basis of data from a variety of corpora. I then outline how to
design a dispersion measure that measures only dispersion and show that (i) it indeed captures information that differs from
frequency in an intuitive way and (ii) it has more predictive power for lexical decision times from the MALD database
than nearly all other measures in nearly all corpora tested.