Sensitivity of dispersion measures to distributional patterns and corpus design

Sönning, Lukas; Egbert, Jesse

doi:10.1075/ijcl.25008.son

Article In: International Journal of Corpus Linguistics: Online-First Articles

Lukas Sönning | University of Bamberg

Jesse Egbert | Northern Arizona University

This content is being prepared for publication; it may be subject to changes.

Abstract

Recent work has shown that dispersion measures respond to multiple features in the data: Juilland’s D varies systematically with the number of corpus parts, and all commonly used indices are affected by the frequency of an item. This study uses a simulation approach to provide further insights into the sensitivity of dispersion measures to differences in corpus design (number of texts, average text length, distribution of text lengths) and distributional milieu (frequency and evenness of distribution). Our results suggest that, within the settings covered by our analysis, the factors frequency and evenness of distribution have roughly the same impact, though there is some variation among measures. The average text length emerges as another feature that leaves its mark on the observed scores. Finally, we note that D₂ exhibits the same weakness as D — it varies with the number of corpus parts that enter the analysis.

Keywords: dispersion, frequency, corpus design, text, methodology, construct validity
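As a rough illustration of the kind of simulation the abstract describes, the following Python sketch draws per-text counts from a negative binomial model and computes Juilland's D and Carroll's D₂ for corpora with different numbers of texts. It is not the authors' procedure: the function names, the parameter values (mean of 5 occurrences per text, shape 1), the equal text lengths, and the use of the population standard deviation in D are all assumptions made for this example. The formulas are the commonly cited ones, D = 1 − CV / √(n − 1) and D₂ as the entropy of the normalized per-text rates divided by log₂(n).

```python
import math
import random


def juilland_d(rates):
    """Juilland's D = 1 - CV / sqrt(n - 1), with the population SD of per-part rates."""
    n = len(rates)
    mean = sum(rates) / n
    if n < 2 or mean == 0:
        raise ValueError("need at least two parts and at least one occurrence")
    sd = math.sqrt(sum((r - mean) ** 2 for r in rates) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)


def carroll_d2(rates):
    """Carroll's D2: entropy of the normalized per-part rates, divided by log2(n)."""
    n = len(rates)
    total = sum(rates)
    if n < 2 or total == 0:
        raise ValueError("need at least two parts and at least one occurrence")
    shares = [r / total for r in rates if r > 0]
    entropy = -sum(q * math.log2(q) for q in shares)
    return entropy / math.log2(n)


def poisson(rng, lam):
    """Poisson draw via Knuth's multiplication method (fine for small means)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def negbin_counts(rng, n_texts, mean, shape):
    """Negative binomial counts drawn as a gamma-Poisson mixture."""
    return [poisson(rng, rng.gammavariate(shape, mean / shape))
            for _ in range(n_texts)]


def mean_scores(n_texts, reps=100, mean=5.0, shape=1.0, seed=1):
    """Average D and D2 over `reps` simulated corpora of equal-length texts.

    All defaults are illustrative assumptions, not settings from the study.
    """
    rng = random.Random(seed)
    d_sum = d2_sum = 0.0
    for _ in range(reps):
        # With equal text lengths, raw counts can stand in for rates.
        counts = negbin_counts(rng, n_texts, mean, shape)
        d_sum += juilland_d(counts)
        d2_sum += carroll_d2(counts)
    return d_sum / reps, d2_sum / reps


if __name__ == "__main__":
    for n_texts in (10, 100, 1000):
        d, d2 = mean_scores(n_texts)
        print(n_texts, round(d, 3), round(d2, 3))
```

Because the same negative binomial parameters are used for every corpus size, any systematic change in the averaged scores across the three runs comes from the number of texts alone, which is the sensitivity the abstract attributes to D and D₂.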

Article outline

1. Introduction
2. Measuring dispersion: Indices and their limitations
- 2.1 Dispersion measures
- 2.2 Unit of analysis
- 2.3 Fragility of measures: Previous work
3. Extensions of prior work: Scope of the present study
- 3.1 Average text length
- 3.2 Distribution of text lengths
- 3.3 Factors considered in the present study
4. Method
- 4.1 Simulation study: General procedure
- 4.2 Distributional parameters: Negative binomial model
- 4.3 Transformation of dispersion scores
- 4.4 Analysis of simulation results
5. Results
6. Summary of findings and implications
7. Outlook
Notes
References

References (32)

Baayen, R. H. (2001). Word frequency distributions. Springer Dordrecht.

Biber, D., Reppen, R., Schnur, E., & Ghanem, R. (2016). On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics, 21(4). 439–464.

Burch, B., Egbert, J., & Biber, D. (2017). Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science, 3(2). 189–216.

Carroll, J. B. (1970). An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour, 3(2). 61–65. [URL]

Church, K. W., & Gale, W. A. (1995). Poisson mixtures. Natural Language Engineering, 1(2). 163–190.

Davies, M. (2008). The Corpus of Contemporary American English (COCA). [URL]

Egbert, J., & Burch, B. (2023). Which words matter most? Operationalizing lexical prevalence for rank-ordered word lists. Applied Linguistics, 44(1). 103–126.

Egbert, J., Burch, B., & Biber, D. (2020). Lexical dispersion and corpus design. International Journal of Corpus Linguistics, 25(1). 89–115.

Francis, W. N., & Kučera, H. (1964). Manual of information to accompany a standard corpus of present-day edited American English for use with digital computers. Department of Linguistics, Brown University. [URL]

Greenbaum, S., & Nelson, G. (1996). The International Corpus of English (ICE) project. World Englishes, 15(1). 3–15.

Gries, S. Th. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4). 403–437.

Gries, S. Th. (2020a). Ten lectures on corpus linguistics with R: Applications for usage-based and psycholinguistic research. Brill.

Gries, S. Th. (2020b). Analyzing dispersion. In M. Paquot & S. Th. Gries (Eds.), A practical handbook of corpus linguistics (pp. 99–118). Springer.

Gries, S. Th. (2021). A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics, 9(2). 1–33.

Gries, S. Th. (2022). What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies, 5(2). 171–205.

Gries, S. Th. (2024). Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. John Benjamins.

Halvorsen, K. T. (1991). Value splitting involving more factors. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Fundamentals of exploratory analysis of variance (pp. 72–113). Wiley.

Juilland, A. G., & Chang-Rodríguez, E. (1964). Frequency dictionary of Spanish words. Mouton de Gruyter.

Katz, S. M. (1996). Distribution of content words and phrases in text and language modelling. Natural Language Engineering, 2(1). 15–59.

Keniston, H. (1920). Common words in Spanish. Hispania, 3(2). 85–96.

Long, J. S. (1997). Regression models for categorical and limited dependent variables. Sage.

Lyne, A. A. (1985). The vocabulary of French business correspondence. Slatkine-Champion.

Mosteller, F., & Wallace, D. L. (1984). Applied Bayesian inference: The case of The Federalist Papers. Springer.

R Core Team. (2022). R: A language and environment for statistical computing (Version 4.3.1). R Foundation for Statistical Computing. [URL]

Rigby, R. A., & Stasinopoulos, M. D. (2005). Generalized additive models for location, scale and shape. Applied Statistics, 54(3). 507–554.

Rosengren, I. (1971). The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série), 11. 103–127.

Sarkar, D. (2008). Lattice: Multivariate data visualization with R. Springer.

Sönning, L. (2023a). The negative binomial distribution: A visual explanation. Statistics for linguist(ic)s. [URL]

Sönning, L. (2023b). Different parameterizations of the negative binomial distribution. Statistics for linguist(ic)s. [URL]

Sönning, L. (2025). Advancing our understanding of dispersion measures in corpus research. Corpora, 20(1). 3–35.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer.

Winter, B., & Bürkner, P.-C. (2021). Poisson regression for linguists: A tutorial introduction to modelling count data with brms. Language and Linguistics Compass, 15(11). e12439.