Down-sampling from hierarchically structured corpus data

Sönning, Lukas

doi:10.1075/ijcl.23079.son

Article published in: International Journal of Corpus Linguistics: Online-First Articles

Lukas Sönning | University of Bamberg

Resource constraints often force researchers to downsize the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: year, gender, genre, frequency, and phonological context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 subsamples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.
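The two selection schemes contrasted above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the toy data, the field names (`id`, `author`, `verb`), and both function names are hypothetical, and the weight `1 / (author count × verb count)` is only one possible reading of "inversely proportional to the author- and verb-specific token count". The weighted draw without replacement uses random exponential keys (the Efraimidis-Spirakis approach), which is a technique chosen here for convenience.

```python
import math
import random
from collections import Counter


def simple_downsample(tokens, k, rng):
    """Simple down-sampling: every token has the same selection probability."""
    return rng.sample(tokens, k)


def structured_downsample(tokens, k, rng):
    """Structured down-sampling (assumed weighting, see lead-in).

    Each token gets the weight 1 / (tokens by its author * tokens of its verb),
    so tokens from prolific authors and frequent verbs are less likely
    to be drawn.
    """
    author_n = Counter(t["author"] for t in tokens)
    verb_n = Counter(t["verb"] for t in tokens)

    keyed = []
    for t in tokens:
        weight = 1.0 / (author_n[t["author"]] * verb_n[t["verb"]])
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        # Exponential key with rate `weight`; the k smallest keys form a
        # weighted sample without replacement.
        keyed.append((-math.log(u) / weight, t))

    keyed.sort(key=lambda pair: pair[0])
    return [t for _, t in keyed[:k]]


def prolific(sample):
    """Count tokens contributed by the dominant author 'a0'."""
    return sum(t["author"] == "a0" for t in sample)


# Toy data (invented): one author contributes 900 tokens of one verb,
# ten further authors contribute 10 tokens each of another verb.
tokens = [{"id": i, "author": "a0", "verb": "have"} for i in range(900)]
tokens += [
    {"id": 900 + j, "author": f"a{1 + j % 10}", "verb": "say"}
    for j in range(100)
]

rng = random.Random(1)
simple = simple_downsample(tokens, 100, rng)
structured = structured_downsample(tokens, 100, rng)

print("tokens from dominant author, simple:    ", prolific(simple))
print("tokens from dominant author, structured:", prolific(structured))
```

In this toy setting the simple scheme reproduces the imbalance of the full data, while the structured scheme spreads the subsample more evenly across authors and verbs. Note that the keyed draw gives selection probabilities that follow the weights only approximately once tokens are removed without replacement.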

Keywords: down-sampling, thinning, methodology, data structure, study design

Article outline

1. Introduction
2. Down-sampling in corpus-based work
- 2.1 Down-sampling designs: Design features and terminology
- 2.2 A survey of down-sampling designs in corpus-based work
- 2.3 Previous methodological work
3. Methodology
- 3.1 Case study and corpus data
  - 3.1.1 Third-person verb inflection in Early Modern English
  - 3.1.2 Data preparation
  - 3.1.3 Data structure
- 3.2 Evaluation method
  - 3.2.1 Implementation of down-sampling designs
  - 3.2.2 The reference model
  - 3.2.3 Evaluation of down-sampling designs
4. Results
5. Summary and outlook
Acknowledgements
Notes
References

Published online: 25 March 2024

