Down-sampling from hierarchically structured corpus data
Resource constraints often force researchers to downsize the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: year, gender, genre, frequency, and phonological context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 subsamples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.
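The two selection schemes named above can be illustrated with a minimal sketch. It is not the implementation used in the study. The token representation (dictionaries with hypothetical `id`, `author`, and `verb` keys) is an assumption, and so is the exact weight, taken here as the reciprocal of the product of the author-specific and verb-specific token counts, which is only one possible reading of "inversely proportional to the author- and verb-specific token count". Weighted selection without replacement is done with exponential sort keys, a standard technique swapped in for illustration.

```python
import random
from collections import Counter


def simple_downsample(tokens, k, rng):
    """Simple down-sampling: every token has the same selection probability."""
    return rng.sample(tokens, k)


def structured_downsample(tokens, k, rng):
    """Structured down-sampling (illustrative reading, see assumptions below).

    Tokens from prolific authors and frequent verbs get smaller draw weights,
    so the subsample spreads more evenly across authors and lexemes.
    """
    author_n = Counter(t["author"] for t in tokens)
    verb_n = Counter(t["verb"] for t in tokens)

    keyed = []
    for t in tokens:
        # ASSUMPTION: weight = 1 / (author token count * verb token count).
        weight = 1.0 / (author_n[t["author"]] * verb_n[t["verb"]])
        # Exponential sort key with rate = weight; the k smallest keys form a
        # weighted sample without replacement (larger weight -> smaller key).
        keyed.append((rng.expovariate(weight), t["id"], t))

    keyed.sort(key=lambda item: (item[0], item[1]))
    return [t for _, _, t in keyed[:k]]
```

In a repeated-sampling evaluation such as the one described, each function would be called many times with a seeded `random.Random` instance, and the regression model would be refit to each subsample. Note that under sampling without replacement the realized inclusion probabilities are only approximately proportional to the draw weights.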
Article outline
- 1. Introduction
- 2. Down-sampling in corpus-based work
- 2.1 Down-sampling designs: Design features and terminology
- 2.2 A survey of down-sampling designs in corpus-based work
- 2.3 Previous methodological work
- 3. Methodology
- 3.1 Case study and corpus data
- 3.1.1 Third-person verb inflection in Early Modern English
- 3.1.2 Data preparation
- 3.1.3 Data structure
- 3.2 Evaluation method
- 3.2.1 Implementation of down-sampling designs
- 3.2.2 The reference model
- 3.2.3 Evaluation of down-sampling designs
- 4. Results
- 5. Summary and outlook
- Acknowledgements
- Notes
- References