Engaging with bad (meta)data in historical corpus linguistics
In this chapter, we discuss some common pitfalls related
to historical data and its use in linguistic analysis. We argue that the
“philologist’s dilemma”, as originally proposed by Rissanen (1989), should be reconceptualized to meet
the needs of the fast-evolving field of corpus linguistics, where scholars
make increasing use of big-data resources and sophisticated statistical
modelling. Drawing on examples of errors and uncertainties in areas such as
corpus metadata, sampling, balance, and OCR accuracy, we argue that
corpus linguists should pay increasingly close attention to the sampling and
annotation principles employed in the compilation of historical corpora as
well as to the quality of the linguistic data. We propose that the principle
of “knowing one’s corpus” in terms of its compilation principles has become
all the more important in the age of big-data corpora, where it is not
feasible for individual researchers, or corpus compilers, to validate their
data manually.
Article outline
- 1. Introduction
- 2. POS annotation in diachronic datasets
- 2.1 Accounting for category change
- 2.2 Theoretical choices in the design of the annotation scheme
- 2.3 Annotation tailored to specific research questions
- 3. Large corpora
- 3.1 Inaccuracies in text sampling
- 3.2 Changes in the balance of subgenres
- 4. Historical databases
- 4.1 Issues with balance and metadata
- 4.2 OCR errors
- 4.2.1 Hapax legomena
- 4.2.2 Historical lexis
- 5. Discussion and conclusion
- Acknowledgements
- Notes
- References