Open Corpus Linguistics – or How to overcome common problems in dealing
with corpus data by adopting open research practices
In recent years, many researchers have called attention
to the fact that research results very often cannot be replicated – a
phenomenon known as the replication crisis. The
replication crisis in linguistics is highly relevant to corpus-based
research: Many corpus studies are not directly replicable as the data on
which they are based are not readily available. Especially in English
linguistics, the full versions of many widely used corpora are still behind
paywalls, which means that they are not accessible to parts of the global
research community. Even when parts of the data are freely accessible,
such partial access presents problems for state-of-the-art methods of data analysis. In
this paper, I discuss the challenges that have led to this situation and
address some possible solutions. In particular, I argue for using smaller
but openly available corpora whenever possible and for adopting open
research practices as far as possible even when using commercial
corpora.
Article outline
- 1. Introduction
- 2. Revisiting Rissanen’s problems
- 3. Open Corpus Linguistics: Perspectives and challenges
- 4. Conclusion: Open Corpus Linguistics in practice
- Acknowledgements
- Notes
- References