Open Corpus Linguistics – or How to overcome common problems in dealing with corpus data by adopting open research practices

Hartmann, Stefan

Part of

Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 89–106

Stefan Hartmann | Heinrich Heine University Düsseldorf

In recent years, many researchers have drawn attention to the fact that research results very often cannot be replicated, a phenomenon that has come to be known as the replication crisis. The replication crisis in linguistics is highly relevant to corpus-based research: many corpus studies are not directly replicable because the data on which they are based are not readily available. Especially in English linguistics, the full versions of many widely used corpora are still behind paywalls, which means that they are inaccessible to parts of the global research community. Even when parts of the data are freely accessible, such partial access poses problems for state-of-the-art methods of data analysis. In this paper, I discuss the challenges that have led to this situation and address some possible solutions. In particular, I argue for using smaller but openly available corpora whenever possible and for adopting open research practices as far as possible even when using commercial corpora.

Keywords: replicability, open research, accessibility, transparency, representativeness

Article outline

1. Introduction
2. Revisiting Rissanen’s problems
3. Open Corpus Linguistics: Perspectives and challenges
4. Conclusion: Open Corpus Linguistics in practice
Author queries
Acknowledgements
Notes
References

This content is being prepared for publication; it may be subject to changes.

References

Baker, Paul, Hardie, Andrew & McEnery, Tony

2006 A Glossary of Corpus Linguistics. Edinburgh: EUP.

Barbaresi, Adrien

2021 Trafilatura: A web scraping library and command-line tool for text discovery and extraction. In Proceedings of the Annual Meeting of the ACL, System Demonstrations. [URL] (25 October 2022).

Baroni, Marco, Bernardini, Silvia, Ferraresi, Adriano & Zanchetta, Eros

2009 The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43(3): 209–226.

Biber, Douglas

1993 Representativeness in corpus design. Literary and Linguistic Computing 8: 243–257.

Collister, Lauren B.

2022 Copyright and sharing linguistic data. In The Open Handbook of Linguistic Data Management, Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller & Lauren B. Collister (eds), 117–128. Cambridge MA: The MIT Press.

Dijk, Teun A. van

2005 Contextual knowledge management in discourse production: A CDA perspective. In A New Agenda in (Critical) Discourse Analysis: Theory, Methodology and Interdisciplinarity [Discourse Approaches to Politics, Society and Culture 13], Ruth Wodak & Paul A. Chilton (eds), 71–100. Amsterdam: John Benjamins.

Egbert, Jesse, Larsson, Tove & Biber, Douglas

2020 Doing Linguistics with a Corpus: Methodological Considerations for the Everyday User [Elements in Corpus Linguistics]. Cambridge: CUP.

Eve, Martin Paul

2014 Open Access and the Humanities: Contexts, Controversies and the Future. Cambridge: CUP.

Garellek, Marc, Simpson, Adrian, Roettger, Timo B., Recasens, Daniel, Niebuhr, Oliver, Mooshammer, Christine, Michaud, Alexis et al.

2020 Letter to the editor: Toward open data policies in phonetics: What we can gain and how we can avoid pitfalls. Journal of Speech Sciences 9: 3–16.

Gärtner, Markus, Kleinkopf, Felicitas, Andresen, Melanie & Hermann, Sibylle

2021 Corpus reusability and copyright – Challenges and opportunities. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-9) 2021. Limerick, 12 July 2021 (Online-Event), Harald Lüngen, Marc Kupietz, Piotr Bański, Adrien Barbaresi, Simon Clematide & Ines Pisetta (eds), 10–19. Leibniz-Institut für Deutsche Sprache. [URL] (30 October 2022).

Glenberg, Arthur M. & Kaschak, Michael P.

2002 Grounding language in action. Psychonomic Bulletin & Review 9(3): 558–565.

Goldberg, Adele E.

1995 Constructions: A Construction Grammar Approach to Argument Structure. Chicago IL: The University of Chicago Press.

Goodman, Steven N., Fanelli, Daniele & Ioannidis, John P. A.

2016 What does research reproducibility mean? Science Translational Medicine 8(341): 341ps12.

Hüffmeier, Joachim, Mazei, Jens & Schultze, Thomas

2016 Reconceptualizing replication as a sequence of different studies: A replication typology. Journal of Experimental Social Psychology 66: 81–92.

Hunston, Susan

2008 Collection strategies and design decisions. In Corpus Linguistics: An International Handbook [HSK 29.1], Anke Lüdeling & Merja Kytö (eds), 154–168. Berlin: Walter de Gruyter.

Kupietz, Marc, Belica, Cyril, Keibel, Holger & Witt, Andreas

2010 The German Reference Corpus DeReKo: A primordial sample for linguistic research. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner & Daniel Tapias (eds), 1848–1854. Valletta: European Language Resources Association. [URL] (19 May 2024).

Kytö, Merja & Rissanen, Matti

1988 The Helsinki Corpus of English Texts: Classifying and coding the diachronic part. In Corpus Linguistics, Hard and Soft: Proceedings of the Eighth International Conference on English Language Research on Computerized Corpora, Merja Kytö, Ossi Ihalainen & Matti Rissanen (eds), 169–179. Amsterdam: Rodopi.

Larsson, Tove

2021 Has ‘the replication crisis’ reached corpus linguistics? Blog Linguistics with a Corpus. [URL] (25 October 2022).

Lehmberg, Timm, Rehm, Georg, Witt, Andreas & Zimmermann, Felix

2008 Digital text collections, linguistic research data, and mashups: Notes on the legal situation. Library Trends 57: 52–71.

Lewis, William D., Farrar, Scott & Langendoen, D. Terence

2006 Linguistics in the Internet age: Tools and fair use. In Proceedings of the EMELD’06 Workshop on Digital Language Documentation: Tools and Standards: The State of the Art. Lansing, MI. [URL] (6 January 2023).

Machery, Edouard

2020 What is a replication? Philosophy of Science 87(4): 545–567.

McCreadie, Richard, Soboroff, Ian, Lin, Jimmy, Macdonald, Craig, Ounis, Iadh & McCullough, Dean

2012 On building a reusable Twitter corpus. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval – SIGIR ’12, 1113. Portland OR: ACM Press.

Morey, Richard D. et al.

2021 A pre-registered, multi-lab non-replication of the action-sentence compatibility effect (ACE). Psychonomic Bulletin & Review 29: 613–626.

Omidian, Taha, Balance, Oliver James & Siyanova-Chanturia, Anna

2021 Replicating corpus-based research in English for academic purposes: Proposed replication of Cortes (2013) and Biber and Gray (2010). Language Teaching, 1–9.

Perek, Florent

2021 Distributional semantic models for English verbs and nouns. Open Science Framework.

Rehm, Georg, Witt, Andreas, Zinsmeister, Heike & Dellert, Johannes

2015 Corpus masking: Legally bypassing licensing restrictions for the free distribution of text collections. In Digital Humanities 2007, 2nd edn, Sara Schmidt, Ray Siemens, Amit Kumar & John Unsworth (eds), 166–170. Urbana-Champaign IL: University of Illinois. [URL] (19 May 2024).

Rissanen, Matti

1989 Three problems connected with the use of diachronic corpora. ICAME Journal 13: 16–19.

Rosati, Eleonora

2021 The DSM Directive two years on: Do things ever get easier? IIC – International Review of Intellectual Property and Competition Law 52(9): 1139–1142.

Schäfer, Roland & Bildhauer, Felix

2012 Building large corpora from the web using a new efficient tool chain. In Proceedings of LREC 2012, Nicoletta Calzolari, Khalid Choukri, Terry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds), 486–493. [URL] (25 October 2022).

2013 Web Corpus Construction. San Rafael CA: Morgan & Claypool.

Schmidt, Stefan

2009 Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology 13(2): 90–100.

Schneider, Roman

2020 A corpus linguistic perspective on contemporary German pop lyrics with the multi-layer annotated “Songkorpus”. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds), 842–848. Marseille: European Language Resources Association. [URL] (19 May 2024).

Sönning, Lukas & Werner, Valentin

2021 The replication crisis, scientific revolutions, and linguistics. Linguistics 59(5): 1179–1206.

Stefanowitsch, Anatol & Gries, Stefan T.

2003 Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8(2): 209–243.

Vandeweerd, Nathan, Housen, Alex & Paquot, Magali

2021 Applying phraseological complexity measures to L2 French: A partial replication study. International Journal of Learner Corpus Research 7(2): 197–229.

Wilkinson, Mark D. et al.

2016 The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3(1): 160018.

Winter, Bodo & Grice, Martine

2021 Independence and generalizability in linguistics. Linguistics 59(5): 1251–1277.

Yamamoto, Mutsumi

1999 Animacy and Reference. A Cognitive Approach to Corpus Linguistics [Studies in Language Companion Series 46]. Amsterdam: John Benjamins.

Zaenen, Annie, Carletta, Jean, Garretson, Gregory, Bresnan, Joan, Koontz-Garboden, Andrew, Nikitina, Tatiana, O’Connor, M. Catherine & Wasow, Tom

2004 Animacy encoding in English: Why and how. In DiscAnnotation ’04, Bonnie Webber & Donna Byron (eds), 118–125. Stroudsburg PA: Association for Computational Linguistics.

Zwaan, Rolf A., Etz, Alexander, Lucas, Richard E. & Donnellan, M. Brent

2018 Making replication mainstream. Behavioral and Brain Sciences 41, E120.