Working with language data

Joshua Wilbur and Michael Rießler

Any empirical study of language use ultimately relies on linguistic data, and is commonly (but not necessarily) associated with linguistic corpora or other kinds of databases representing samples of natural language use. This chapter describes more precisely what is involved in collecting and maintaining linguistic data sets for scientific purposes. It covers the history of the topic in linguistics and beyond, but focuses especially on aspects that have become particularly relevant since the digital turn, that is, since the emergence of digital technologies that allow non-data-scientists to collect, analyze and preserve unprecedentedly large data sets on their own. Nowadays, most people with a bit of training can use their own personal computers and freely available software to work with large sets of digital data throughout the entire process, from collection and analysis to preservation. While we highlight aspects that are particularly relevant for pragmatics, much of this presentation is equally valid for other fields that collect or analyze usage-based linguistic data, or indeed for any field working with empirical data collected as a sample of human behavior related to language and communication. After this introductory section, we cover some of the most important concepts: data and metadata, legal aspects, ethical aspects, data formats, versioning, digitization, archiving, discoverability, reproducibility and citation.
