Any empirical study of language use ultimately relies on linguistic data, and is commonly (but not necessarily) associated with linguistic corpora or other kinds of databases representing samples of natural language use. The goal of this chapter is to set out more precisely what is involved in collecting and maintaining linguistic data sets for scientific purposes. This includes the history of the topic in linguistics and beyond, with a particular focus on aspects that have become relevant since the digital turn, that is, since digital technologies have evolved that allow non-data-scientists to collect, analyze and preserve unprecedentedly large data sets on their own with readily available technology. Nowadays, most people with a bit of training can use their own personal computers and freely available software to work with large sets of digital data throughout the process, from collection and analysis to preservation. While we highlight aspects that are particularly relevant for pragmatics, much of this presentation is equally valid for other fields that collect or analyze usage-based linguistic data, and indeed for any field working with empirical data collected as a sample of human behavior related to language and communication. After this introductory section, we cover some of the most important concepts: data and metadata, legal aspects, ethical aspects, data formats, versioning, digitization, archiving, discoverability, reproducibility and citation.