Diachronic collocations, genre, and DiaCollo
This chapter presents the formal basis for diachronic collocation profiling as implemented in
the open-source software tool “DiaCollo” and sketches some potential applications to multi-genre diachronic corpora.
Explicitly developed for the efficient extraction, comparison, and interactive visualization of collocations from a
diachronic text corpus, DiaCollo is suitable for processing collocation pairs whose association strength depends on
extralinguistic features such as the date of occurrence or text genre. By tracking changes in a word’s typical
collocates over time, DiaCollo can help to provide a clearer picture of diachronic changes in the word’s usage,
especially those related to semantic shift or discourse environment. The flexible DDC search engine back-end lets user
queries refer explicitly to genre and other document-level metadata, enabling e.g. independent genre-local profiles or
cross-genre comparisons. In addition to traditional static tabular display formats, a web-service plugin offers several
interactive online visualizations that allow diachronic profile data to be inspected immediately.
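The idea of a collocation profile whose association strengths depend on the date of occurrence can be illustrated with a small Python sketch. It is not DiaCollo's implementation: the function names, the fixed-width epoch binning, the symmetric token window, and the use of pointwise mutual information as the association score are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def epoch_of(year, width=50):
    """Map a year onto the lower bound of its epoch (hypothetical binning)."""
    return year - year % width


def diachronic_profile(docs, target, width=50, window=2):
    """Score the collocates of `target` separately for each epoch.

    docs: iterable of (year, tokens) pairs, tokens being a list of strings.
    Returns {epoch: [(collocate, score), ...]}, best-scoring collocate first.
    The score is pointwise mutual information, chosen here only as an
    illustrative association measure.
    """
    pair_counts = defaultdict(Counter)  # epoch -> collocate -> f(target, collocate)
    word_counts = defaultdict(Counter)  # epoch -> word -> f(word)
    totals = Counter()                  # epoch -> number of tokens

    for year, tokens in docs:
        epoch = epoch_of(year, width)
        totals[epoch] += len(tokens)
        word_counts[epoch].update(tokens)
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            # Count every token within `window` positions of the target.
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[epoch][tokens[j]] += 1

    profile = {}
    for epoch, pairs in pair_counts.items():
        n = totals[epoch]
        f1 = word_counts[epoch][target]
        scores = {
            w: math.log2(f12 * n / (f1 * word_counts[epoch][w]))
            for w, f12 in pairs.items()
        }
        # Sort by descending score; ties are broken alphabetically.
        profile[epoch] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return profile


if __name__ == "__main__":
    toy_docs = [
        (1810, ["der", "alte", "mann", "geht"]),
        (1820, ["ein", "alter", "mann", "und", "der", "alte", "mann"]),
        (1960, ["der", "junge", "mann", "kommt"]),
    ]
    for epoch, collocates in sorted(diachronic_profile(toy_docs, "mann", window=1).items()):
        print(epoch, collocates[:3])
```

Comparing the per-epoch lists side by side shows which collocates appear, disappear, or change rank over time. Restricting `docs` to a single genre before calling the function would give a genre-local profile in the same way.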
Article outline
- 1. Introduction
- 2. Related work
- 3. Implementation
- 3.1 Overview
- 3.2 Corpus data
- 3.3 Co-occurrence frequencies
- 3.3.1 Native co-occurrence relation
- 3.3.2 Term × document matrix co-occurrence relation
- 3.3.3 DDC co-occurrence relation
- 3.4 Scoring and pruning
- 3.5 Comparisons
- 3.6 Output & visualization
- 4. Examples
- 4.1 Adjectival attribution: What makes a “man”?
- 4.2 Pronominal adverbs and deictic locality
- 5. Conclusion
- Notes
- References