Term distance, frequency and collocations

Johnsen, Lars G.

doi:10.1075/cilt.356.02joh

Part of

Language and Text: Data, models, information and applications
Edited by Adam Pawłowski, Jan Mačutek, Sheila Embleton and George Mikros
[Current Issues in Linguistic Theory 356] 2021
pp. 21–36

Lars G. Johnsen | National Library of Norway

In this paper I study two co-occurrence measures, both local to a particular corpus, for constructing collocations or relevance relations between words or terms. One is a distance measure; the other uses different co-occurrence windows, one contained in the other. Both are discussed in relation to the common method of comparing co-occurrence measures within a particular corpus to those of a reference corpus. A practical consequence is that these measures may remove the need to compute a reference statistic, which can be computationally costly. I also believe that distance, as a measure in itself, is of theoretical interest: because it differs from frequency, it may add something new to collocation analysis.

Keywords: collocation, term distance, frequency, Bayes, probability, concordance

Article outline

1. Introduction
2. Δ-score and Pointwise Mutual Information
3. Data and technical method
4. Collocations
- 4.1 Frequency and context enlargement
- 4.2 Distance
  - 4.2.1 The verb
  - 4.2.2 The noun
5. Discussion
Notes
References

Published online: 22 December 2021

https://doi.org/10.1075/cilt.356.02joh
