Multi-unit association measures
Moving beyond pairs of words
Jonathan Dunn | University of Canterbury
This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation. Multi-unit measures face an additional segmentation problem: once the implicit length constraint of pairwise measures is abandoned, association measures must also identify the borders of meaningful sequences. This paper takes a vector-based approach to the segmentation problem by using 18 unique measures to describe different aspects of multi-unit association. An examination of these measures across eight languages shows that they are stable across languages and that each provides a unique rank of associated sequences. Taken together, these measures expand corpus-based approaches to association by generalizing across varying lengths and types of representation.
Keywords: association strength, multi-unit association, sequences, ΔP, collocations
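For orientation, the pairwise ΔP measure that the multi-unit measures build on can be sketched as follows. This is a minimal illustration of the standard two-by-two contingency formulation for adjacent word pairs, not the paper's implementation. The function name `pairwise_delta_p`, the toy sentence, and the labels `lr` (left-to-right) and `rl` (right-to-left) are assumptions introduced here, and none of the multi-unit measures are reproduced.

```python
from collections import Counter


def pairwise_delta_p(tokens):
    """Return {(w1, w2): (lr, rl)} for every adjacent pair in `tokens`.

    lr = P(w2 | w1) - P(w2 | not w1)   (left-to-right association)
    rl = P(w1 | w2) - P(w1 | not w2)   (right-to-left association)
    Illustrative sketch only; names and input format are assumptions.
    """
    n_pairs = len(tokens) - 1
    pair_counts = Counter(zip(tokens, tokens[1:]))
    left_counts = Counter(tokens[:-1])   # word occurs as first member of a pair
    right_counts = Counter(tokens[1:])   # word occurs as second member of a pair

    scores = {}
    for (w1, w2), a in pair_counts.items():
        b = left_counts[w1] - a    # w1 followed by something other than w2
        c = right_counts[w2] - a   # w2 preceded by something other than w1
        d = n_pairs - a - b - c    # pairs containing neither w1 first nor w2 second
        lr = a / (a + b) - (c / (c + d) if c + d else 0.0)
        rl = a / (a + c) - (b / (b + d) if b + d else 0.0)
        scores[(w1, w2)] = (lr, rl)
    return scores


# Toy input, purely for illustration.
tokens = "of course we know of course they know".split()
scores = pairwise_delta_p(tokens)
lr, rl = scores[("course", "we")]
print(f"course -> we: LR = {lr:.3f}, RL = {rl:.3f}")
```

In the toy input, "course we" receives different scores in the two directions, which is the directional asymmetry that ΔP is designed to capture and that symmetric association measures cannot express.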
Article outline
- 1. Introduction
- 2. Direction of association and sequence length
- 3. Data and methodology
- 4. Analysis: Formulating multi-unit association measures
  - 4.1 Mean ΔP and Sum ΔP
  - 4.2 Minimum ΔP
  - 4.3 Reduced ΔP
  - 4.4 Divided ΔP
  - 4.5 End-point ΔP
  - 4.6 Changed ΔP
  - 4.7 Summarizing the association measures
- 5. Discussion: Empirical analysis of association measures
  - 5.1 Relations between directions and measures
  - 5.2 Stability across languages and representation types
- 6. Using the measures together
- 7. Conclusions
- Acknowledgements
- References
Published online: 05 October 2018
https://doi.org/10.1075/ijcl.16098.dun