This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation. Multi-unit measures face an additional segmentation problem: once the implicit length constraint of pairwise measures is abandoned, association measures must also identify the borders of meaningful sequences. This paper takes a vector-based approach to the segmentation problem by using 18 unique measures to describe different aspects of multi-unit association. An examination of these measures across eight languages shows that they are stable across languages and that each provides a unique rank of associated sequences. Taken together, these measures expand corpus-based approaches to association by generalizing across varying lengths and types of representation.
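For readers unfamiliar with the pairwise ΔP measure that the multi-unit measures build on, the following is a minimal sketch of its commonly used two-direction formulation over adjacent word pairs. It is not the paper's implementation and does not cover the multi-unit measures. The names `delta_p` and `_ratio` and the toy token list are illustrative assumptions, as is the choice to count only adjacent bigrams.

```python
from collections import Counter


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def delta_p(tokens, w1, w2):
    """Return (left-to-right, right-to-left) ΔP for the adjacent pair (w1, w2).

    Cells of the 2x2 contingency table over all adjacent bigrams:
      a = w1 followed by w2        b = w1 followed by something else
      c = something else then w2   d = neither w1 first nor w2 second
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    a = bigrams[(w1, w2)]
    b = sum(n for (x, y), n in bigrams.items() if x == w1 and y != w2)
    c = sum(n for (x, y), n in bigrams.items() if x != w1 and y == w2)
    d = total - a - b - c
    # Left-to-right: P(w2 | w1) - P(w2 | not w1)
    left_to_right = _ratio(a, a + b) - _ratio(c, c + d)
    # Right-to-left: P(w1 | w2) - P(w1 | not w2)
    right_to_left = _ratio(a, a + c) - _ratio(b, b + d)
    return left_to_right, right_to_left


# Toy usage (hypothetical data): the two directions differ for the same pair.
tokens = "the cat sat on the mat and the cat ran".split()
lr, rl = delta_p(tokens, "the", "cat")
print(f"LR: {lr:.3f}  RL: {rl:.3f}")
```

Because the two directions condition on different words, the same pair can receive different scores in each direction, which is what makes the measure directional. A pairwise function like this is fixed at length two; the segmentation problem described above arises once that constraint is lifted.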