This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that quantify association in sequences of varying length and type of representation. Multi-unit measures face an additional segmentation problem: once the implicit length constraint of pairwise measures is abandoned, association measures must also identify the borders of meaningful sequences. This paper takes a vector-based approach to the segmentation problem by using 18 distinct measures to describe different aspects of multi-unit association. An examination of these measures across eight languages shows that they are stable across languages and that each produces a distinct ranking of associated sequences. Taken together, these measures expand corpus-based approaches to association by generalizing across varying lengths and types of representation.
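As background for the pairwise measure named above, the following is a minimal sketch of ΔP as it is commonly defined over a 2×2 contingency table of adjacent pairs, computed in both directions. It is not the paper's multi-unit formulation. The function name `delta_p`, the toy token list, and the choice to return 0.0 for an empty margin are assumptions made only for illustration.

```python
from collections import Counter


def delta_p(tokens, first, second):
    """Pairwise Delta-P for the adjacent pair (first, second), in both directions.

    Hypothetical helper for illustration; not taken from the paper.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())

    # 2x2 contingency table over adjacent pairs:
    # a = (first, second), b = (first, other),
    # c = (other, second), d = (other, other)
    a = bigrams[(first, second)]
    b = sum(n for (x, _), n in bigrams.items() if x == first) - a
    c = sum(n for (_, y), n in bigrams.items() if y == second) - a
    d = total - a - b - c

    def ratio(num, den):
        # Assumption: an empty margin contributes 0.0 rather than raising.
        return num / den if den else 0.0

    # Left-to-right: how much does seeing `first` raise the chance of `second`?
    left_to_right = ratio(a, a + b) - ratio(c, c + d)
    # Right-to-left: how much does seeing `second` raise the chance of `first`?
    right_to_left = ratio(a, a + c) - ratio(b, b + d)
    return left_to_right, right_to_left


# Toy data, invented for this example only.
tokens = "of course the plan of course the plan of the day".split()
lr, rl = delta_p(tokens, "of", "course")
print(round(lr, 3), round(rl, 3))  # 0.667 0.875
```

The two directions differ for the same pair, which is the property that makes the measure directional. A pairwise table like this one fixes the sequence length at two, which is the implicit constraint that multi-unit measures give up.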