A multi-dimensional comparison of the effectiveness and efficiency of association measures in collocation extraction

Deng, Yaochen; Liu, Dilin

doi:10.1075/ijcl.19111.den

Article published in:

International Journal of Corpus Linguistics
Vol. 27:2 (2022), pp. 191–219

Yaochen Deng | Dalian University of Foreign Languages

Dilin Liu | The University of Alabama

Because collocations are ubiquitous and important in language use and learning, how to identify them effectively and efficiently has been a topic of interest. Although some studies have evaluated many of the existing association measures (AMs) used in the automatic identification of collocations, their results have so far been inconsistent and unclear because of various limitations in those studies. This study therefore conducts a multi-dimensional evaluation of the effectiveness and efficiency of seven major AMs in identifying three types of collocations across five genres and seven corpora of different sizes. The results indicate that while a few AMs, such as Log Likelihood Ratio and Cubic Mutual Information (MI³), are consistently more effective and efficient than the other five AMs examined, no single AM may be adequate for identifying different types of collocations across different genres and corpus sizes. Research implications are also discussed.
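
As a rough illustration of what the two AMs named in the abstract compute, the sketch below scores a word pair from its co-occurrence count, the two word frequencies, and the corpus size. It uses commonly cited textbook forms: MI³ as the base-2 log of the cubed observed frequency over the expected frequency, and the log-likelihood ratio over a 2×2 contingency table. The abstract does not state the exact formulas used in the study, so these definitions, the function names, and the toy counts are assumptions for illustration only.

```python
import math


def mi3(o11, f1, f2, n):
    """Cubic mutual information (assumed form): log2(O11^3 / E11).

    o11: co-occurrence count of the pair (must be > 0)
    f1, f2: corpus frequencies of the two words
    n: total number of pair positions in the corpus
    """
    e11 = f1 * f2 / n  # expected co-occurrence under independence
    return math.log2(o11 ** 3 / e11)


def log_likelihood(o11, f1, f2, n):
    """Log-likelihood ratio (assumed form) over the 2x2 contingency table."""
    # Observed cells: pair, word1 without word2, word2 without word1, neither.
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    # Expected cells under independence, in the same order.
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


if __name__ == "__main__":
    # Hypothetical counts: a pair seen 8 times in a 1,024-pair toy corpus.
    print(mi3(8, 16, 32, 1024))
    print(log_likelihood(8, 16, 32, 1024))
```

Both functions return higher scores for pairs that co-occur more often than chance would predict. A pair whose observed count equals its expected count gets a log-likelihood of zero. Cubing the observed frequency in MI³ is generally understood as a way to reduce the bias of plain MI toward rare pairs.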

Keywords: association measures, collocations, collocation extraction, effectiveness and efficiency of association measures, multi-dimensional evaluation

Article outline

1. Introduction
2. Background and rationale: Key issues regarding collocation definition/identification
- 2.1 Definition and types of collocations
- 2.2 Existing AMs and studies on the effectiveness and efficiency of AMs
3. Methodology
- 3.1 AMs and factors included for evaluation and comparison
- 3.2 Corpora used
- 3.3 Tools and procedures used for data analysis and AM evaluation/comparison
4. Results and discussion
- 4.1 Results for Research Question 1: Variations among AMs in the general corpus
- 4.2 Results for Research Question 2: Effects of genres
- 4.3 Results for Research Question 3: Effects of collocation types
- 4.4 Results for Research Question 4: Effects of text length
- 4.5 Summary discussion
5. Conclusions
Acknowledgements
Note
References

Published online: 10 May 2022

https://doi.org/10.1075/ijcl.19111.den

