A multi-dimensional comparison of the effectiveness and efficiency of association measures in collocation extraction
Because collocations are ubiquitous and important in language use and learning, how to identify them effectively and efficiently has been a topic of interest. Although some studies have evaluated many of the existing association measures (AMs) used in the automatic identification of collocations, their results so far have been inconsistent and unclear because of various limitations in those studies. Hence, this study conducts a multi-dimensional evaluation of the effectiveness and efficiency of seven major AMs in identifying three types of collocations across five genres and seven corpora of different sizes. The results indicate that while a few AMs, such as Log Likelihood Ratio and Cubic Mutual Information (MI3), are consistently more effective and efficient than the other five AMs examined, no single AM may be adequate for identifying different types of collocations across different genres and corpus sizes. Research implications are also discussed.
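To make the two measures named above concrete, the following is a minimal Python sketch of how Log Likelihood Ratio and MI3 are commonly computed from co-occurrence counts. The abstract does not give formulas, so the definitions here are the standard textbook ones (a 2x2 contingency table of observed versus expected frequencies) and may differ in detail from the formulations used in the study. The function names and the counts in the usage lines are hypothetical and purely illustrative.

```python
import math


def contingency(o11, f1, f2, n):
    """Observed and expected cells of the 2x2 table for a word pair.

    o11: co-occurrence frequency of the pair (assumed input)
    f1, f2: corpus frequencies of the first and second word
    n: total number of pair positions in the corpus
    """
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    row_totals = [f1, f1, n - f1, n - f1]
    col_totals = [f2, n - f2, f2, n - f2]
    # Expected frequency under independence: row total * column total / n
    expected = [r * c / n for r, c in zip(row_totals, col_totals)]
    return observed, expected


def mi3(o11, f1, f2, n):
    """Cubic mutual information: log2(O11^3 / E11)."""
    e11 = f1 * f2 / n
    return math.log2(o11 ** 3 / e11)


def log_likelihood(o11, f1, f2, n):
    """Log-likelihood ratio: 2 * sum(O * ln(O / E)) over the four cells."""
    observed, expected = contingency(o11, f1, f2, n)
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Hypothetical counts, for illustration only.
score_mi3 = mi3(10, 100, 200, 10_000)
score_ll = log_likelihood(10, 100, 200, 10_000)
```

Under these definitions, both scores grow as the observed co-occurrence count exceeds what independence would predict, and the log-likelihood score is zero when observed and expected counts coincide.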
Article outline
- 1. Introduction
- 2. Background and rationale: Key issues regarding collocation definition/identification
- 2.1 Definition and types of collocations
- 2.2 Existing AMs and studies on the effectiveness and efficiency of AMs
- 3. Methodology
- 3.1 AMs and factors included for evaluation and comparison
- 3.2 Corpora used
- 3.3 Tools and procedures used for data analysis and AM evaluation/comparison
- 4. Results and discussion
- 4.1 Results for Research Question 1: Variations among AMs in the general corpus
- 4.2 Results for Research Question 2: Effects of genres
- 4.3 Results for Research Question 3: Effects of collocation types
- 4.4 Results for Research Question 4: Effects of text length
- 4.5 Summary discussion
- 5. Conclusions
- Acknowledgements
- Note
- References