Vol. 29:1 (2024), pp. 59–86
Association measures for collocation extraction
Automatic evaluation on a large-scale corpus
In this study, we propose a new evaluation scheme to assess the strengths and limitations of collocation extraction measures and explore type-sensitive methods for extracting collocations. We introduce the pooling strategy widely used in Information Retrieval and automate the evaluation process using online dictionaries. Sixteen well-known metrics are evaluated for effectiveness and then compared distributionally and linguistically. The results show that Group A methods (e.g. z-score, Dice, PMI) are more effective at extracting low-frequency collocations at relatively small extraction scales. In contrast, Group B methods (e.g. t-test, LMI, LLR) perform better at finding high-frequency collocations, and most of them outperform Group A methods as the extraction scale increases. Moreover, Group A favors NN collocations, while Group B identifies collocations with a wide range of syntactic structures. These findings offer guidance for future work on hybrid extraction methods, as well as for language educators and dictionary compilers.
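The contrast between the two groups can be illustrated with a minimal sketch. It assumes the common textbook definitions of five of the measures named above (PMI, Dice, z-score, t-score, LMI), which may differ in detail from the variants evaluated in the article. The function name `association_scores` and all counts are hypothetical and chosen only for illustration.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Score one word pair from raw counts (textbook formulas, assumed here).

    f_xy: co-occurrence frequency of the pair
    f_x, f_y: frequencies of the two words
    n: corpus size (number of pair positions)
    """
    # Expected co-occurrence frequency if the two words were independent.
    expected = f_x * f_y / n
    pmi = math.log2(f_xy / expected)
    return {
        "pmi": pmi,
        "dice": 2 * f_xy / (f_x + f_y),
        "z_score": (f_xy - expected) / math.sqrt(expected),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "lmi": f_xy * pmi,  # local mutual information: PMI weighted by frequency
    }


# Hypothetical counts: a rare but exclusive pair and a frequent, looser pair.
rare = association_scores(f_xy=5, f_x=6, f_y=7, n=1_000_000)
frequent = association_scores(f_xy=20_000, f_x=50_000, f_y=40_000, n=1_000_000)

for name in rare:
    winner = "rare pair" if rare[name] > frequent[name] else "frequent pair"
    print(f"{name}: ranks the {winner} higher")
```

With these invented counts, PMI, Dice, and z-score rank the rare pair higher, while t-score and LMI rank the frequent pair higher, which is consistent with the Group A and Group B tendencies described in the abstract.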
Article outline
- 1. Introduction
- 2. Approaches to collocation extraction
- 3. Methodology
- 3.1 Collocation extraction and validation
- 3.2 Evaluation of the extraction methods
- 4. Results
- 4.1 General performance
- 4.2 Intersection of the metrics
- 4.3 Frequency distribution
- 4.4 Syntactic structure
- 5. Discussion
- 6. Conclusions
- Notes
- References