HAMLET
Hybrid Adaptable Machine Learning approach to Extract Terminology
Els Lefever
LT3 Language and Translation Technology Team
Automatic term extraction (ATE) is an important task within natural language processing, both in its own right and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach, in which candidate terms are extracted based on part-of-speech patterns and then filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This ambiguity has been a hurdle in the creation of data for ATE: datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The Annotated Corpora for Term Extraction Research (ACTER) contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The evaluation of the resulting system, HAMLET (Hybrid Adaptable Machine Learning approach to Extract Terminology), provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult, such as the extraction of rare terms and multiword terms, this study shows that supervised machine learning is a promising methodology for ATE.
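To make the traditional hybrid approach mentioned in the abstract concrete, the following is a minimal sketch, not HAMLET itself: candidate terms are selected with a part-of-speech pattern and then ranked with a simple statistical termhood score. The toy corpus, the tag set, the pattern, the reference counts, and the smoothed frequency-ratio score are all illustrative assumptions and are not taken from the article. HAMLET replaces the statistical filtering step with a supervised binary random forest classifier over multiple feature types.

```python
import re
from collections import Counter

# Hypothetical toy domain corpus as (token, POS) pairs; the simplified
# tag names are an assumption made for this sketch.
TAGGED_SENTENCES = [
    [("heart", "NOUN"), ("failure", "NOUN"), ("is", "VERB"), ("a", "DET"),
     ("chronic", "ADJ"), ("condition", "NOUN")],
    [("patients", "NOUN"), ("with", "ADP"), ("heart", "NOUN"),
     ("failure", "NOUN"), ("need", "VERB"), ("treatment", "NOUN")],
]

# Hypothetical candidate counts from a general-language reference corpus.
REFERENCE_COUNTS = Counter(
    {"condition": 50, "failure": 20, "treatment": 10, "heart": 8, "patients": 5}
)

# Each POS tag is mapped to one letter so a regex can describe the pattern.
TAG_CODES = {"ADJ": "A", "NOUN": "N"}

# Illustrative pattern: any run of adjectives/nouns that ends in a noun.
POS_PATTERN = re.compile(r"[AN]*N")


def extract_candidates(sentence, max_len=3):
    """Return every span of up to max_len tokens whose tags fit POS_PATTERN."""
    codes = "".join(TAG_CODES.get(pos, "x") for _, pos in sentence)
    candidates = []
    for start in range(len(sentence)):
        for end in range(start + 1, min(start + max_len, len(sentence)) + 1):
            if POS_PATTERN.fullmatch(codes[start:end]):
                tokens = [token for token, _ in sentence[start:end]]
                candidates.append(" ".join(tokens))
    return candidates


def termhood(candidate, domain_counts, reference_counts):
    """Ratio of domain to reference relative frequency (add-one smoothed)."""
    domain_rel = domain_counts[candidate] / sum(domain_counts.values())
    reference_rel = (reference_counts.get(candidate, 0) + 1) / (
        sum(reference_counts.values()) + 1
    )
    return domain_rel / reference_rel


# Count candidates over the whole domain corpus, then rank by termhood.
domain_counts = Counter(
    cand for sent in TAGGED_SENTENCES for cand in extract_candidates(sent)
)
ranked = sorted(
    ((cand, termhood(cand, domain_counts, REFERENCE_COUNTS))
     for cand in domain_counts),
    key=lambda pair: pair[1],
    reverse=True,
)

for cand, score in ranked:
    print(f"{score:8.2f}  {cand}")
```

In this sketch, candidates that are frequent in the domain corpus but rare in the reference corpus rise to the top of the ranking. A supervised system in the spirit of the one described above could reuse the same candidate selection step but feed such scores, alongside other features, to a classifier that labels each candidate as term or non-term.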
Article outline
- 1. Introduction
- 2. Related research
- 3. ACTER Annotated Corpora for Term Extraction Research
- 4. Methodology and experiments
  - 4.1 Experimental setup
    - 4.1.1 Preprocessing and CT selection based on POS
    - 4.1.2 Features
    - 4.1.3 Algorithm, evaluation, and optimisation
  - 4.2 Results per corpus
- 5. Analysis and discussion
  - 5.1 Error analysis
  - 5.2 Impact of annotation types
  - 5.3 Impact of features
    - 5.3.1 Feature group selection
    - 5.3.2 Feature importance
- 6. Conclusions and future research
- Notes
- References