HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology

Rigouts Terryn, Ayla; Hoste, Véronique; Lefever, Els

doi:10.1075/term.20017.rig

Article published in:

Terminology
Vol. 27:2 (2021), pp. 254–293

Ayla Rigouts Terryn | LT3 Language and Translation Technology Team

Véronique Hoste | LT3 Language and Translation Technology Team

Els Lefever | LT3 Language and Translation Technology Team

Automatic term extraction (ATE) is an important task within natural language processing, both on its own and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach, in which candidate terms are extracted based on part-of-speech patterns and then filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This ambiguity has been a hurdle in the creation of data for ATE: datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The Annotated Corpora for Term Extraction Research (ACTER) contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system (HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology) provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing results that are state-of-the-art and contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult – such as the extraction of rare terms and multiword terms – this study shows that supervised machine learning is a promising methodology for ATE.
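
As an illustration of the traditional hybrid approach mentioned above, the following minimal Python sketch selects candidate terms with a part-of-speech pattern and ranks them with a simple termhood score. It is not the HAMLET system or any published method: the pattern (adjectives and nouns ending in a noun), the smoothed log frequency ratio against a reference corpus, the function names, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def extract_candidates(tagged, max_len=3):
    """Return every n-gram (n <= max_len) of ADJ/NOUN tokens that ends in a NOUN."""
    candidates = []
    for start in range(len(tagged)):
        for end in range(start + 1, min(start + max_len, len(tagged)) + 1):
            span = tagged[start:end]
            only_adj_noun = all(pos in ("ADJ", "NOUN") for _, pos in span)
            if only_adj_noun and span[-1][1] == "NOUN":
                candidates.append(" ".join(word.lower() for word, _ in span))
    return candidates


def termhood(domain_counts, reference_counts):
    """Score candidates by the log ratio of add-one smoothed relative frequencies."""
    dom_total = sum(domain_counts.values())
    ref_total = sum(reference_counts.values())
    vocab = len(set(domain_counts) | set(reference_counts))
    scores = {}
    for cand, freq in domain_counts.items():
        dom_rel = (freq + 1) / (dom_total + vocab)
        ref_rel = (reference_counts.get(cand, 0) + 1) / (ref_total + vocab)
        scores[cand] = math.log(dom_rel / ref_rel)
    return scores


# Hypothetical POS-tagged domain text (a real pipeline would use a tagger).
domain_tagged = [
    ("Heart", "NOUN"), ("failure", "NOUN"), ("is", "VERB"), ("a", "DET"),
    ("chronic", "ADJ"), ("condition", "NOUN"), (".", "PUNCT"),
    ("Heart", "NOUN"), ("failure", "NOUN"), ("requires", "VERB"),
    ("treatment", "NOUN"), (".", "PUNCT"),
]

# Hypothetical candidate frequencies in a general-language reference corpus.
reference_counts = Counter({"heart": 5, "failure": 6, "condition": 4, "treatment": 3})

domain_counts = Counter(extract_candidates(domain_tagged))
scores = termhood(domain_counts, reference_counts)
ranked = sorted(scores, key=scores.get, reverse=True)

for cand in ranked:
    print(f"{cand}\t{scores[cand]:.3f}")
```

In this toy example the multiword candidates that never occur in the reference corpus rank highest, while common single words receive negative scores. A supervised system such as the one described in the abstract would instead use scores of this kind as features for a classifier.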

Keywords: terminology, automatic term extraction, comparable corpora, named entities

Article outline

1. Introduction
2. Related research
3. ACTER Annotated Corpora for Term Extraction Research
4. Methodology and experiments
- 4.1 Experimental setup
  - 4.1.1 Preprocessing and CT selection based on POS
  - 4.1.2 Features
  - 4.1.3 Algorithm, evaluation, and optimisation
- 4.2 Results per corpus
5. Analysis and discussion
- 5.1 Error analysis
- 5.2 Impact of annotation types
- 5.3 Impact of features
  - 5.3.1 Feature group selection
  - 5.3.2 Feature importance
6. Conclusions and future research
Notes
References

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

For any use beyond this license, please contact the publisher.

Published online: 20 August 2021

https://doi.org/10.1075/term.20017.rig
