Hybrid Adaptable Machine Learning approach to Extract Terminology
Automatic term extraction (ATE) is an important task within natural language processing, both on its own and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach, in which candidate terms are extracted based on part-of-speech patterns and then filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This ambiguity has been a hurdle in the creation of data for ATE: datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The ACTER (Annotated Corpora for Term Extraction Research) dataset contains manual term annotations in four domains and three languages, and has been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. A detailed analysis of the resulting system, HAMLET (Hybrid Adaptable Machine Learning approach to Extract Terminology), gives insight into its strengths and weaknesses. The analysis highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows that the system appears to have learnt a robust definition of terms, producing state-of-the-art results that contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult, such as the extraction of rare terms and multiword terms, this study shows that supervised machine learning is a promising methodology for ATE.
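For readers unfamiliar with the traditional hybrid approach that the abstract contrasts with HAMLET, the following is a minimal sketch of the idea, not the authors' system. It assumes hand-written part-of-speech tags, an adjective-noun pattern, and a toy termhood score (domain frequency divided by smoothed reference frequency); the data, the pattern, and the names `TAGGED`, `REFERENCE_COUNTS`, `extract_candidates` and `rank_by_termhood` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical pre-tagged domain text: (token, coarse POS tag) pairs.
# A real pipeline would obtain these from a POS tagger.
TAGGED = [
    ("the", "DET"), ("wind", "NOUN"), ("turbine", "NOUN"), ("converts", "VERB"),
    ("kinetic", "ADJ"), ("energy", "NOUN"), ("and", "CONJ"), ("the", "DET"),
    ("wind", "NOUN"), ("turbine", "NOUN"), ("drives", "VERB"), ("a", "DET"),
    ("generator", "NOUN"),
]

# Hypothetical candidate counts from a general-language reference corpus.
REFERENCE_COUNTS = {"generator": 3}

# Assumed pattern: zero or more adjectives followed by one or more nouns.
PATTERN = re.compile(r"A*N+")
CODES = {"ADJ": "A", "NOUN": "N"}


def extract_candidates(tagged):
    """Count maximal token sequences whose POS tags match PATTERN."""
    # One character per tag, so regex offsets equal token indices.
    encoded = "".join(CODES.get(tag, "x") for _, tag in tagged)
    counts = Counter()
    for match in PATTERN.finditer(encoded):
        tokens = [tok for tok, _ in tagged[match.start():match.end()]]
        counts[" ".join(tokens)] += 1
    return counts


def rank_by_termhood(domain_counts, reference_counts):
    """Sort candidates by a toy termhood score, highest first."""
    scored = [
        (cand, freq / (reference_counts.get(cand, 0) + 1))
        for cand, freq in domain_counts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = extract_candidates(TAGGED)
ranked = rank_by_termhood(candidates, REFERENCE_COUNTS)
for candidate, score in ranked:
    print(f"{score:.2f}  {candidate}")
```

A supervised approach such as the one described above would instead treat scores like these as features for a classifier, rather than as a final ranking.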
Keywords: terminology, automatic term extraction, comparable corpora, named entities
Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.
Published online: 20 August 2021