Automatic term extraction (ATE) is an important task within natural language processing, both in its own right and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach, in which candidate terms are extracted based on part-of-speech patterns and then filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some fundamental problems remain unsolved, such as the ambiguous nature of the concept “term”. This ambiguity has been a hurdle in the creation of data for ATE: datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The Annotated Corpora for Term Extraction Research (ACTER) contain manual term annotations in four domains and three languages, and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system, HAMLET (Hybrid Adaptable Machine Learning approach to Extract Terminology), is analysed in detail to identify its strengths and weaknesses. The analysis highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows that the system appears to have learnt a robust definition of terms, producing state-of-the-art results with few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult, such as the extraction of rare terms and multiword terms, this study shows that supervised machine learning is a promising methodology for ATE.
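For readers unfamiliar with the traditional hybrid approach mentioned in the abstract, the following is a minimal sketch of the general idea, not of HAMLET or of any system described in the article. It assumes the input is already tokenised and tagged with coarse part-of-speech labels. The pattern set, the function names `extract_candidates` and `termhood`, and the add-one smoothed frequency ratio are all hypothetical choices made purely for illustration.

```python
from collections import Counter

# Hypothetical POS patterns for candidate terms (illustrative only).
PATTERNS = {("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")}


def extract_candidates(tagged_tokens, patterns=PATTERNS, max_len=2):
    """Count every n-gram whose POS sequence matches one of the patterns."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tagged_tokens) - n + 1):
            window = tagged_tokens[i:i + n]
            if tuple(pos for _, pos in window) in patterns:
                counts[" ".join(word.lower() for word, _ in window)] += 1
    return counts


def termhood(domain_counts, reference_counts):
    """Rank candidates by a smoothed relative-frequency ratio.

    Candidates that are frequent in the domain corpus but rare in the
    general reference corpus receive a higher score (assumed measure).
    """
    d_total = sum(domain_counts.values()) or 1
    r_total = sum(reference_counts.values()) or 1
    scores = {}
    for cand, freq in domain_counts.items():
        d_rel = freq / d_total
        # Add-one smoothing so unseen candidates do not divide by zero.
        r_rel = (reference_counts.get(cand, 0) + 1) / (r_total + 1)
        scores[cand] = d_rel / r_rel
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy usage: a tagged domain sentence and invented reference-corpus counts.
domain = [("Heart", "NOUN"), ("failure", "NOUN"), ("is", "VERB"),
          ("a", "DET"), ("serious", "ADJ"), ("condition", "NOUN")]
candidates = extract_candidates(domain)
ranked = termhood(candidates, Counter({"condition": 5, "heart": 1}))
```

In this toy run, candidates that are common in the reference counts (such as "condition") fall to the bottom of the ranking. A supervised system like the one the abstract describes would instead pass such statistics, alongside other features, to a trained classifier.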