Tagging terms in text
A supervised sequential labelling approach to automatic term extraction
Ayla Rigouts Terryn | Ghent University
Véronique Hoste | Ghent University
Els Lefever | Ghent University
As with many tasks in natural language processing, automatic term extraction (ATE) is increasingly approached as a machine learning problem. So far, most machine learning approaches to ATE broadly follow the traditional hybrid methodology: first extracting a list of unique candidate terms, and then classifying these candidates based on the predicted probability that they are valid terms. However, with the rise of neural networks and word embeddings, the next development in ATE might be towards sequential approaches, i.e., classifying each occurrence of each token within its original context. To test the validity of such approaches for ATE, two sequential methodologies were developed, evaluated, and compared: one feature-based conditional random fields classifier and one embedding-based recurrent neural network. An additional comparison was made with a machine learning interpretation of the traditional approach. All systems were trained and evaluated on identical data in multiple languages and domains to identify their respective strengths and weaknesses. The sequential methodologies proved to be valid approaches to ATE, and the neural network even outperformed the more traditional approach. Interestingly, a combination of multiple approaches can outperform all of them separately, pointing to new ways to push the state of the art in ATE.
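To make the contrast concrete, sequential approaches like those compared here typically recast term extraction as per-token classification under a BIO scheme (B = begins a term, I = inside a term, O = outside any term). The sketch below is illustrative only, not the authors' code; the function name and example data are invented for the demonstration.

```python
# Minimal sketch (not from the paper): converting span-annotated text into
# BIO labels, the format a sequential tagger (CRF or RNN) is trained on.

def bio_tags(tokens, term_spans):
    """Map term annotations to per-token BIO labels.

    tokens: list of token strings
    term_spans: list of (start, end) token-index pairs, end exclusive
    """
    tags = ["O"] * len(tokens)
    for start, end in term_spans:
        tags[start] = "B"          # first token of the term
        for i in range(start + 1, end):
            tags[i] = "I"          # remaining tokens of the term
    return tags

tokens = ["The", "conditional", "random", "fields", "classifier", "converged"]
spans = [(1, 4)]  # "conditional random fields" annotated as a term
print(list(zip(tokens, bio_tags(tokens, spans))))
# → [('The', 'O'), ('conditional', 'B'), ('random', 'I'),
#    ('fields', 'I'), ('classifier', 'O'), ('converged', 'O')]
```

Note the key difference from the traditional pipeline: each *occurrence* of a token is labelled in context, so the same word can be part of a term in one sentence and not in another, something a list of unique candidate terms cannot express.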
Keywords: terminology, automatic term extraction, sequential labelling
Article outline
- 1. Introduction
- 2. Related research
- 2.1 Machine learning approaches
- 2.2 Evaluation
- 2.3 Features
- 2.4 Sequential approaches
- 3. Data
- 4. System description
- 4.1 CRFSuite feature-based sequential ATE
- 4.2 FlairNLP neural, embedding-based sequential ATE
- 4.3 HAMLET machine learning approach to traditional hybrid ATE
- 5. Experiments and results
- 5.1 Experimental setup
- 5.2 CRF results
- 5.3 RNN results
- 6. Analyses and discussion of results
- 6.1 Choice of experiments and motivation
- 6.2 Results per corpus
- 6.3 Sequential, neural approach vs. traditional, feature-based approach
- 6.4 Complementarity of results
- 7. RNN error analysis
- 8. Conclusion
- Notes
- References
Published online: 10 January 2022
https://doi.org/10.1075/term.21010.rig