Improving term candidates selection using terminological tokens

Vàzquez, Mercè; Oliver, Antoni

doi:10.1075/term.00016.vaz

Article published in:

Computational terminology and filtering of terminological information
Edited by Patrick Drouin, Natalia Grabar, Thierry Hamon, Kyo Kageura and Koichi Takeuchi
[Terminology 24:1] 2018
pp. 122–147

Mercè Vàzquez | Universitat Oberta de Catalunya

Antoni Oliver | Universitat Oberta de Catalunya

Terms identified from domain-specific corpora by computational methods must still be validated manually by specialists, which is a highly time-consuming activity. To reduce this effort and improve term candidate selection, we implemented the Token Slot Recognition (TSR) method, a filtering method based on terminological tokens that ranks the term candidates extracted from domain-specific corpora. This paper presents the implementation of this filtering method in linguistic and statistical approaches to automatic term extraction, using several domain-specific corpora in different languages. We observed that the filtering method improves term candidate selection: it ranks more terms at the top of the term candidate list than raw frequency does, and for statistical term extraction the improvement is between 15% and 25% in both precision and recall. Our analyses further revealed a reduction in the number of term candidates that specialists have to validate manually. In conclusion, the Token Slot Recognition filtering method significantly reduces the number of term candidates extracted automatically from domain-specific corpora, so specialists can validate them easily and quickly.
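The ranking idea summarised above can be illustrated with a small sketch. This is not the authors' implementation (the abstract does not give the algorithm); it assumes, purely for illustration, that a "terminological token" is a token of an already known term, recorded by the position (slot) it occupies, and that candidates are ranked by how many of their tokens match a known token in the same slot, with raw frequency as a tie-breaker. The function names, the reference terms, the candidate frequencies and the gold list are all hypothetical.

```python
def slot_tokens(reference_terms):
    """Collect tokens of known terms by slot: first, middle or last position."""
    slots = {"first": set(), "middle": set(), "last": set()}
    for term in reference_terms:
        tokens = term.lower().split()
        if not tokens:
            continue
        slots["first"].add(tokens[0])
        slots["last"].add(tokens[-1])
        slots["middle"].update(tokens[1:-1])
    return slots


def slot_score(candidate, slots):
    """Count candidate tokens that match a known token in the same slot."""
    tokens = candidate.lower().split()
    if not tokens:
        return 0
    score = 1 if tokens[0] in slots["first"] else 0
    if len(tokens) > 1 and tokens[-1] in slots["last"]:
        score += 1
    score += sum(1 for tok in tokens[1:-1] if tok in slots["middle"])
    return score


def rank_candidates(candidate_freqs, slots):
    """Sort by slot score, then by raw frequency, then alphabetically."""
    return sorted(
        candidate_freqs,
        key=lambda c: (-slot_score(c, slots), -candidate_freqs[c], c),
    )


def precision_recall_at(ranked, gold, n):
    """Precision and recall of the top-n candidates against a gold term list."""
    top = ranked[:n]
    hits = sum(1 for c in top if c in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the paper.
reference_terms = ["interest rate", "exchange rate", "monetary policy"]
candidate_freqs = {
    "other hand": 9,
    "last year": 7,
    "exchange rate": 5,
    "interest policy": 3,
}
gold = {"exchange rate", "interest policy"}

slots = slot_tokens(reference_terms)
ranked = rank_candidates(candidate_freqs, slots)
by_freq = sorted(candidate_freqs, key=lambda c: (-candidate_freqs[c], c))

print(ranked)
print(precision_recall_at(ranked, gold, 2))
print(precision_recall_at(by_freq, gold, 2))
```

On this toy input the slot-based ranking places both gold terms ahead of the frequent non-terms, whereas ranking by raw frequency alone places them last; the size of any such effect on real corpora is what the paper evaluates.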

Keywords: automatic term extraction, terminology extraction, domain-specific corpora, terminological tokens, TSR filtering method, TBXTools, term candidates, terminological units

Article outline

1. Introduction
2. Background
3. Materials and methods
4. Results and discussion
- 4.1 Experimental settings
- 4.2 Term extraction procedure
- 4.3 Results and evaluation
  - Results for JRC Economics English
    - Statistical term extraction
    - Linguistic term extraction
  - Results for JRC Economics Spanish
  - Results for JRC Economics French
  - Results for IULA Economics Spanish
  - Results for IULA Health Spanish
  - Results for TERMCAT Social Services Spanish
  - Results for TERMCAT Social Services Catalan
- 4.4 Discussion
5. Conclusions and future work
References

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

For any use beyond this license, please contact the publisher.

Published online: 31 May 2018

https://doi.org/10.1075/term.00016.vaz

Cited by (3)

Lei, Lei, Yaochen Deng & Dilin Liu

2023. Examining research topics with a dependency-based noun phrase extraction method: a case in accounting. Library Hi Tech 41:2, pp. 570 ff.

Kister, Laurence & Evelyne Jacquey

2022. Identification d’occurrences de candidats termes dans des articles scientifiques. Corela 20-1

Martín-Chozas, Patricia, Karen Vázquez-Flores, Pablo Calleja, Elena Montiel-Ponsoda, Víctor Rodríguez-Doncel, Julia Bosque-Gil, Milan Dojchinovski & Philipp Cimiano

2022. TermitUp: Generation and enrichment of linked terminologies. Semantic Web 13:6, pp. 967 ff.

This list is based on CrossRef data as of 10 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.