Improving term candidates selection using terminological tokens
Terms identified automatically in domain-specific corpora have to be validated
manually by specialists, which is a highly time-consuming activity. To reduce
this effort and improve term candidate selection, we implemented the Token Slot
Recognition method, a filtering method based on terminological tokens that is
used to rank term candidates extracted from domain-specific corpora. This paper
presents the implementation of this filtering method within the linguistic and
statistical approaches to automatic term extraction, applied to several
domain-specific corpora in different languages. We observed that the filtering
method improves term candidate selection: it ranks more terms at the top of the
term candidate list than raw frequency does, and for statistical term
extraction the improvement is between 15% and 25% in both precision and recall.
Our analyses further revealed a reduction in the number of term candidates to
be validated manually by specialists. In conclusion, the Token Slot Recognition
filtering method significantly reduces the number of term candidates extracted
automatically from domain-specific corpora, so specialists can validate them
more easily and quickly.
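The precision and recall figures mentioned in the abstract can be illustrated with a minimal sketch of how a ranked candidate list might be scored against a reference term list. This is not the paper's evaluation code: the function name `precision_recall_at_n`, the cut-off `n`, and the toy term lists are all assumptions made for illustration, and the "filtered" ranking is simply hand-written rather than produced by the Token Slot Recognition method.

```python
def precision_recall_at_n(ranked_candidates, reference_terms, n):
    """Score the top-n candidates of a ranked list against a reference term list.

    Assumes the ranked list contains no duplicate candidates.
    """
    top = ranked_candidates[:n]
    reference = set(reference_terms)
    # A hit is a top-n candidate that also appears in the reference list.
    hits = sum(1 for candidate in top if candidate in reference)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Toy data invented for illustration; not taken from the paper's corpora.
reference = {"interest rate", "monetary policy", "exchange rate", "public debt"}

# Hypothetical ranking by raw frequency: non-terms appear near the top.
by_frequency = ["last year", "interest rate", "member state",
                "monetary policy", "exchange rate"]

# Hypothetical ranking after some filtering step moved terms upwards.
filtered = ["interest rate", "monetary policy", "exchange rate",
            "last year", "member state"]

p_freq, r_freq = precision_recall_at_n(by_frequency, reference, 3)
p_filt, r_filt = precision_recall_at_n(filtered, reference, 3)

print(f"raw frequency: precision={p_freq:.2f} recall={r_freq:.2f}")
print(f"filtered:      precision={p_filt:.2f} recall={r_filt:.2f}")
```

Scoring only the top `n` candidates mirrors the practical concern stated in the abstract: a specialist validates the list from the top, so a ranking that places more true terms early reduces the manual effort.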
Article outline
- 1. Introduction
- 2. Background
- 3. Materials and methods
- 4. Results and discussion
- 4.1 Experimental settings
- 4.2 Term extraction procedure
- 4.3 Results and evaluation
- Results for JRC Economics English
- Statistical term extraction
- Linguistic term extraction
- Results for JRC Economics Spanish
- Results for JRC Economics French
- Results for IULA Economics Spanish
- Results for IULA Health Spanish
- Results for TERMCAT Social Services Spanish
- Results for TERMCAT Social Services Catalan
- 4.4 Discussion
- 5. Conclusions and future work
- References