Creating a test corpus for term extractors through term annotation

Bernier-Colborne, Gabriel; Drouin, Patrick

doi:10.1075/term.20.1.03ber

Article published in:

Terminology
Vol. 20:1 (2014), pp. 50–73

In this paper, we describe a methodology used to create a test corpus for the evaluation of term extractors. This methodology relies on term annotation: terms in a corpus on automotive engineering are selected based on specific criteria pertaining to the terminological setting as well as linguistic and formal properties of terms and term variations. The test corpus accounts for the variety of ways in which terms are realized in running text, and provides a means of automatically evaluating the relevance of term candidate lists produced by term extractors. Due to the XML annotation scheme used, the corpus can be customized, e.g. by filtering out some of the annotated terms based on the type of term or term variation, or frequency. In this paper, we focus on the methodological aspects of this work.
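The customization and automatic evaluation mentioned in the abstract can be illustrated with a minimal sketch. The annotation scheme itself is not reproduced on this page, so the `<term>` element, its `lemma` and `type` attributes, the sample sentences, and the function names below are all hypothetical; the sketch only shows how an XML-annotated corpus could be filtered by term type or frequency and then used to score a list of term candidates.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotated sample: element and attribute names are assumptions,
# not the scheme actually used for the test corpus.
SAMPLE = (
    "<corpus>"
    '<s>The <term lemma="brake pad" type="base">brake pads</term> grip the '
    '<term lemma="brake disc" type="base">brake disc</term>.</s>'
    '<s>Worn <term lemma="brake pad" type="variant">pads</term> can score the '
    '<term lemma="brake disc" type="base">brake disc</term>.</s>'
    '<s>The <term lemma="caliper" type="base">caliper</term> holds each '
    '<term lemma="brake pad" type="base">brake pad</term>.</s>'
    "</corpus>"
)


def reference_terms(xml_text, allowed_types=None, min_freq=1):
    """Build a reference term list from the annotations.

    Annotations whose type is not in allowed_types are ignored (None keeps
    every type); terms annotated fewer than min_freq times are dropped.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter("term"):
        if allowed_types is None or element.get("type") in allowed_types:
            counts[element.get("lemma")] += 1
    return {lemma for lemma, n in counts.items() if n >= min_freq}


def precision_recall(candidates, reference):
    """Score a term candidate list against the reference term list."""
    candidates = set(candidates)
    hits = candidates & reference
    precision = len(hits) / len(candidates) if candidates else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    full_reference = reference_terms(SAMPLE)
    frequent_base = reference_terms(SAMPLE, allowed_types={"base"}, min_freq=2)
    extractor_output = ["brake pad", "caliper", "wear"]  # made-up candidates
    print(sorted(full_reference))
    print(sorted(frequent_base))
    print(precision_recall(extractor_output, full_reference))
```

Deriving the reference list from the annotations at evaluation time, rather than storing a fixed list, is what would let the same corpus serve several evaluation settings, for example one that ignores term variants or rare terms.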

Keywords: term extractor evaluation, corpus annotation, test corpus, term extraction, evaluation, term variation, terminological variation

Published online: 25 April 2014
