Bilingual term recognition revisited: The bag-of-equivalents term alignment approach and its evaluation

Vintar, Spela

doi:10.1075/term.16.2.01vin

Article published in: Terminology, Vol. 16:2 (2010), pp. 141–158

The paper describes LUIZ, a bilingual term recognition system developed for the Slovene-English language pair. The system is a hybrid term extractor that uses morphosyntactic patterns and statistical ranking to propose domain-specific expressions for each of the two languages, whereupon translation equivalents between the languages are identified using the innovative bag-of-equivalents approach. This simple but effective method uses the Twente word aligner to obtain a lexicon of single-word translation pairs and their probability scores, which is then used to identify correspondences between multi-word terms. The bilingual term recognition system has been tested and evaluated on three parallel subcorpora from the tourism, accounting and military domains. Average precision of the term alignment component is 0.83, with only fully equivalent and domain-relevant terms counted as positives. Another advantage of the described approach is that it successfully detects term variants and multiple translations of a candidate multi-word term. Since our term alignment method does not require sentence-aligned corpora, it can also be used with comparable corpora, provided we already have a domain-specific lexicon or dictionary of single-word correspondences. The paper concludes with some thoughts on the users of term recognition systems and their needs, based on our observations from the online version of the system.

Keywords: term alignment, ATR evaluation, bilingual term recognition, parallel corpora, word alignment, comparable corpora
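
The bag-of-equivalents idea summarised in the abstract can be illustrated with a minimal Python sketch. It is not the LUIZ implementation: the abstract does not give the scoring formula, so the averaging score, the `threshold` value, the function names and the toy lexicon entries below are all assumptions made for illustration. The sketch also skips lemmatisation, which a real system for an inflected language such as Slovene would presumably need.

```python
from collections import defaultdict

# Hypothetical single-word lexicon: source word -> {target word: probability}.
# In the paper such a lexicon is obtained with a word aligner; these entries
# and their scores are invented for the example.
LEXICON = {
    "davčna": {"tax": 0.9, "fiscal": 0.1},
    "osnova": {"base": 0.7, "basis": 0.3},
    "letno": {"annual": 0.8, "yearly": 0.2},
    "poročilo": {"report": 0.9, "statement": 0.1},
}


def bag_of_equivalents(source_term, lexicon):
    """Pool the known target-language equivalents of every word in the term."""
    bag = defaultdict(float)
    for word in source_term.lower().split():
        for target_word, prob in lexicon.get(word, {}).items():
            # Keep the best probability if several source words share an equivalent.
            bag[target_word] = max(bag[target_word], prob)
    return dict(bag)


def score(candidate, bag):
    """Assumed score: mean lexicon probability of the candidate's words.

    Words that are not in the bag contribute 0, so partial matches are penalised.
    """
    words = candidate.lower().split()
    if not words:
        return 0.0
    return sum(bag.get(w, 0.0) for w in words) / len(words)


def align(source_term, target_candidates, lexicon, threshold=0.5):
    """Return all target candidates scoring at or above the threshold, best first."""
    bag = bag_of_equivalents(source_term, lexicon)
    scored = [(cand, score(cand, bag)) for cand in target_candidates]
    kept = [(cand, s) for cand, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    candidates = ["annual report", "tax basis", "tax base"]
    for cand, s in align("davčna osnova", candidates, LEXICON):
        print(f"{cand}\t{s:.2f}")
```

With these toy values, "tax base" ranks above "tax basis" and "annual report" is filtered out. Returning every candidate above the threshold, rather than only the best one, mirrors the abstract's point about detecting term variants and multiple translations. Because the function only needs two lists of term candidates and a lexicon, nothing in it depends on sentence alignment.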

Published online: 3 December 2010
