An ensemble learning approach to bilingual term extraction and alignment
This paper describes TermEnsembler, a bilingual term extraction and alignment system that uses a novel ensemble learning approach to bilingual term alignment. Processing starts with monolingual term extraction from a language-industry-standard file type containing aligned English and Slovenian texts. The two resulting term lists are then automatically aligned using an ensemble of seven bilingual alignment methods, which are first executed separately and then merged using weights learned with an evolutionary algorithm. In the experiments, the weights were learned on one domain and tested on two other domains. When evaluated on the top 400 aligned term pairs, the precision of term alignment is over 96%, while the share of correctly aligned multi-word unit terms exceeds 30%.
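The merging step described above can be illustrated with a minimal sketch. It is not the TermEnsembler implementation: the two toy methods, their scores, the function names, and the simple elitist mutation loop are all assumptions made for illustration (the system itself combines seven methods). Each method assigns a score to candidate term pairs, the scores are combined as a weighted sum, and the weights are searched for with a small evolutionary loop whose fitness is precision on the top k merged pairs.

```python
import random

def merge_scores(method_scores, weights):
    """Weighted sum of per-method scores for every candidate term pair."""
    merged = {}
    for scores, weight in zip(method_scores, weights):
        for pair, score in scores.items():
            merged[pair] = merged.get(pair, 0.0) + weight * score
    return merged

def top_k(merged, k):
    """The k best pairs, by descending score (ties broken alphabetically)."""
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in ranked[:k]]

def precision_at_k(merged, gold, k):
    """Fraction of the top-k merged pairs that appear in the gold standard."""
    ranked = top_k(merged, k)
    if not ranked:
        return 0.0
    return sum(1 for pair in ranked if pair in gold) / len(ranked)

def learn_weights(method_scores, gold, k, generations=50, pop_size=20,
                  sigma=0.2, seed=0):
    """Hypothetical elitist evolutionary search over non-negative weights."""
    rng = random.Random(seed)
    n = len(method_scores)

    def fitness(weights):
        return precision_at_k(merge_scores(method_scores, weights), gold, k)

    # Start from uniform weights plus random candidates, so the result is
    # never worse than the unweighted merge on the training data.
    population = [[1.0] * n]
    population += [[rng.random() for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # the best half survives
        children = [
            [max(0.0, gene + rng.gauss(0.0, sigma)) for gene in rng.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

# Toy data (invented): English-Slovenian candidate pairs scored by two methods.
gold = {("term", "izraz"), ("corpus", "korpus")}
noisy_method = {("term", "korpus"): 1.0, ("corpus", "izraz"): 0.75,
                ("term", "izraz"): 0.25, ("corpus", "korpus"): 0.25}
good_method = {("term", "izraz"): 1.0, ("corpus", "korpus"): 0.75,
               ("term", "korpus"): 0.25}
methods = [noisy_method, good_method]

uniform = precision_at_k(merge_scores(methods, [1.0, 1.0]), gold, k=2)
learned_weights = learn_weights(methods, gold, k=2)
learned = precision_at_k(merge_scores(methods, learned_weights), gold, k=2)
```

In this sketch the weights are fitted and evaluated on the same toy data. The experiments described above instead learn the weights on one domain and test them on two others, which would correspond to calling `learn_weights` on one data set and `precision_at_k` on different ones.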
Keywords: bilingual terminology alignment, terminology extraction, ensemble learning, evolutionary algorithm
Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.
Published online: 24 July 2019