Extracting bilingual terms from the Web
This paper makes two contributions. First, we describe BiTES (Bilingual Term Extraction System), a multi-component system designed to automatically gather domain-specific bilingual term pairs from Web data. Its components comprise data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligners. BiTES is readily extendable to new language pairs and has been successfully used to gather bilingual terminology for 24 language pairs, including English and all official EU languages, save Irish. Second, we describe a novel set of methods for evaluating the main components of BiTES and present the results of our evaluation for six language pairs. The results show that the BiTES approach can successfully harvest quality bilingual term pairs from the Web. Our evaluation method delivers significant insights into the strengths and weaknesses of our techniques. It can be straightforwardly reused to evaluate other bilingual term extraction systems, and so contributes to the study of how such systems should be evaluated.