Strategies for building high quality bilingual lexicons from comparable corpora

Gamallo, Pablo

doi:10.1075/scl.90.15gam

Part of

Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 251–265

Pablo Gamallo | University of Santiago de Compostela

This chapter outlines two strategies for automatically building bilingual dictionaries: one is based on a pivot language and existing bilingual dictionaries, while the other relies on string similarity and cognate extraction. Both strategies use translation equivalents extracted from comparable corpora to filter out spurious bilingual pairs and validate the correct ones. The correctness of the entries validated with comparable corpora is very high, close to that achieved with parallel corpora. The chapter reports several case studies describing how to build new high-quality bilingual lexicons, namely English-Galician, English-Portuguese, and Portuguese-Spanish dictionaries, with more than 90% precision. This outperforms state-of-the-art systems for bilingual lexicon extraction from comparable corpora, whose best scores barely reach 70–80%.
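
The two strategies and the shared validation step can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the function names, the 0.8 threshold, and the toy word lists are assumptions made for illustration, and `difflib.SequenceMatcher` stands in for whatever string-similarity measure the chapter actually uses. The corpus-derived equivalents are simply given as a set here, since the abstract does not describe how they are extracted.

```python
from difflib import SequenceMatcher


def derive_by_transitivity(src_to_pivot, pivot_to_tgt):
    """Pair each source word with every target word reachable via a pivot word."""
    candidates = set()
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            for tgt in pivot_to_tgt.get(pivot, ()):
                candidates.add((src, tgt))
    return candidates


def cognate_candidates(src_words, tgt_words, threshold=0.8):
    """Pair words whose spelling similarity reaches the threshold (assumed value)."""
    return {
        (s, t)
        for s in src_words
        for t in tgt_words
        if SequenceMatcher(None, s, t).ratio() >= threshold
    }


def validate(candidates, corpus_equivalents):
    """Keep only candidates that the comparable corpora also propose."""
    return candidates & corpus_equivalents


# Toy data, invented for illustration (English -> Spanish pivot -> Portuguese).
en_es = {"spring": ["muelle", "primavera"]}
es_pt = {"muelle": ["mola", "cais"], "primavera": ["primavera"]}
# Hypothetical equivalents assumed to come from comparable corpora.
corpus_en_pt = {("spring", "primavera"), ("spring", "mola")}

pivot_pairs = derive_by_transitivity(en_es, es_pt)
print(sorted(validate(pivot_pairs, corpus_en_pt)))

# Toy Portuguese-Spanish word lists; "esquisito"/"exquisito" are false friends.
pt_words = ["universidade", "esquisito", "casa"]
es_words = ["universidad", "exquisito", "perro"]
corpus_pt_es = {("universidade", "universidad")}

cognates = cognate_candidates(pt_words, es_words)
print(sorted(validate(cognates, corpus_pt_es)))
```

In this toy run, transitivity through the ambiguous pivot word "muelle" produces the wrong pair ("spring", "cais"), and spelling similarity proposes the false friends "esquisito" and "exquisito". Both are discarded because the assumed corpus-derived equivalents do not contain them, which is the filtering role the abstract assigns to comparable corpora.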

Keywords: comparable corpora, extraction of translation candidates, bilingual lexicons, distributional similarity, cognates

Article outline

1. Introduction
2. Pruning lexicons built through transitivity
- 2.1 Basic assumptions
- 2.2 The pruning method
3. Pruning bilingual cognates
- 3.1 Basic assumptions
- 3.2 The pruning method
4. Experiments
- 4.1 Derivation by transitivity
- 4.2 Comparable corpora
  - 4.2.1 Validation
  - 4.2.2 Evaluation of the dictionaries generated by transitivity
- 4.3 Bilingual cognates
  - 4.3.1 Existing resources
  - 4.3.2 Size of the extracted lexicons
  - 4.3.3 Evaluation of the cognate-based extraction
  - 4.3.4 Error analysis
5. Conclusions and future work
Acknowledgements
Notes
References

Published online: 20 March 2019
