Strategies for building high quality bilingual lexicons from comparable corpora
This chapter outlines two strategies for automatically building bilingual dictionaries: one is based on a pivot language and existing bilingual dictionaries, while the other relies on string similarity and cognate extraction. Both strategies use translation equivalents extracted from comparable corpora to filter out odd bilingual pairs and to validate the correct ones. The correctness of the entries validated with comparable corpora is very high, close to that achieved with parallel corpora. The chapter reports several case studies describing how to build new high-quality bilingual lexicons, namely English-Galician, English-Portuguese, and Portuguese-Spanish dictionaries, with more than 90% precision. This outperforms state-of-the-art systems for bilingual extraction from comparable corpora, whose best scores barely reach 70% to 80%.
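The two strategies summarised above can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the toy dictionary entries, the 0.8 similarity threshold, and the use of Python's `difflib.SequenceMatcher` as the string-similarity measure are all assumptions made for illustration, and validation is simplified to a set intersection with pairs assumed to have been extracted from a comparable corpus.

```python
from difflib import SequenceMatcher


def derive_by_transitivity(src_to_pivot, pivot_to_tgt):
    """Pair each source word with every target word reachable via a shared pivot word."""
    candidates = set()
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            for tgt in pivot_to_tgt.get(pivot, ()):
                candidates.add((src, tgt))
    return candidates


def cognate_candidates(src_words, tgt_words, threshold=0.8):
    """Pair words whose spelling similarity reaches the (hypothetical) threshold."""
    return {
        (s, t)
        for s in src_words
        for t in tgt_words
        if SequenceMatcher(None, s, t).ratio() >= threshold
    }


def validate(candidates, corpus_equivalents):
    """Keep only candidate pairs that the comparable-corpus extraction also proposed."""
    return candidates & corpus_equivalents


# Toy data, invented for illustration only (English -> Spanish pivot -> Galician).
en_es = {"spring": {"primavera", "muelle"}}
es_gl = {"primavera": {"primavera"}, "muelle": {"resorte", "peirao"}}

# Pairs assumed to come from a separate comparable-corpus extraction step.
corpus_pairs = {("spring", "primavera"), ("spring", "resorte")}

noisy = derive_by_transitivity(en_es, es_gl)
pruned = validate(noisy, corpus_pairs)

# Cognate candidates between two small, invented Portuguese and Spanish word lists.
cognates = cognate_candidates(["universidade", "mesa"], ["universidad", "ventana"])
```

In this toy run, transitivity over-generates because the pivot word is ambiguous, and the corpus-derived pairs prune the candidate that the corpus does not support. The same `validate` step could be applied to the cognate candidates.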
Article outline
- 1. Introduction
- 2. Pruning lexicons built through transitivity
- 2.1 Basic assumptions
- 2.2 The pruning method
- 3. Pruning bilingual cognates
- 3.1 Basic assumptions
- 3.2 The pruning method
- 4. Experiments
- 4.1 Derivation by transitivity
- 4.2 Comparable corpora
- 4.2.1 Validation
- 4.2.2 Evaluation of the dictionaries generated by transitivity
- 4.3 Bilingual cognates
- 4.3.1 Existing resources
- 4.3.2 Size of the extracted lexicons
- 4.3.3 Evaluation of the cognate-based extraction
- 4.3.4 Error analysis
- 5. Conclusions and future work
- Acknowledgements
- Notes
- References