Strategies for building high quality bilingual lexicons from comparable corpora
This chapter outlines two strategies for automatically building bilingual dictionaries: one is based on a pivot language and existing bilingual dictionaries, while the other relies on string similarity and cognate extraction. Both strategies use translation equivalents extracted from comparable corpora to filter out odd bilingual pairs and validate the correct ones. The correctness of the entries validated with comparable corpora is very high, close to that achieved with parallel corpora. The chapter reports several case studies describing how to build new high-quality bilingual lexicons, namely English-Galician, English-Portuguese, and Portuguese-Spanish dictionaries, with more than 90% precision. This outperforms state-of-the-art systems for bilingual extraction from comparable corpora, whose best scores barely reach 70 to 80%.
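The two strategies and the shared validation step can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the toy word lists, the 0.8 similarity threshold, and the use of Python's `difflib` ratio as the string-similarity measure are all assumptions made for illustration, and the set of corpus-derived equivalents is simply given as input rather than extracted.

```python
from difflib import SequenceMatcher


def derive_by_transitivity(src_to_pivot, pivot_to_tgt):
    """Compose two dictionaries through a shared pivot language.

    Polysemous pivot words produce spurious pairs, which is why the
    candidates need to be pruned afterwards.
    """
    candidates = set()
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            for tgt in pivot_to_tgt.get(pivot, ()):
                candidates.add((src, tgt))
    return candidates


def cognate_candidates(src_words, tgt_words, threshold=0.8):
    """Pair words whose spelling similarity reaches the threshold.

    The threshold and the similarity measure are illustrative choices.
    """
    return {
        (s, t)
        for s in src_words
        for t in tgt_words
        if SequenceMatcher(None, s, t).ratio() >= threshold
    }


def validate(candidates, corpus_equivalents):
    """Keep only candidates also found among corpus-derived equivalents."""
    return candidates & corpus_equivalents


# Toy data (invented): Portuguese -> English (pivot) -> Spanish.
pt_en = {"primavera": ["spring"]}
en_es = {"spring": ["primavera", "muelle", "manantial"]}

# Hypothetical equivalents assumed to come from a comparable corpus.
corpus_pairs = {("primavera", "primavera"), ("universidade", "universidad")}

by_pivot = derive_by_transitivity(pt_en, en_es)
by_cognate = cognate_candidates(
    ["universidade", "esquisito"], ["universidad", "exquisito"]
)

print(validate(by_pivot, corpus_pairs))
print(validate(by_cognate, corpus_pairs))
```

In this toy run the polysemous pivot word "spring" generates three Spanish candidates for "primavera", and the false friends "esquisito"/"exquisito" pass the spelling threshold; both kinds of noise are discarded because they do not appear among the corpus-derived equivalents.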
Article outline
- 1. Introduction
- 2. Pruning lexicons built through transitivity
- 2.1 Basic assumptions
- 2.2 The pruning method
- 3. Pruning bilingual cognates
- 3.1 Basic assumptions
- 3.2 The pruning method
- 4. Experiments
- 4.1 Derivation by transitivity
- 4.2 Comparable corpora
- 4.2.1 Validation
- 4.2.2 Evaluation of the dictionaries generated by transitivity
- 4.3 Bilingual cognates
- 4.3.1 Existing resources
- 4.3.2 Size of the extracted lexicons
- 4.3.3 Evaluation of the cognate-based extraction
- 4.3.4 Error analysis
- 5. Conclusions and future work
- Acknowledgements
- Notes
- References