Chapter published in:
Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 251–265
Strategies for building high quality bilingual lexicons from comparable corpora
Pablo Gamallo | University of Santiago de Compostela
This chapter outlines two strategies for automatically building bilingual dictionaries: one is based on a pivot language and existing bilingual dictionaries, while the other relies on string similarity and cognate extraction. Both strategies use translation equivalents extracted from comparable corpora to filter out odd bilingual pairs and validate the correct ones. The correctness of the entries validated with comparable corpora is very high, close to that achieved with parallel corpora. The chapter reports several case studies describing how to build new high-quality bilingual lexicons, namely English-Galician, English-Portuguese, and Portuguese-Spanish dictionaries, with more than 90% precision. This outperforms state-of-the-art systems for bilingual extraction from comparable corpora, whose best scores hardly reach 70 to 80%.
Keywords: comparable corpora, extraction of translation candidates, bilingual lexicons, distributional similarity, cognates
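
The two strategies summarised in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the toy dictionary entries, the 0.8 similarity threshold, and the use of Python's difflib as the string-similarity measure are all assumptions made for illustration, and the set of corpus-derived pairs is simply given here rather than extracted from comparable corpora.

```python
from difflib import SequenceMatcher


def derive_by_transitivity(src_pivot, pivot_tgt):
    """Compose two bilingual dictionaries through a shared pivot language."""
    candidates = set()
    for src_word, pivot_words in src_pivot.items():
        for pivot_word in pivot_words:
            for tgt_word in pivot_tgt.get(pivot_word, ()):
                candidates.add((src_word, tgt_word))
    return candidates


def cognate_candidates(src_vocab, tgt_vocab, threshold=0.8):
    """Pair words whose spelling similarity reaches the (assumed) threshold."""
    return {
        (s, t)
        for s in src_vocab
        for t in tgt_vocab
        if SequenceMatcher(None, s, t).ratio() >= threshold
    }


def validate(candidates, corpus_pairs):
    """Keep only candidates that also appear among corpus-derived equivalents."""
    return candidates & corpus_pairs


# Toy data, invented for illustration only.
# Strategy 1: English -> Spanish (pivot) -> Galician.
en_es = {"spring": ["primavera", "muelle"]}
es_gl = {"primavera": ["primavera"], "muelle": ["resorte", "peirao"]}
corpus_en_gl = {("spring", "primavera"), ("spring", "resorte")}

derived = derive_by_transitivity(en_es, es_gl)
validated_pivot = validate(derived, corpus_en_gl)

# Strategy 2: Portuguese-Spanish cognates by string similarity.
pt_vocab = {"universidade", "polvo"}
es_vocab = {"universidad", "polvo"}
corpus_pt_es = {("universidade", "universidad"), ("polvo", "pulpo")}

cognates = cognate_candidates(pt_vocab, es_vocab)
validated_cognates = validate(cognates, corpus_pt_es)
```

In this toy run, transitivity through the ambiguous pivot word "muelle" produces the odd pair ("spring", "peirao"), and string similarity proposes the false friend ("polvo", "polvo"); both are discarded because neither appears among the corpus-derived pairs.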
Article outline
- 1. Introduction
- 2. Pruning lexicons built through transitivity
- 2.1 Basic assumptions
- 2.2 The pruning method
- 3. Pruning bilingual cognates
- 3.1 Basic assumptions
- 3.2 The pruning method
- 4. Experiments
- 4.1 Derivation by transitivity
- 4.2 Comparable corpora
- 4.2.1 Validation
- 4.2.2 Evaluation of the dictionaries generated by transitivity
- 4.3 Bilingual cognates
- 4.3.1 Existing resources
- 4.3.2 Size of the extracted lexicons
- 4.3.3 Evaluation of the cognate-based extraction
- 4.3.4 Error analysis
- 5. Conclusions and future work
- Acknowledgements
- Notes
- References
Published online: 20 March 2019
https://doi.org/10.1075/scl.90.15gam