What matters more: The size of the corpora or their quality? The case of automatic translation of multiword expressions using comparable corpora

Mitkov, Ruslan; Taslimipoor, Shiva

doi:10.1075/ivitra.24.09mit

Part of

Computational Phraseology
Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 177–188

Ruslan Mitkov | University of Wolverhampton

Shiva Taslimipoor | University of Wolverhampton

This study compares the impact of the size of comparable corpora with that of their similarity (quality) on a specific task: extracting translation equivalents of verb-noun collocations from such corpora. A comprehensive evaluation of different configurations of English and Spanish corpora sheds some light on the more general, perennial question: what matters more, the quantity or the quality of corpora?

Keywords: multiword expressions, automatic translation, comparable corpora, size of corpora, vector representations

Article outline

1. Rationale
2. Our methodology for translating multiword expressions
3. Data and experiments
- 3.1 Comparable corpora
- 3.2 Data
- 3.3 Vector representations
- 3.4 Gold standard
4. Comparable corpora and translation of MWEs: Size vs. quality
5. Conclusion
Notes
References

Published online: 8 May 2020
