Translation asymmetries of multiword expressions in machine translation: An analysis of the TED-MWE corpus

Monti, Johanna; Arcan, Mihael; Sangati, Federico

doi:10.1075/ivitra.24.02mon

Part of

Computational Phraseology
Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 23–42

Johanna Monti | Università degli Studi di Napoli “L’Orientale”

Mihael Arcan | Insight Centre for Data Analytics

Federico Sangati | Università degli Studi di Napoli “L’Orientale”

Machine Translation (MT) is now extensively used both as a tool to overcome language barriers on the internet and as a professional tool to translate technical documentation. The technology has evolved rapidly in recent years thanks to the availability of large amounts of data in digital format, in particular parallel corpora, which are used to train Statistical Machine Translation (SMT) tools. The quality of MT has improved considerably, but the translation of multiword expressions (MWEs) remains a major open challenge from both a theoretical and a practical point of view (Monti, 2013). We define an MWE as any group of two or more words or terms in a language lexicon that generally conveys a single meaning, such as the Italian expressions anima gemella (soul mate), carta di credito (credit card), acqua e sapone (water and soap), piovere a catinelle (rain cats and dogs). The persistent mistranslation of MWEs in MT output originates from their lexical, syntactic, semantic, pragmatic and also translational idiomaticity. Further research is therefore needed to achieve significant improvements in MT and translation technologies. In particular, it is important to develop resources, mainly MWE-annotated corpora, which can be used for both MT training and evaluation purposes (Monti and Todirascu, 2016).

This work focuses on the translation asymmetries between English and Italian MWEs and on how they affect SMT performance. By translation asymmetries we mean the differences which may occur between an MWE in a source language and its equivalent in the target language, as in many-to-many word translations (En. to be in a position to → It. essere in grado di), many-to-one (En. to set free → It. liberare) and finally one-to-many correspondences (En. overcooked → It. cotto troppo). This chapter describes the evaluation of mistranslations caused by translation asymmetries concerning multiword expressions detected in the TED-MWE corpus (http://tiny.cc/TED_MWE), which contains 1,500 sentences and 31,000 EN tokens. This corpus is a subset of the TED spoken corpus (Monti et al., 2015) annotated with all the MWEs detected during the evaluation process. The corpus contains the following information: (i) the English source text, (ii) the Italian human translations (from the parallel corpus), and (iii) the Italian SMT output. All the annotators were Italian native speakers with a good knowledge of the English language and with a background in linguistics and computational linguistics. Each was asked to identify all MWEs in the source text, together with their translations, in approximately 300 random sentences and to evaluate the correctness of the automatic translation. The identified MWEs and the evaluation of both the human and the machine translation are also recorded in the corpus.
This chapter will discuss (i) the related work concerning the impact of anisomorphism (the absence of an exact correspondence between words in two different languages) and the consequent translation asymmetries on MWE translation quality in MT, (ii) the corpus, (iii) the annotation guidelines, (iv) the methodology adopted during the annotation process (Monti et al., 2015), (v) the results of the annotation, and finally (vi) the evaluation of translation asymmetries in the corpus and ideas for future work.
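
The three kinds of translation asymmetry named above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each side of a pair is labelled "one" or "many" by counting whitespace-separated tokens; the function name `asymmetry_type` is hypothetical and this is not the annotation procedure used for the corpus.

```python
def asymmetry_type(source: str, target: str) -> str:
    """Label an expression pair as one-to-many, many-to-one, etc.

    Assumption: a side counts as "one" if it has exactly one
    whitespace-separated token, and as "many" otherwise.
    """
    def side(text: str) -> str:
        tokens = text.split()
        if not tokens:
            raise ValueError("expression must contain at least one token")
        return "one" if len(tokens) == 1 else "many"

    return f"{side(source)}-to-{side(target)}"


# The English-Italian pairs quoted in the abstract.
pairs = [
    ("to be in a position to", "essere in grado di"),
    ("to set free", "liberare"),
    ("overcooked", "cotto troppo"),
]

labels = [asymmetry_type(en, it) for en, it in pairs]

for (en, it), label in zip(pairs, labels):
    print(f"{en} -> {it}: {label}")
```

Token counting is only a rough proxy: it says nothing about whether either side is actually an MWE, which in the corpus is decided by the human annotators.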

Keywords: machine translation, translation asymmetries, multiword expressions, TED-MWE corpus

Article outline

1. Introduction
2. Related work
3. The TED-MWE corpus
4. The annotation guidelines
5. The annotation methodology
- Individual annotation
- Inter-annotation validation
- Evaluation
6. The results of the annotation process
7. Translation asymmetries and mistranslations in the TED-MWE corpus
8. Conclusions and future work
References

Published online: 8 May 2020

https://doi.org/10.1075/ivitra.24.02mon

References (41)

Arcan, M., Turchi, M., Tonelli, S., & Buitelaar, P. (2017). Leveraging bilingual terminology to improve machine translation in a CAT environment. Natural Language Engineering, 23(5), 763–788.

Barreiro, A. (2008). Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation (PhD Thesis, Universidade do Porto).

Bertoldi, N., Haddow, B., & Fouet, J.-B. (2009). Improved minimum error rate training in MOSES. The Prague Bulletin of Mathematical Linguistics, 91(1), 7–16.

Bertoldi, N., Cettolo, M., & Federico, M. (2013). Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation. In Proceedings of Machine Translation Summit XIV. Nice, France.

Bouamor, D., Semmar, N., & Zweigenbaum, P. (2012). Identifying bilingual multi-word expressions for statistical machine translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC). Istanbul, Turkey.

Boulaknadel, S., Daille, B., & Aboutajdine, D. (2008). A multi-word term extraction program for arabic language. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC). Marrakech, Morocco: European Language Resources Association (ELRA).

Brown, P. F., Della Pietra, V. J., Della Pietra, S. A., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.

Caseli, H., Villavicencio, A., Machado, A., & Finatto, M. J. (2009). Statistically-driven alignment-based multiword expression identification for technical domains. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (pp. 1–8). Singapore: Association for Computational Linguistics.

Cettolo, M., Girardi, C., & Federico, M. (2012). Wit3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT) (pp. 261–268). Trento, Italy.

Clark, J., Dyer, C., Lavie, A., & Smith, N. (2011). Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2 (pp. 176–181). Association for Computational Linguistics.

Constant, M., Eryiğit, G., Monti, J., Van Der Plas, L., Ramisch, C., Rosner, M., & Todirascu, A. (2017). Multiword expression processing: a survey. Computational Linguistics, 43(4), 837–892.

Dagan, I., & Church, K. (1994). Termight: Identifying and translating technical terminology. In Proceedings of the Fourth Conference on Applied Natural Language Processing (pp. 34–40). Association for Computational Linguistics.

Daille, B. (2001). Extraction de collocation à partir de textes. In TALN 2001 (Traitement automatique des langues naturelles).

Dorr, B. J. (1994). Machine translation divergences: A formal description and proposed solution. Computational Linguistics, 20(4), 597–663.

Federico, M., Bertoldi, N., & Cettolo, M. (2008). IRSTLM: an open source toolkit for handling large scale language models. In INTERSPEECH 2008, 9th Annual Conference of the International Speech Communication Association (pp. 1618–1621). Brisbane, Australia.

Fu, B., Brennan, R., & O’Sullivan, D. (2009). Cross-lingual ontology mapping–an investigation of the impact of machine translation. In A. Gómez-Pérez, Y. Ding, Y. Yong (Eds.), Asian Semantic Web Conference (pp. 1–15). Berlin/Heidelberg: Springer.

Kauffmann, A., & Azar, J. (2013). Structural Asymmetries in Machine Translation: The case of English-Japanese. (PhD Thesis, University of Geneva).

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL Companion Volume Proceedings of the Demo and Poster Sessions (pp. 177–180). Prague, Czech Republic: ACL.

Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1 (pp. 48–54). Association for Computational Linguistics.

Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit (pp. 79–86). Phuket, Thailand: AAMT.

Lambert, P., & Banchs, R. (2005). Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X (pp. 396–403).

Lin, S.-C., Wang, J.-C., & Wang, J.-F. (2005). Translation Divergence Analysis and Processing for Mandarin-English Parallel Text Exploitation. In Proceedings of 17th Conference on Computational Linguistics and speech Processing (ROCLING 2005). Tainan, Taiwan.

Losnegaard, G. S., Sangati, F., Parra Escartín, C., Savary, A., Bargmann, S., & Monti, J. (2016). PARSEME Survey on MWE Resources. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia.

Melamed, I. D. (1997). Automatic discovery of noncompositional compounds in parallel data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 97–108).

Moirón, B. V., & Tiedemann, J. (2006). Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL 2006 Workshop on Multi-word expressions in a multilingual context (pp. 33–40).

Monti, J., & Todirascu, A. (2016). Multiword units translation evaluation in machine translation: another pain in the neck? In G. Corpas Pastor, J. Monti, R. Mitkov, & V. Seretan (Eds.), Workshop proceedings for Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2015) (pp. 25–30). Geneva: Editions Tradulex.

Monti, J., Sangati, F., & Arcan, M. (2015). TED-MWE: a bilingual parallel corpus with MWE annotation. In C. Bosco, S. Tonelli, & F. M. Zanzotto (Eds.), Proceedings of the Second Italian Conference on Computational Linguistics (CLiC-it 2015) (pp. 193–197). Torino: Accademia University Press srl/Centro Altreitalie.

Monti, J. (2013). Processing in Machine Translation. Developing and using language resources for multi-word unit processing in Machine Translation (PhD Thesis in Linguistica Computazionale, Università degli Studi di Salerno, a.a. 2011–2012).

Monti, J. (2014). An English-Italian MWE dictionary. In Clic-it Proceedings 2014. Pisa University Press srl.

Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

Okita, T., & Way, A. (2010). Statistical Machine Translation with Terminology. In Proceedings of the First Symposium on Patent Information Processing (SPIP) (pp. 1–8).

Pause, P. E. (1997). Interlingual strategies in translation. In C. Hauenschild, & S. Heizmann (Eds.), Machine Translation and Translation Theory (pp. 175–190).

Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword Expressions: A Pain in the Neck for NLP. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Berlin/Heidelberg: Springer.

Sangati, F., & van Cranenburgh, A. (2015). Multiword Expression Identification with Recurring Tree Fragments and Association Measures. In Proceedings of Annual conference of North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (pp. 10–18). Denver, CO: Association for Computational Linguistics.

Savary, A., Sailer, M., Parmentier, Y., Rosner, M., Rosén, V., Przepiórkowski, A., Krstev, C., Vincze, V., Wójtowicz, B., Losnegaard, G. S., Parra Escartín, C., Waszczuk, J., Constant, M., Osenova, P., & Sangati, F. (2015). PARSEME – PARSing and Multiword Expressions within a European multilingual network. In 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2015). Poznań, Poland.

Seretan, V., & Wehrli, E. (2007). Collocation translation based on sentence alignment and parsing. In Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2007) (pp. 401–410). Toulouse, France.

Sinha, R. M. K., & Thakur, A. (2005). Translation divergence in English-Hindi MT. In Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT 2005) (pp. 245–254). Budapest, Hungary.

Thurmair, G., & Aleksić, V. (2012). Creating term and lexicon entries from phrase tables. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012).

Tsvetkov, Y., & Wintner, S. (2010). Extraction of multiword expressions from small parallel corpora. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 1256–1264). Association for Computational Linguistics.

Vintar, S., & Fišer, D. (2008). Harvesting multi-word expressions from parallel corpora. In LREC. European Language Resources Association.

Wu, C.-C., & Chang, J. S. (2004). Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses. Computational Linguistics and Chinese Language Processing, 9(1), 1–20.
