Can a corpus-driven lexical analysis of human and machine translation unveil discourse features that set them apart?
There is still much to learn about the ways in which human and machine translation differ with regard to the contexts that regulate the production and interpretation of discourse. The present study explores whether a corpus-driven lexical analysis of human and machine translation can unveil discourse features that set the two apart. A balanced corpus of source texts aligned with authentic, professional translations and neural machine translations was compiled for the study. Lexical discrepancies between the two translation corpora were then extracted via a corpus-driven keyword analysis and examined qualitatively through parallel concordances of source texts aligned with their human and machine translations. The study shows that keyword analysis not only confirms known discourse problems in machine translation, such as lexical inconsistency and pronoun resolution, but can also provide valuable insights into contextual aspects of translated discourse that deserve further research.
Keywords: machine translation, MT, professional translation, discourse, parallel corpora, keyword analysis
- 4.1 Grammatical keywords
- 4.2 Lexical keywords
- 4.2.2 Proper names
- 4.2.3 Foreign words
- 5. Discussion and conclusion
Available under the Creative Commons Attribution (CC BY) 4.0 license.
For any use beyond this license, please contact the publisher at email@example.com.
Published online: 08 September 2021