Chapter 5
Recent claims of human-machine parity in translation highlight core
issues surrounding the human evaluation of machine translation
In 2018, the first claims of empirical backing for
human-machine parity in translation (HMPT) emerged at the WMT18 Conference
on Machine Translation and in a study using WMT resources. Other researchers
quickly refuted these claims, pointing to a flawed human evaluation
campaign. Subsequent HMPT claims at WMT19 were also empirically refuted.
This chapter discusses how recommendations for the human evaluation of MT
have evolved in response to these HMPT claims, and assesses the possibility
of HMPT at WMT20 in light of those recommendations. Finally, we summarize
the criteria for human evaluation of MT proposed in the recent literature.
Article outline
- 1. Introduction
- 2. 2018: First claims of human-machine parity in translation
- 2.1 Human evaluation of MT: Metrics, metrics, metrics
- 2.2 Critics of Hassan et al.'s (2018) claims of HMPT
- 3. WMT19: Additional claims of HMPT or even machine super-performance
- 4. WMT20: Continued innovation and greater caution
- 5. Conclusion
- Notes
- References