Automatic assessment of spoken-language interpreting based on machine-translation evaluation metrics
A multi-scenario exploratory study
Automated metrics for machine translation (MT) such as BLEU are customarily used because they are quick to compute and sufficiently valid to be useful in MT assessment. While the speed and reliability of these metrics derive from automatic computation based on predetermined algorithms, their validity depends primarily on a strong correlation with human assessments. Despite the popularity of such metrics in MT, little research has explored their usefulness in the automatic assessment of human translation or interpreting. In the present study, we therefore seek to provide initial insight into how MT metrics function in assessing spoken-language interpreting by human interpreters. Specifically, we selected five representative metrics – BLEU, NIST, METEOR, TER and BERT – to evaluate 56 bidirectional consecutive English–Chinese interpretations produced by 28 student interpreters of varying abilities. We correlated the automated metric scores with the scores assigned by different types of raters using different scoring methods (i.e., multiple assessment scenarios). The major finding is that BLEU, NIST and METEOR correlated moderately to strongly with the human-assigned scores across the assessment scenarios, especially in the English-to-Chinese direction. Finally, we discuss the possibilities and caveats of using MT metrics to assess human interpreting.
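To illustrate the workflow the abstract describes, the sketch below computes three of the five metrics over toy data and correlates them with hypothetical rater scores, using NLTK (Loper & Bird 2002) and SciPy. The segments, rater scores and parameter settings are invented for illustration and are not taken from the study; Chinese output would additionally require word segmentation or character-level tokenisation, and TER and BERT-based scores would come from separate packages such as sacrebleu and bert-score.

```python
# Minimal sketch (not the study's own pipeline): score transcribed
# interpretations against reference renditions, then rank-correlate the
# metric scores with human ratings.
# Requires: pip install nltk scipy, plus nltk.download('wordnet') for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.nist_score import sentence_nist
from nltk.translate.meteor_score import meteor_score
from scipy.stats import spearmanr

# One tokenised reference rendition; several interpreter outputs of the
# same segment, paired with hypothetical rater-assigned scores.
references = [["the", "economy", "grew", "rapidly", "last", "year"]]
hypotheses = [
    ["the", "economy", "grew", "rapidly", "last", "year"],  # near-verbatim
    ["the", "economy", "grew", "fast", "last", "year"],     # close rendition
    ["economy", "was", "growing", "quickly"],               # fragmentary
]
human_scores = [9.0, 7.5, 3.0]  # illustrative adequacy ratings

smooth = SmoothingFunction().method1  # avoid zero BLEU on short segments
bleu = [sentence_bleu(references, h, smoothing_function=smooth) for h in hypotheses]
nist = [sentence_nist(references, h, n=3) for h in hypotheses]  # low n for short toy segments
meteor = [meteor_score(references, h) for h in hypotheses]

# Rank correlation between each metric and the human-assigned scores.
for name, scores in [("BLEU", bleu), ("NIST", nist), ("METEOR", meteor)]:
    rho, p = spearmanr(scores, human_scores)
    print(f"{name}: Spearman's rho = {rho:.2f} (p = {p:.3f})")
```

In practice the study correlates segment- or task-level metric scores with scores from multiple rater types and scoring methods (see §4.5–4.6); the toy data here merely show the mechanics of that comparison.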
Article outline
- 1. Introduction
- 2. Human versus automatic assessment of interpreting
- 2.1 Human assessment: rater types and scoring methods
- 2.2 Automatic assessment: evaluation metrics of machine translation
- 2.2.1 An overview of evaluation metrics
- 2.2.2 An introduction to BLEU, NIST, METEOR, TER and BERT
- 2.3 Use of automated metrics in translation assessment
- 3. Research questions
- 4. Method
- 4.1 Interpreting samples
- 4.2 Human raters
- 4.3 Scoring methods
- 4.4 Analysis of the human-assigned scores
- 4.5 Computation of evaluation metrics
- 4.6 Data analysis
- 5. Results
- 5.1 Inter-metric and inter-scenario correlation
- 5.2 Overall correlations between metric scores and human-assigned scores
- 5.3 Correlations based on metrics, rater types and scoring methods
- 6. Discussion
- 7. Conclusion
- Notes
- References

References
Banerjee, S. & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72. [URL]
Callison-Burch, C., Osborne, M. & Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, 249–256. [URL]
Chen, J., Yang, H-B. & Han, C. (2021). Holistic versus analytic scoring of spoken-language interpreting: A multi-perspectival comparative analysis. Manuscript submitted for publication.
Christodoulides, G. & Lenglet, C. (2014). Prosodic correlates of perceived quality and fluency in simultaneous interpreting. In N. Campbell, D. Gibbon & D. Hirst (Eds.), Proceedings of the 7th Speech Prosody Conference, 1002–1006. [URL]
Chung, H-Y. (2020). Automatic evaluation of human translation: BLEU vs. METEOR. Lebende Sprachen 65(1), 181–205.
Coughlin, D. (2003). Correlating automated and human assessments of machine translation quality. [URL]
Devlin, J., Chang, M-W., Lee, K. & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186. [URL]
Doddington, G. (2002). Automatic evaluation of machine translation quality using N-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research, 138–145.
Ginther, A., Dimova, S. & Yang, R. (2010). Conceptual and empirical relationships between temporal measures of fluency and oral English proficiency with implications for automated scoring. Language Testing 27(3), 379–399.
Han, C. (2022a). Interpreting testing and assessment: A state-of-the-art review. Language Testing 39(1), 30–55.
Han, C. & Lu, X-L. (2021a). Interpreting quality assessment re-imagined: The synergy between human and machine scoring. Interpreting and Society 1(1), 70–90.
Han, C. & Lu, X-L. (2021b). Can automated machine translation evaluation metrics be used to assess students’ interpretation in the language learning classroom? Computer Assisted Language Learning, 1–24.
Han, C. & Xiao, X-Y. (2021). A comparative judgment approach to assessing Chinese Sign Language interpreting. Language Testing, 1–24.
Han, C., Hu, J. & Deng, Y. (forthcoming). Effects of language background and directionality on raters’ assessments of spoken-language interpreting: An exploratory experimental study. Revista Española de Lingüística Aplicada.
Han, C., Chen, S-J., Fu, R-B. & Fan, Q.
International School of Linguists (2020). Diploma in Public Service Interpreting learner handbook. London, UK. [URL]
Koehn, P. (2010). Statistical machine translation. Cambridge: Cambridge University Press.
Le, N-T., Lecouteux, B. & Besacier, L. (2018). Automatic quality estimation for speech translation using joint ASR and MT features. Machine Translation 32(4), 325–351.
Lee, J. (2008). Rating scales for interpreting performance assessment. The Interpreter and Translator Trainer 2(2), 165–184.
Liu, M-H. (2013). Design and analysis of Taiwan’s interpretation certification examination. In D. Tsagari & R. van Deemter (Eds.), Assessment issues in language translation and interpreting. Frankfurt: Peter Lang, 163–178.
Liu, Y-M. (2021). Exploring a corpus-based approach to assessing interpreting quality. In J. Chen & C. Han (Eds.), Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 159–178.
Loper, E. & Bird, S. (2002). NLTK: The Natural Language Toolkit. Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 63–70.
Mathur, N., Wei, J., Freitag, M., Ma, Q-S. & Bojar, O. (2020). Results of the WMT20 metrics shared task. Proceedings of the Fifth Conference on Machine Translation, 688–725. [URL]
NAATI (2019). Certified interpreter test assessment rubrics. [URL]
Ouyang, L-W., Lv, Q-X. & Liang, J-Y. (2021). Coh-Metrix model-based automatic assessment of interpreting quality. In J. Chen & C. Han (Eds.), Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 179–200.
Papineni, K., Roukos, S., Ward, T. & Zhu, W-J. (2002). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. [URL]
Reiter, E. (2018). A structured review of the validity of BLEU. Computational Linguistics 44(3), 393–401.
Sellam, T., Das, D. & Parikh, A. P. (2020). BLEURT: Learning robust metrics for text generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7881–7892. [URL]
Snover, M., Dorr, B., Schwartz, R., Micciulla, L. & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, 223–231. [URL]
Stewart, C., Vogler, N., Hu, J-J., Boyd-Graber, J. & Neubig, G. (2018). Automatic estimation of simultaneous interpreter performance. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. [URL]
Su, W. (2019). Exploring native English teachers’ and native Chinese teachers’ assessment of interpreting. Language and Education 33, 577–594.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C-W., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q. & Rush, A. (2020). Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. [URL]
Wu, S. C. (2010). Assessing simultaneous interpreting: A study on test reliability and examiners’ assessment behavior. [URL]
Wu, Z-W. (2021). Chasing the unicorn? The feasibility of automatic assessment of interpreting fluency. In J. Chen & C. Han (Eds.), Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 143–158.
Yang, L-Y. (2015). An exploratory study of fluency in English output of Chinese consecutive interpreting learners. Journal of Zhejiang International Studies University 1, 60–68.
Yu, W-T. & van Heuven, V. J.
Zhang, M. (2013). Contrasting automated and human scoring of essays. R&D Connections 21. [URL]
Cited by 1 other publication
Wang, Xiaoman & Lu Yuan (2023). Machine-learning based automatic assessment of communication in interpreting. Frontiers in Communication 8.
This list is based on CrossRef data as of 4 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.