Vol. 25:1 (2023), pp. 109–143
Automatic assessment of spoken-language interpreting based on machine-translation evaluation metrics
A multi-scenario exploratory study
Automated metrics for machine translation (MT) such as BLEU are customarily used because they are quick to compute and sufficiently valid to be useful in MT assessment. While the speed and reliability of such metrics come from automatic computation based on predetermined algorithms, their validity depends primarily on a strong correlation with human assessments. Despite the popularity of these metrics in MT, little research has explored their usefulness for the automatic assessment of human translation or interpreting. In the present study, we therefore seek to provide initial insight into how MT metrics function in assessing spoken-language interpreting by human interpreters. Specifically, we selected five representative metrics – BLEU, NIST, METEOR, TER and BERT – to evaluate 56 bidirectional consecutive English–Chinese interpretations produced by 28 student interpreters of varying abilities. We correlated the automated metric scores with the scores assigned by different types of raters using different scoring methods (i.e., multiple assessment scenarios). The major finding is that BLEU, NIST and METEOR correlated moderately to strongly with the human-assigned scores across the assessment scenarios, especially in the English-to-Chinese direction. Finally, we discuss the possibilities and caveats of using MT metrics in assessing human interpreting.
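As a concrete illustration of the metric-to-human correlation procedure the abstract describes, the Python sketch below computes segment-level BLEU with sacrebleu and correlates the resulting scores with human ratings. This is a minimal sketch, not the authors' actual pipeline: the example sentences and ratings are hypothetical, only BLEU is shown (the study also uses NIST, METEOR, TER and BERT), and Spearman's rho is one plausible choice of correlation statistic.

```python
# Minimal sketch of metric-human correlation, assuming sacrebleu and scipy
# are installed (pip install sacrebleu scipy). All data are hypothetical.
import sacrebleu
from scipy.stats import spearmanr

# Hypothetical renditions of one source segment by four interpreters,
# a single reference rendition, and the corresponding human ratings.
hypotheses = [
    "The delegates discussed the new trade agreement in detail.",
    "The delegates talked about the new trade deal at length.",
    "Delegates discussed trade agreement.",
    "They spoke about something related to trade.",
]
reference = "The delegates discussed the new trade agreement in detail."
human_scores = [6.0, 5.0, 3.0, 1.5]  # e.g., rater-assigned fidelity scores

# Segment-level BLEU for each rendition against the reference.
# (For Chinese output, sacrebleu accepts tokenize="zh".)
bleu_scores = [
    sacrebleu.sentence_bleu(hyp, [reference]).score for hyp in hypotheses
]

# Spearman's rho between metric scores and human judgments: the kind of
# metric-human correlation examined per metric and assessment scenario.
rho, p_value = spearmanr(bleu_scores, human_scores)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```

In a corpus-scale version of this sketch, the same loop would run over all interpretations in each direction, and the correlation would be computed once per metric and per assessment scenario.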
Article outline
- 1. Introduction
- 2. Human versus automatic assessment of interpreting
- 2.1 Human assessment: rater types and scoring methods
- 2.2 Automatic assessment: evaluation metrics of machine translation
- 2.2.1 An overview of evaluation metrics
- 2.2.2 An introduction to BLEU, NIST, METEOR, TER and BERT
- 2.3 Use of automated metrics in translation assessment
- 3. Research questions
- 4. Method
- 4.1 Interpreting samples
- 4.2 Human raters
- 4.3 Scoring methods
- 4.4 Analysis of the human-assigned scores
- 4.5 Computation of evaluation metrics
- 4.6 Data analysis
- 5. Results
- 5.1 Inter-metric and inter-scenario correlation
- 5.2 Overall correlations between metric scores and human-assigned scores
- 5.3 Correlations based on metrics, rater types and scoring methods
- 6. Discussion
- 7. Conclusion
- Notes
- References
https://doi.org/10.1075/intp.00076.lu