Rater-mediated performance assessment (RMPA) is a critical component of interpreter certification testing systems worldwide. Given
the acknowledged rater variability in RMPA and the high-stakes nature of certification testing, it is crucial to ensure rater
reliability in interpreter certification performance testing (ICPT). However, a review of current ICPT practice indicates that
rigorous research on rater reliability is lacking. Against this background, the present study reports on the use of multifaceted Rasch
measurement (MFRM) to identify the degree of severity/leniency in different raters’ assessments of simultaneous interpretations
(SIs) by 32 interpreters in an experimental setting. Nine raters specifically trained for the purpose were asked to evaluate four
English-to-Chinese SIs by each of the interpreters, using three 8-point rating scales (information content, fluency, expression).
The source texts differed in speed and in the speaker’s accent (native vs non-native). Rater-generated scores were then subjected
to MFRM analysis, using the FACETS program. The following general trends emerged: 1) homogeneity statistics showed that not all
raters were equally severe overall; and 2) bias analyses showed that a relatively large proportion of the raters had significantly
biased interactions with the interpreters and the assessment criteria. Implications for practical rating arrangements in ICPT, and
for rater training, are discussed.