An item-based, Rasch-calibrated approach to assessing translation quality


Item-based scoring has been advocated as a psychometrically robust approach to translation quality assessment, outperforming traditional neo-hermeneutic and error analysis methods. The past decade has witnessed a succession of item-based scoring methods being developed and trialed, ranging from calibration of dichotomous items to preselected item evaluation. Despite this progress, these methods seem to be undermined by several limitations, such as the inability to accommodate the multifaceted reality of translation quality assessment and inconsistent item calibration procedures. Against this background, we conducted a methodological exploration, utilizing what we call an item-based, Rasch-calibrated method, to measure translation quality. This new method, built on the sophisticated psychometric model of many-facet Rasch measurement, inherits the item concept from its predecessors, but addresses previous limitations. In this article, we demonstrate its operationalization and provide an initial body of empirical evidence supporting its reliability, validity, and utility, as well as discuss its potential applications.
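
For orientation, many-facet Rasch measurement (Linacre 1989) is commonly written in the log-odds form below. This is a sketch under stated assumptions: the three facets shown (translator, item, rater) and the symbols are illustrative choices, since the abstract does not specify which facets the authors model.

\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k
\]

Here \(P_{nijk}\) is the probability that translator \(n\) receives score category \(k\) rather than \(k-1\) on item \(i\) from rater \(j\); \(\theta_n\) is the translator's ability, \(\delta_i\) the item's difficulty, \(\alpha_j\) the rater's severity, and \(\tau_k\) the threshold between categories \(k-1\) and \(k\). All parameters are expressed on a common logit scale, which is what allows several facets of an assessment to be compared in a single frame of reference.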

Translation quality assessment (TQA) is one of the most contested topics in Translation Studies (TS), inviting much theoretical and methodological discussion (e.g., Lauscher 2000; House 2015; Han 2020). Running in parallel to scholarly theorization of translation quality and its assessment is a diverse array of scoring methods used to measure translation quality (Waddington 2001; Eyckmans, Segers, and Anckaert 2012). In general, these methods can be categorized into two major approaches: (a) item-based assessment, which requires raters to focus primarily on local or itemized elements of translation (e.g., lexical/phrasal constructions, grammatical structures), exemplified by the conventional method of error analysis (EA) (e.g., ATA’s error marking, see Teague [1987]; Canada’s Sical system, see Williams [1989]); and (b) rubric-referenced assessment, also known as rubric scoring, which relies on a rubric (a graduated set of performance descriptors) to generate a more global understanding of translation quality, advocated by a growing number of translation educators and researchers (Waddington 2001; Colina 2008, 2009; Angelelli 2009; Turner, Lai, and Huang 2010). Although rubric scoring is regarded as a useful method (Colina 2009; Turner, Lai, and Huang 2010), the past decade has witnessed substantial methodological progress in item-based assessment, with TS researchers exploring new practices. The highlight of this development is the upgrade of EA into psychometrically sound scoring methods, including (a) calibration of dichotomous items (CDI) (Eyckmans, Anckaert, and Segers 2009) and (b) preselected item evaluation (PIE) (Kockaert and Segers 2012, 2017).


References

Angelelli, Claudia V.
2009 “Using a Rubric to Assess Translation Ability: Defining the Construct.” In Testing and Assessment in Translation and Interpreting Studies: A Call for Dialogue between Research and Practice, edited by Claudia V. Angelelli and Holly E. Jacobson, 13–47. Amsterdam: John Benjamins.
Bachman, Lyle F.
2004 Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press.
Bond, Trevor G., and Christine M. Fox
2015 Applying the Rasch Model: Fundamental Measurement in the Human Sciences. 3rd ed. New York: Routledge.
Campbell, Stuart J.
1991 “Towards a Model of Translation Competence.” Meta 36 (2–3): 329–343.
Colina, Sonia
2008 “Translation Quality Evaluation: Some Empirical Evidence for a Functionalist Approach.” The Translator 14 (1): 97–134.
2009 “Further Evidence for a Functionalist Approach to Translation Quality Evaluation.” Target 21 (2): 235–264.
Eckes, Thomas
2015 Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments. 2nd ed. Frankfurt am Main: Peter Lang.
Eyckmans, June, and Philippe Anckaert
2017 “Item-based Assessment of Translation Competence: Chimera of Objectivity Versus Prospect of Reliable Measurement.” In Translator Quality – Translation Quality: Empirical Approaches to Assessment and Evaluation, edited by Geoffrey S. Koby and Isabel Lacruz, special issue of Linguistica Antverpiensia 16: 40–56.
Eyckmans, June, Philippe Anckaert, and Winibert Segers
2009 “The Perks of Norm Referenced Translation Evaluation.” In Testing and Assessment in Translation and Interpreting Studies: A Call for Dialogue between Research and Practice, edited by Claudia V. Angelelli and Holly E. Jacobson, 73–93. Amsterdam: John Benjamins.
Eyckmans, June, Winibert Segers, and Philippe Anckaert
2012 “Translation Assessment Methodology and the Prospects of European Collaboration.” In Collaboration in Language Testing and Assessment, edited by Dina Tsagari and Ildikó Csépes, 171–184. Frankfurt am Main: Peter Lang.
Green, Rita
2013 Statistical Analyses for Language Testers. Basingstoke: Palgrave Macmillan.
Han, Chao
2015 “Investigating Rater Severity/Leniency in Interpreter Performance Testing: A Multifaceted Rasch Measurement Approach.” Interpreting 17 (2): 255–283.
2016 “Investigating Score Dependability in English/Chinese Interpreter Certification Performance Testing: A Generalizability Theory Approach.” Language Assessment Quarterly 13 (3): 186–201.
2017 “Using Analytic Rating Scales to Assess English/Chinese Bi-directional Interpretation: A Longitudinal Rasch Analysis of Scale Utility and Rater Behavior.” In Translator Quality – Translation Quality: Empirical Approaches to Assessment and Evaluation, edited by Geoffrey S. Koby and Isabel Lacruz, special issue of Linguistica Antverpiensia 16: 196–215.
2019 “A Generalizability Theory Study of Optimal Measurement Design for a Summative Assessment of English/Chinese Consecutive Interpreting.” Language Testing 36 (3): 419–438.
2020 “Translation Quality Assessment: A Critical Methodological Review.” The Translator 26 (3): 257–273.
Han, Chao, Rui Xiao, and Wei Su
2021 “Assessing the Fidelity of Consecutive Interpreting: The Effects of Using Source Versus Target Text as the Reference Material.” Interpreting 23 (2): 245–268.
House, Juliane
2015 Translation Quality Assessment: Past and Present. Abingdon: Routledge.
IBM Corp
2012 IBM SPSS Statistics for Windows. V. 21.0. Armonk, NY: IBM Corp.
Kockaert, Hendrik J., and Winibert Segers
2012 “L’assurance qualité des traductions: items sélectionnés et évaluation assistée par ordinateur [Quality assurance of translations: Selected items and computer-assisted evaluation].” Meta 57 (1): 159–176.
2017 “Evaluation of Legal Translations: PIE Method (Preselected Items Evaluation).” JoSTrans 27: 148–163.
Lauscher, Susanne
2000 “Translation Quality Assessment: Where Can Theory and Practice Meet?” The Translator 6 (2): 149–168.
Linacre, John M.
1989 Many-Facet Rasch Measurement. Chicago: MESA Press.
1999 “Investigating Rating Scale Category Utility.” Journal of Outcome Measurement 3 (2): 103–122.
2002 “What Do Infit and Outfit, Mean-Square and Standardized Mean?” Rasch Measurement Transactions 16 (2): 878.
2017 FACETS: Computer Program for Many Faceted Rasch Measurement. V. 3.80.0. Beaverton, OR: Winsteps.
Martínez Mateo, Robert
2014 “A Deeper Look into Metrics for Translation Quality Assessment (TQA): A Case Study.” Miscelanea 49: 73–93.
McAlester, Gerard
2000 “The Evaluation of Translation into a Foreign Language.” In Developing Translation Competence, edited by Christina Schäffner and Beverly Adab, 229–241. Amsterdam: John Benjamins.
Myford, Carol M., and Edward W. Wolfe
2003 “Detecting and Measuring Rater Effects Using Many-Facet Rasch Measurement: Part I.” Journal of Applied Measurement 4 (4): 386–422.
O’Brien, Sharon
2012 “Towards a Dynamic Quality Evaluation Model for Translation.” JoSTrans 17: 55–77.
Pym, Anthony
1992 “Translation Error Analysis and the Interface with Language Teaching.” In Teaching Translation and Interpreting: Training, Talent and Experience. Papers from the First Language International Conference, Elsinore, Denmark, 1991, edited by Cay Dollerup and Anne Loddegaard, 279–288. Amsterdam: John Benjamins.
Rasch, Georg
1980 Probabilistic Models for Some Intelligence and Attainment Tests. Chicago: MESA Press.
Teague, Ben
1987 “ATA Accreditation and Excellence in Practice.” In Translation Excellence: Assessment, Achievement, Maintenance, edited by Marilyn Gaddis Rose, 21–26. Amsterdam: John Benjamins.
Turner, Barry, Miranda Lai, and Neng Huang
2010 “Error Deduction and Descriptors – A Comparison of Two Methods of Translation Test Assessment.” Translation & Interpreting 2 (1): 11–23.
Waddington, Christopher
2001 “Should Translations Be Assessed Holistically or Through Error Analysis?” Hermes 26: 15–38.
Williams, Malcolm
1989 “The Assessment of Professional Translation Quality: Creating Credibility out of Chaos.” TTR 2 (2): 13–33.
Wind, Stefanie A., and Meghan E. Peterson
2018 “A Systematic Review of Methods for Evaluating Rating Quality in Language Assessment.” Language Testing 35 (2): 161–192.