Chapter 2. Considerations in developing vertical scales for language tests

Monfils, Lora; Manna, Venessa F.

doi:10.1075/illa.1.02mon

Part of

Meaningful Language Test Scores: Research to enhance score interpretation
Edited by Spiros Papageorgiou and Venessa F. Manna
[Innovations in Language Learning and Assessment 1] 2023
pp. 14–34


Lora Monfils | Educational Testing Service

Venessa F. Manna | Educational Testing Service

This chapter provides a framework for building vertical scales for language assessments in general and for the TOEFL® Family of Assessments in particular. Topics include aspects of vertical scale design (growth definitions, vertical articulation, and data collection), statistical methods for vertical linking, and the evaluation and maintenance of the resulting vertical scale. The chapter also discusses challenges associated with vertical scaling noted in the research literature, both in general and as they pertain to language proficiency assessments.

Article outline

Introduction
Vertical scale design
- Growth definitions
- Vertical articulation
- Data collection design
Statistical methods for vertical linking
- Hieronymus scaling
- Thurstone scaling
- IRT scaling
  - IRT scaling Decision 1: Choice of model
  - IRT scaling Decision 2: Separate vs concurrent calibration
  - IRT scaling Decision 3: Scores
Evaluation of a vertical scale
Maintenance of the vertical scale
Challenges with vertical scaling
Conclusion
References

Published online: 29 June 2023

https://doi.org/10.1075/illa.1.02mon

