Researchers have increasingly turned to Amazon Mechanical Turk (AMT) to crowdsource speech data, predominantly in
English. Although AMT and similar platforms are well positioned to advance the state of the art in second language (L2) research, it is unclear whether
crowdsourced L2 speech ratings are reliable, particularly in languages other than English. The present study describes the
development and deployment of an AMT task to crowdsource comprehensibility, fluency, and accentedness ratings for L2 Spanish
speech samples. Fifty-four AMT workers, all native Spanish speakers from 11 countries, provided the ratings. Intraclass
correlation coefficients were used to estimate group-level interrater reliability, and Rasch analyses were undertaken to examine
individual differences in rater severity and fit. Excellent reliability was observed for the comprehensibility and fluency
ratings, but indices were slightly lower for accentedness, leading to recommendations to improve the task for future data
collection.
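
As a rough illustration of the group-level reliability estimate mentioned above, the following minimal sketch computes an average-measures intraclass correlation from a samples-by-raters matrix. The abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement form, ICC(2,k), is an assumption here; the function name `icc2k` and the toy ratings are hypothetical and are not the study's data.

```python
def icc2k(ratings):
    """Average-measures ICC(2,k): two-way random effects, absolute agreement.

    ratings: list of rows, one row per speech sample, one score per rater.
    Assumes a fully crossed design (every rater scores every sample).
    """
    n = len(ratings)       # number of rated speech samples
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between samples
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    bms = ss_rows / (n - 1)                # between-samples mean square
    jms = ss_cols / (k - 1)                # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))     # residual mean square

    return (bms - ems) / (bms + (jms - ems) / n)


# Toy data (hypothetical): 6 speech samples rated by 4 raters
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2k(toy_ratings), 2))
```

Because the average-measures form reflects the reliability of the mean rating across all raters, it rises as more raters are added; a single-rater form would be the relevant figure if individual ratings were to be used on their own.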