Do speech registers differ in the predictability of words?

Bentum, Martijn; ten Bosch, Louis; van den Bosch, Antal; Ernestus, Mirjam

doi:10.1075/ijcl.17062.ben

Article published in:

International Journal of Corpus Linguistics
Vol. 24:1 (2019), pp. 98–130

Martijn Bentum | Centre for Language Studies, Radboud University

Louis ten Bosch | Centre for Language Studies, Radboud University | Max Planck Institute for Psycholinguistics

Antal van den Bosch | Centre for Language Studies, Radboud University | KNAW Meertens Institute

Mirjam Ernestus | Centre for Language Studies, Radboud University | Max Planck Institute for Psycholinguistics

Previous research has demonstrated that language use can vary depending on the context of situation. The present paper extends this finding by comparing word predictability differences between 14 speech registers, ranging from highly informal conversations to read-aloud books. We trained 14 statistical language models, one per register, to compute register-specific word predictability, and trained a register classifier on the vector of perplexity scores produced by these language models. The classifier distinguishes perfectly between samples from all speech registers, and this result generalizes to unseen materials. We show that differences in vocabulary and sentence length cannot explain the speech register classifier’s performance. The combined results show that speech registers differ in word predictability.

Keywords: speech registers, word predictability, text classification, statistical language modelling, register analysis
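
The pipeline summarized in the abstract (one language model per register, a perplexity score vector per text sample, and a classifier trained on those vectors) can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the abstract does not state the model order, the smoothing method, or the classifier type, so the bigram order, the add-one smoothing, the nearest-centroid classifier, the two register names, and the toy English sentences below are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy data: tokenized sentences per register (not from the study's corpus).
LM_TRAIN = {
    "conversation": [["yeah", "i", "know"], ["oh", "yeah", "right"], ["i", "know", "right"]],
    "news": [["the", "minister", "said", "today"], ["the", "government", "said", "today"],
             ["the", "minister", "announced", "plans"]],
}
# Separate samples for training the classifier; each sample is a list of sentences.
CLF_TRAIN = {
    "conversation": [[["oh", "i", "know"]], [["yeah", "right"]]],
    "news": [[["the", "government", "announced", "plans"]], [["the", "minister", "said"]]],
}

# One vocabulary shared by all models, so that perplexities are comparable.
VOCAB = {w for sents in LM_TRAIN.values() for s in sents for w in s} | {"</s>", "<unk>"}

def pad(sentence):
    """Add sentence boundary markers and map out-of-vocabulary words to <unk>."""
    return ["<s>"] + [w if w in VOCAB else "<unk>" for w in sentence] + ["</s>"]

def train_bigram(sentences):
    """Return (context counts, bigram counts) for one register."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = pad(sentence)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def log_perplexity(model, sample):
    """Mean negative log probability per token, with add-one smoothing (an assumption)."""
    contexts, bigrams = model
    total, n = 0.0, 0
    for sentence in sample:
        tokens = pad(sentence)
        for prev, word in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(VOCAB))
            total -= math.log(p)
            n += 1
    return total / n

REGISTERS = sorted(LM_TRAIN)
MODELS = {r: train_bigram(LM_TRAIN[r]) for r in REGISTERS}

def features(sample):
    """Perplexity score vector: one (log) perplexity per register-specific model."""
    return [log_perplexity(MODELS[r], sample) for r in REGISTERS]

def centroid(vectors):
    return [sum(column) / len(column) for column in zip(*vectors)]

# Nearest-centroid classifier over the perplexity vectors (an assumed classifier type).
CENTROIDS = {r: centroid([features(s) for s in CLF_TRAIN[r]]) for r in REGISTERS}

def classify(sample):
    vector = features(sample)
    return min(REGISTERS, key=lambda r: math.dist(vector, CENTROIDS[r]))

print(classify([["yeah", "i", "know", "right"]]))
print(classify([["the", "government", "said", "plans"]]))
```

Using the log of the perplexity as the feature keeps the values on a comparable scale across models; with 14 registers, as in the study, each sample would yield a 14-dimensional vector instead of the two-dimensional one used here.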

Article outline

1. Introduction
2. Characterizing text in register analysis and natural language processing
3. Methodology
- 3.1 Corpus
- 3.2 Analysis
4. Study 1: SLM vocabulary selection
- 4.1 Procedure
- 4.2 Results and discussion
5. Study 2: Training and testing of the speech register classifier
- 5.1 Procedure
- 5.2 Results and discussion
6. Study 3: Validation of the speech register classifier
- 6.1 Procedure
- 6.2 Results and discussion
7. Study 4: How much text material is needed for speech register classification?
- 7.1 Procedure
- 7.2 Results and discussion
8. Study 5: The sentence length confound
- 8.1 Procedure
- 8.2 Results and discussion
9. General discussion and conclusion
Notes
References

Published online: 2 July 2019
