This paper explores variation in lexico-grammatical register features across text lengths in a large-scale sample of Reddit comments. Very short texts are known to be problematic for many statistical methods, so understanding their nature is important for the corpus-linguistic study of social media, where most contributions are short. I show that the frequencies of linguistic features change with comment length, even between longer comments, although longer texts are often considered similar in statistical terms. Moreover, I classify the variation found between short comments of different lengths into two main patterns, although other patterns can also be found, and there is variation even within these patterns. Furthermore, I interpret the observed differences in terms of register variation. For example, shorter comments appear to be more casual and less edited in terms of their feature makeup, whereas narrative and informational registers seem to favor longer comments.