Chapter 5. Using lengthwise scaling to compare feature frequencies across text lengths on Reddit

Liimatta, Aatu

doi:10.1075/scl.98.05lii

Part of

Corpus Approaches to Social Media
Edited by Sofia Rüdiger and Daria Dayter
[Studies in Corpus Linguistics 98] 2020
pp. 111–130

Aatu Liimatta | University of Helsinki

Texts of different lengths can be difficult to compare using quantitative methods. This is particularly true if many of the texts are extremely short, as is commonly the case with social media comments, where the median text length may be only a few dozen words. In this paper, I explore lengthwise scaling, that is, scaling applied to each text length separately, as a possible approach for getting around some of the statistical problems caused by different text lengths and short texts. I describe two implementations of this family of methods, lengthwise rarity scaling and lengthwise quantile scaling. I show in an exploratory analysis that these scaling methods support earlier results in terms of register differences between Reddit subreddits.
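To make the idea of scaling "applied to each text length separately" concrete, the following is a minimal sketch of one possible reading of lengthwise quantile scaling. It is not the chapter's implementation: the function name `lengthwise_quantile_scale`, the `(length, count)` input format, grouping by exact token length, and the mid-rank quantile definition are all assumptions made for illustration.

```python
from collections import defaultdict


def lengthwise_quantile_scale(texts):
    """Scale feature counts to [0, 1] within each text-length group.

    `texts` is a list of (length_in_tokens, feature_count) pairs.
    ASSUMPTION: texts are grouped by exact length and scored with a
    mid-rank empirical quantile; the chapter may define this differently.
    """
    # Collect the feature counts observed at each text length.
    by_length = defaultdict(list)
    for length, count in texts:
        by_length[length].append(count)

    scores = []
    for length, count in texts:
        peers = by_length[length]
        # Share of same-length texts with a lower count,
        # plus half the share with an equal count (mid-rank).
        below = sum(1 for c in peers if c < count)
        equal = sum(1 for c in peers if c == count)
        scores.append((below + 0.5 * equal) / len(peers))
    return scores


# Hypothetical toy data: three 10-token texts and three 200-token texts.
toy = [(10, 0), (10, 1), (10, 3), (200, 3), (200, 9), (200, 20)]
print([round(s, 2) for s in lengthwise_quantile_scale(toy)])
# → [0.17, 0.5, 0.83, 0.17, 0.5, 0.83]
```

In this toy data, a raw count of 3 sits at the top of the 10-token group but at the bottom of the 200-token group. A plain per-1,000-words normalization would instead give 300 and 15, two values that are hard to compare directly because very short texts can only take a handful of distinct rates.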

Keywords: text length, short texts, register, scaling, social media

Article outline

1. Introduction
2. Background
3. Related research
4. Lengthwise approaches
5. Register
6. Data
7. Case studies
- 7.1 Baseline: Normalization
- 7.2 Method 1: Lengthwise rarity scaling
- 7.3 Method 2: Lengthwise quantile scaling
8. Discussion
9. Conclusion
Notes
References

Published online: 4 November 2020

https://doi.org/10.1075/scl.98.05lii


