Chapter published in: Corpus Approaches to Social Media
Edited by Sofia Rüdiger and Daria Dayter
[Studies in Corpus Linguistics 98] 2020
pp. 111–130
Using lengthwise scaling to compare feature frequencies across text lengths on Reddit
Texts of different lengths can be difficult to compare using quantitative methods. This is particularly true if many of the texts are extremely short, as is commonly the case with social media comments, where the median text length may be only a few dozen words. In this paper, I explore lengthwise scaling, that is, scaling applied separately to each text length, as a possible way around some of the statistical problems caused by varying text lengths and short texts. I describe two implementations of this family of methods: lengthwise rarity scaling and lengthwise quantile scaling. An exploratory analysis shows that these scaling methods support earlier findings on register differences between subreddits.
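The idea of scaling each text length separately can be illustrated with a minimal sketch. It is only one possible reading of quantile-style scaling and not the chapter's actual implementation: the function name lengthwise_quantile_scale, the (length, count) input format, and the mid-rank handling of ties are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def lengthwise_quantile_scale(texts):
    """Scale feature counts against texts of the same length only.

    texts: list of (length_in_words, feature_count) pairs (hypothetical format).
    Returns one score per text, in input order: the mid-rank quantile of its
    feature count among all texts that share its length.
    """
    # Collect the feature counts observed at each text length.
    by_length = defaultdict(list)
    for length, count in texts:
        by_length[length].append(count)
    for counts in by_length.values():
        counts.sort()

    scores = []
    for length, count in texts:
        counts = by_length[length]
        below = bisect_left(counts, count)         # texts with a smaller count
        at_or_below = bisect_right(counts, count)  # texts with a count <= this one
        # Mid-rank quantile: tied counts share the midpoint of their rank range.
        scores.append((below + at_or_below) / (2 * len(counts)))
    return scores


# Toy data: four 10-word texts and two 50-word texts.
sample = [(10, 0), (10, 0), (10, 1), (10, 3), (50, 3), (50, 5)]
scaled = lengthwise_quantile_scale(sample)
```

In this toy data, a raw count of 3 receives a high score among the 10-word texts and a low score among the 50-word texts, because each count is ranked only against texts of the same length.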
Keywords: text length, short texts, register, scaling, social media
Published online: 04 November 2020