Edited by Sofia Rüdiger and Daria Dayter
[Studies in Corpus Linguistics 98] 2020
pp. 111–130
Texts of different lengths can be difficult to compare using quantitative methods. This is particularly true if many of the texts are extremely short, as is commonly the case with social media comments, where the median text length may be only a few dozen words. In this paper, I explore lengthwise scaling, that is, scaling applied to each text length separately, as a possible approach for getting around some of the statistical problems caused by different text lengths and short texts. I describe two implementations of this family of methods, lengthwise rarity scaling and lengthwise quantile scaling. I show in an exploratory analysis that these scaling methods support earlier results in terms of register differences between Reddit subreddits.
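To make the idea of scaling "applied to each text length separately" concrete, the following is a minimal sketch and not the chapter's implementation. It assumes, purely for illustration, that a text's feature count is replaced by its empirical quantile among texts of exactly the same length. The function name `lengthwise_quantile_scale` and the toy data are hypothetical, and the abstract does not specify how the chapter's lengthwise quantile scaling or lengthwise rarity scaling are actually defined.

```python
from bisect import bisect_right
from collections import defaultdict


def lengthwise_quantile_scale(lengths, counts):
    """Illustrative assumption: score each text by the share of
    same-length texts whose feature count is <= its own count."""
    # Group the feature counts by text length (in words).
    groups = defaultdict(list)
    for n, c in zip(lengths, counts):
        groups[n].append(c)
    # Sort each group once so ranks can be looked up by binary search.
    for n in groups:
        groups[n].sort()
    # Empirical quantile of each count within its own length group.
    return [
        bisect_right(groups[n], c) / len(groups[n])
        for n, c in zip(lengths, counts)
    ]


# Toy data: four 5-word texts and two 20-word texts.
lengths = [5, 5, 5, 5, 20, 20]
counts = [0, 1, 1, 3, 1, 4]  # occurrences of some feature per text
scores = lengthwise_quantile_scale(lengths, counts)
```

Under this assumption, a single occurrence of the feature scores 0.75 among the 5-word texts but only 0.5 among the 20-word texts, so the same raw count is interpreted relative to what is typical for texts of that length rather than on one shared scale.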