Chapter published in:
Corpus Approaches to Social Media
Edited by Sofia Rüdiger and Daria Dayter
[Studies in Corpus Linguistics 98] 2020
pp. 111–130
Chapter 5. Using lengthwise scaling to compare feature frequencies across text lengths on Reddit
Aatu Liimatta | University of Helsinki
Texts of different lengths can be difficult to compare using quantitative methods. This is particularly true if many of the texts are extremely short, as is commonly the case with social media comments, where the median text length may be only a few dozen words. In this paper, I explore lengthwise scaling, that is, scaling applied to each text length separately, as a possible approach for getting around some of the statistical problems caused by different text lengths and short texts. I describe two implementations of this family of methods, lengthwise rarity scaling and lengthwise quantile scaling. I show in an exploratory analysis that these scaling methods support earlier results in terms of register differences between Reddit subreddits.
Keywords: text length, short texts, register, scaling, social media
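The core idea in the abstract, scaling applied to each text length separately, can be illustrated with a minimal sketch. This is not the chapter's implementation of lengthwise rarity scaling or lengthwise quantile scaling, whose details are not given on this page. It assumes one plausible reading in which a text's feature count is replaced by its percentile rank among texts of exactly the same length. The function name `lengthwise_percentile_scale` and the toy data are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def lengthwise_percentile_scale(lengths, counts):
    """Scale each feature count against texts of the same length only.

    Assumption: the score is the share of same-length texts whose count
    is less than or equal to this text's count (an empirical percentile
    rank). The chapter's own methods may be defined differently.
    """
    # Collect the feature counts observed at each text length.
    groups = defaultdict(list)
    for length, count in zip(lengths, counts):
        groups[length].append(count)

    # Sort each group once so ranks can be found by binary search.
    for values in groups.values():
        values.sort()

    # Rank every text within its own length group.
    scaled = []
    for length, count in zip(lengths, counts):
        values = groups[length]
        scaled.append(bisect_right(values, count) / len(values))
    return scaled


# Toy data: four 5-word texts and two 20-word texts.
lengths = [5, 5, 5, 5, 20, 20]
counts = [0, 0, 1, 2, 1, 4]
print(lengthwise_percentile_scale(lengths, counts))
```

In this toy data, a single occurrence of the feature scores higher in a 5-word text than in a 20-word text, because each count is compared only with counts from texts of the same length rather than being divided by the length.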
Article outline
- 1. Introduction
- 2. Background
- 3. Related research
- 4. Lengthwise approaches
- 5. Register
- 6. Data
- 7. Case studies
- 7.1 Baseline: Normalization
- 7.2 Method 1: Lengthwise rarity scaling
- 7.3 Method 2: Lengthwise quantile scaling
- 8. Discussion
- 9. Conclusion
- Notes
- References
Published online: 04 November 2020
https://doi.org/10.1075/scl.98.05lii