Vol. 2:2 (2020), pp. 306–349
Pinning down text complexity
An Exploratory Study on the Registers of the Stockholm-Umeå Corpus (SUC)
In this article, we present the results of a corpus-based study in which we explore whether different facets of text complexity can be singled out automatically in a general-purpose corpus. To this end, we use factor analysis as applied in Biber’s multi-dimensional analysis framework. We evaluate the factor solution by correlating factor scores with readability scores, to ascertain whether the selected solution agrees with an independent measurement of readability, a notion tightly linked to text complexity. The corpus used in the study is the Swedish national corpus, the Stockholm-Umeå Corpus (SUC). The SUC contains subject-based text varieties (e.g., hobby), press genres (e.g., editorials), and mixed categories (e.g., miscellaneous); we refer to them collectively as ‘registers’. Results show that, with some caveats, it is indeed possible to elicit and interpret facets of text complexity using factor analysis. We propose a tentative text complexity profiling of the SUC registers.
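The evaluation step described above (correlating factor scores with readability scores) can be illustrated with a minimal Python sketch. It is not the authors' R code. It assumes the standard LIX definition (average sentence length plus the percentage of words longer than six letters), a simplified sentence splitter, and Pearson's r as the correlation coefficient, which the article may or may not use. The function names `lix` and `pearson` and all demo values are hypothetical.

```python
import math
import re


def lix(text: str) -> float:
    """LIX readability: words/sentences + 100 * long_words/words.

    Assumption: a "long" word has more than 6 letters, and sentences
    end at '.', '!' or '?' (a simplification of real tokenization).
    """
    words = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Hypothetical toy texts and factor scores, for illustration only;
    # they are not taken from the SUC or from the article's results.
    texts = [
        "Jag gick hem. Det var sent.",
        "Detta är en mening. Här kommer ytterligare information.",
        "Kommunfullmäktige behandlade budgetpropositionen under sammanträdet.",
    ]
    factor_scores = [0.9, 0.1, -1.2]  # invented scores on one factor
    lix_scores = [lix(t) for t in texts]
    print("LIX:", lix_scores)
    print("r:", pearson(factor_scores, lix_scores))
```

In practice the correlation would be computed once per factor, over all texts in the corpus, with the factor scores taken from the fitted factor solution rather than invented as they are here.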
Article outline
- 1. Introduction
- 2. Text complexity and readability
- 3. Previous work
  - 3.1 Multi-dimensional analysis
  - 3.2 Readability-text complexity: Automatic approaches
- 4. Method
  - 4.1 The SUC corpus and dataset
  - 4.2 Multi-dimensional analysis: Technicalities
    - 4.2.1 Variable screening
    - 4.2.2 Running multi-dimensional analysis
    - 4.2.3 Three-Factors solution
- 5. Meaningful factors? Evaluation and interpretation
  - 5.1 Evaluation: Correlating LIX scores & factor scores
    - 5.1.1 Factor 1 scores & LIX scores
    - 5.1.2 Factor 2 scores & LIX scores
    - 5.1.3 Factor 3 scores & LIX scores
    - 5.1.4 Summary
  - 5.2 Interpretation: Signed dimensions & text complexity facets
    - 5.2.1 Factor 1: Dim1+ & Dim1−
    - 5.2.2 Dim1+: Pronominal-Adverbial (spoken-emotional) facet – Average readability
    - 5.2.3 Dim1−: Nominal (informational) facet – Difficult readability
    - 5.2.4 Factor 2: Dim2+
    - 5.2.5 Dim2+: Adjectival (information elaboration) facet – Difficult readability
    - 5.2.6 Factor 3: Dim3+ & Dim3−
    - 5.2.7 Dim3+: Verbal (engaged) facet – Difficult readability
    - 5.2.8 Dim3−: Appositional (information expansion) facet – Difficult readability
    - 5.2.9 Summary
- 6. Profiling SUC registers
- 7. Discussion
- 8. Conclusion and future work
- Notes
- Companion website
- References
Companion website
The study described in this paper is fully reproducible. Datasets, radar charts and R code are available here: <[URL]>.