Pinning down text complexity
An Exploratory Study on the Registers of the Stockholm-Umeå Corpus (SUC)
In this article, we present the results of a corpus-based study in which we explore whether it is possible to automatically
single out different facets of text complexity in a general-purpose corpus. To this end, we use factor analysis as applied in Biber’s
multi-dimensional analysis framework. We evaluate the factor solution by correlating factor scores with readability scores to
ascertain whether the selected solution matches an independent measurement of readability, a notion tightly linked to text
complexity. The corpus used in the study is the Swedish national corpus, the Stockholm-Umeå Corpus (SUC). The SUC
contains subject-based text varieties (e.g., hobby), press genres (e.g., editorials), and mixed categories (e.g., miscellaneous). We refer
to them collectively as ‘registers’. Results show that it is indeed possible to elicit and interpret facets of text complexity using factor
analysis, despite some caveats. We propose a tentative text complexity profiling of the SUC registers.
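The evaluation step summarised above, correlating factor scores with a readability score, can be sketched in a few lines. The outline names LIX as the readability measure; LIX is the average sentence length plus the percentage of words longer than six letters. The sketch below is illustrative only and is not the authors' pipeline: the tokenisation rules, the function names `lix` and `pearson`, the toy texts, and the factor-score values are all assumptions made for this example.

```python
import math
import re


def lix(text: str) -> float:
    """LIX = (words / sentences) + 100 * (long words / words).

    A "long word" has more than six letters. Tokenisation is a
    simplifying assumption: words are runs of word characters and
    sentences end at '.', '!' or '?'.
    """
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy texts and made-up factor scores, for illustration only;
    # neither is taken from the SUC or from the study.
    texts = [
        "I like it. We go now. It is fun.",
        "The committee discussed the proposal. Nobody objected to it.",
        "Administrative regulations concerning environmental "
        "protection require comprehensive documentation.",
    ]
    factor_scores = [-0.8, 0.1, 0.9]  # hypothetical per-text factor scores
    lix_scores = [lix(t) for t in texts]
    print(lix_scores)
    print(pearson(factor_scores, lix_scores))
```

A coefficient near +1 or −1 would suggest that a factor tracks the readability measure closely, while a value near 0 would suggest that the factor captures something LIX does not.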
Article outline
- 1. Introduction
- 2. Text complexity and readability
- 3. Previous work
- 3.1 Multi-dimensional analysis
- 3.2 Readability-text complexity: Automatic approaches
- 4. Method
- 4.1 The SUC corpus and dataset
- 4.2 Multi-dimensional analysis: Technicalities
- 4.2.1 Variable screening
- 4.2.2 Running multi-dimensional analysis
- 4.2.3 Three-Factors solution
- 5. Meaningful factors? Evaluation and interpretation
- 5.1 Evaluation: Correlating LIX scores & factor scores
- 5.1.1 Factor 1 scores & LIX scores
- 5.1.2 Factor 2 scores & LIX scores
- 5.1.3 Factor 3 scores & LIX scores
- 5.1.4 Summary
- 5.2 Interpretation: Signed dimensions & text complexity facets
- 5.2.1 Factor 1: Dim1+ & Dim1−
- 5.2.2 Dim1+: Pronominal-Adverbial (spoken-emotional) facet – Average readability
- 5.2.3 Dim1−: Nominal (informational) facet – Difficult readability
- 5.2.4 Factor 2: Dim2+
- 5.2.5 Dim2+: Adjectival (information elaboration) facet – Difficult readability
- 5.2.6 Factor 3: Dim3+ & Dim3−
- 5.2.7 Dim3+: Verbal (engaged) facet – Difficult readability
- 5.2.8 Dim3−: Appositional (information expansion) facet – Difficult readability
- 5.2.9 Summary
- 6. Profiling SUC registers
- 7. Discussion
- 8. Conclusion and future work
- Notes
- Companion website
- References
References (55)
Adesam, Y., Bouma, G., & Johansson, R. (2018). The Koala part-of-speech and morphological tagset for Swedish. SLTC.
Asención-Delaney, Y., & Collentine, J. (2011). A multidimensional analysis of a written L2 Spanish corpus. Applied Linguistics, 32(3), 299–322.
Biber, D. (1988). Variation across speech and writing. Cambridge University Press.
Biber, D. (1989). A typology of English texts. Linguistics, 27(1), 3–44.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Longman.
Biber, D., & Kurjian, J. (2007). Towards a taxonomy of web registers and text types: A multi-dimensional analysis. In Corpus Linguistics and the Web (pp. 109–131).
Biber, D., & Conrad, S. (2009). Register, genre, and style. Cambridge University Press.
Biber, D., & Egbert, J. (2016). Register variation on the searchable web: A multi-dimensional analysis. Journal of English Linguistics, 44(2), 95–137.
Björnsson, C. H. (1968). Läsbarhet. Liber.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276.
Common Core State Standards Initiative (2010). Common Core State Standards for English Language Arts & Literacy in History/Social Studies, Science, and Technical Subjects. Appendix A: Research Supporting Key Elements of the Standards, Glossary of Key Terms.
Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., Zasina, A. J., & Benko, V. (2020). Comparing web-crawled and traditional corpora. Language Resources and Evaluation, 1–33.
Dale, E., & Chall, J. S. (1949). The concept of readability. Elementary English, 26(1), 19–26.
Dell’Orletta, F., Montemagni, S., & Venturi, G. (2013, September). Linguistic profiling of texts across textual genres and readability levels. An exploratory study on Italian fictional prose. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013 (pp. 189–197).
Dell’Orletta, F., Montemagni, S., & Venturi, G.
DiStefano, C., Zhu, M., & Mindrila, D. (2009). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research & Evaluation, 14(20), 1–11.
Fahlborg, D., & Rennes, E. (2016). Introducing SAPIS–an API service for text analysis and simplification. In the second national Swe-Clarin workshop: Research collaborations for the digital age, Umeå, Sweden.
Falkenjack, J. (2018). Towards a model of general text complexity for Swedish (Doctoral dissertation, Linköping University Electronic Press).
Falkenjack, J., Mühlenbock, K. H., & Jönsson, A. (2013, May). Features indicating readability in Swedish text. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013) (pp. 27–40).
Falkenjack, J., Santini, M., & Jönsson, A. (2016). An exploratory study on genre classification using readability features. In Proceedings of the Sixth Swedish Language Technology Conference (SLTC 2016), Umeå, Sweden.
Feng, L. (2010). Automatic readability assessment (Doctoral dissertation, CUNY Academic Works).
Field, A. (2000). Discovering statistics using SPSS for Windows. London: Sage Publications.
Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233.
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Sage Publications.
Hayton, J. C., Allen, D. G., & Scarpello, V. (2004). Factor retention decisions in exploratory factor analysis: A tutorial on parallel analysis. Organizational Research Methods, 7(2), 191–205.
Hiebert, E. H. (2012). Readability and the common core’s staircase of text complexity. Text Matters, 11.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185.
Housen, A., De Clercq, B., Kuiken, F., & Vedder, I. (2019). Multiple approaches to complexity in second language research. Second Language Research, 35(1), 3–21.
Jelen, B. (2013). Excel 2013 charts and graphs. Que Publishing Company.
Jönsson, S., Rennes, E., Falkenjack, J., & Jönsson, A. (2018). A component based approach to measuring text complexity. In Proceedings of The Seventh Swedish Language Technology Conference 2018 (SLTC-18).
Kate, R. J., Luo, X., Patwardhan, S., Franz, M., Florian, R., Mooney, R. J., & Welty, C. (2010, August). Learning to predict readability using diverse linguistic features. In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 546–554). Association for Computational Linguistics.
Källgren, G., Gustafson-Capková, S., & Hartmann, B. (2006). Manual of the Stockholm Umeå Corpus version 2.0. Department of Linguistics, Stockholm University, December. Sofia Gustafson-Capková and Britt Hartmann (eds.).
Ledesma, R. D., Valero-Mora, P., & Macbeth, G. (2015). The scree test and the number of factors: A dynamic graphics approach. The Spanish Journal of Psychology, 18, E11.
Mühlenbock, K. H. (2013). I see what you mean: Assessing readability for specific target groups (Doctoral dissertation, University of Gothenburg, Gothenburg, Sweden).
Napolitano, D., Sheehan, K. M., & Mundkowsky, R. (2015, June). Online readability and text complexity analysis with Text Evaluator. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations (pp. 96–100).
Nenkova, A., Chae, J., Louis, A., & Pitler, E. (2010). Structural features for predicting the linguistic quality of text. In Empirical methods in natural language generation (pp. 222–241). Springer, Berlin, Heidelberg.
Nivre, J. (2006). Inductive dependency parsing (pp. 87–120). Springer Netherlands.
Pallotti, G. (2015). A simple view of linguistic complexity. Second Language Research, 31(1), 117–134.
Petersen, S. (2007). Natural language processing tools for reading level assessment and text simplification for bilingual education (Doctoral dissertation, University of Washington, Seattle, WA, USA).
Petersen, S. E., & Ostendorf, M. (2009). A machine learning approach to reading level assessment. Computer Speech & Language, 23(1), 89–106.
Pilán, I., Vajjala, S., & Volodina, E. (2016). A readable read: Automatic assessment of language learning materials based on linguistic complexity. arXiv preprint arXiv:1603.08868.
Pitler, E., & Nenkova, A. (2008, October). Revisiting readability: A unified framework for predicting text quality. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 186–195).
Rello, L., Baeza-Yates, R., Bott, S., & Saggion, H. (2013a). Simplify or help? Text simplification strategies for people with dyslexia. In Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility (pp. 1–10).
Rello, L., Baeza-Yates, R., Dempere-Marco, L., & Saggion, H. (2013b). Frequent words improve readability and short words improve understandability for people with dyslexia. In IFIP Conference on Human-Computer Interaction (pp. 203–219). Springer.
Saggion, H. (2017). Automatic text simplification. Synthesis Lectures on Human Language Technologies, 10(1), 1–137.
Santini, M., Danielsson, B., & Jönsson, A. (2019, August). Introducing the Notion of ‘Contrast’ Features for Language Technology. In International Conference on Database and Expert Systems Applications (pp. 189–198). Springer, Cham.
Sardinha, T. B., Kauffmann, C., & Acunzo, C. M. (2014). A multi-dimensional analysis of register variation in Brazilian Portuguese. Corpora, 9(2), 239–271.
Sardinha, T. B., & Pinto, M. V.
Štajner, S., & Saggion, H. (2018, August). Data-Driven Text Simplification. In Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts (pp. 19–23).
Vega, B., Feng, S., Lehman, B., Graesser, A., & D’Mello, S. (2013, July). Reading into the text: Investigating the influence of text complexity on cognitive engagement. In Educational Data Mining 2013.
Wray, D., & Janan, D. (2013). Readability revisited? The implications of text complexity. The Curriculum Journal.
Cited by (2)
Tao, Xuelian & Vahid Aryadoust (2024). A Multidimensional Analysis of a High-Stakes English Listening Test: A Corpus-Based Approach. Education Sciences, 14(2), pp. 137 ff.
Vahrusheva, Alexandra, Valery Solovyev, Marina Solnyshkina, Elzara Gafiaytova & Svetlana Akhtyamova (2023). Revisiting Assessment of Text Complexity: Lexical and Syntactic Parameters Fluctuations. In Speech and Computer (Lecture Notes in Computer Science, 14338), pp. 430 ff.
This list is based on CrossRef data as of 5 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.