Assessing document and sentence readability in less resourced languages and across textual genres

Dell’Orletta, Felice; Montemagni, Simonetta; Venturi, Giulia

doi:10.1075/itl.165.2.03del

Article published in:

Recent Advances in Automatic Readability Assessment and Text Simplification
Edited by Thomas François and Delphine Bernhard
[ITL - International Journal of Applied Linguistics 165:2] 2014
pp. 163–193

Felice Dell’Orletta | Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC-CNR), Pisa, Italy

Simonetta Montemagni

Giulia Venturi

In this paper, we tackle three under-researched issues in the automatic readability assessment literature: the evaluation of text readability in less resourced languages, at the level of sentences (as opposed to documents), and across textual genres. Different solutions to these issues have been tested by using and refining READ‑IT, the first advanced readability assessment tool for Italian, which combines traditional raw text features with lexical, morpho-syntactic and syntactic information. In READ‑IT, readability assessment is carried out with respect to both documents and sentences, the latter being an important novelty of the proposed approach: READ‑IT shows high accuracy in the document classification task and promising results in the sentence classification scenario. By comparing the results of two versions of READ‑IT, one adopting a classification-based approach and the other a ranking-based approach, we also show that readability assessment is strongly influenced by textual genre; for this reason, a genre-oriented notion of readability is needed. With classification-based approaches, reliable results can only be achieved with genre-specific models. Since this is far from being a workable solution, especially for less resourced languages, a new ranking method for readability assessment is proposed, based on the notion of distance.
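
The abstract does not spell out how the distance-based ranking works, so the following is only a minimal sketch of the general idea, not the READ‑IT method. It assumes each document has already been reduced to a numeric feature vector, that two reference vectors stand for easy and difficult text, and that Euclidean distance is an acceptable measure. The function names, the two features (average sentence length in tokens, average word length in characters) and all values are hypothetical.

```python
import math


def difficulty_score(doc, easy_ref, hard_ref):
    """Relative position of `doc` between two reference vectors.

    Returns 0.0 when `doc` coincides with `easy_ref`, 1.0 when it
    coincides with `hard_ref`, and a value in between otherwise.
    Euclidean distance is an assumption made for this sketch.
    """
    d_easy = math.dist(doc, easy_ref)
    d_hard = math.dist(doc, hard_ref)
    total = d_easy + d_hard
    if total == 0:
        # Both references coincide with the document: no evidence either way.
        return 0.5
    return d_easy / total


def rank_by_difficulty(docs, easy_ref, hard_ref):
    """Return document names ordered from easiest to most difficult."""
    return sorted(
        docs,
        key=lambda name: difficulty_score(docs[name], easy_ref, hard_ref),
    )


if __name__ == "__main__":
    # Hypothetical features: [avg. sentence length, avg. word length].
    easy_ref = [10.0, 4.0]
    hard_ref = [30.0, 6.0]
    docs = {
        "newspaper_article": [28.0, 5.9],
        "primary_school_text": [12.0, 4.2],
        "popular_science": [20.0, 5.0],
    }
    for name in rank_by_difficulty(docs, easy_ref, hard_ref):
        score = difficulty_score(docs[name], easy_ref, hard_ref)
        print(f"{name}: {score:.2f}")
```

A score of this kind orders texts on a continuous scale instead of assigning them to a fixed class, which is one way to avoid training a separate classifier for every genre. In practice the features would need to be scaled to comparable ranges before any distance is computed.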

Keywords: classification, less resourced languages, readability, textual genres, multi-level linguistic annotation, ranking

Published online: 23 January 2015

https://doi.org/10.1075/itl.165.2.03del

