A Modern Greek readability tool: Development of evaluation methods

Mikros, George; Voskaki, Rania

doi:10.1075/cilt.356.11mik

Part of

Language and Text: Data, models, information and applications
Edited by Adam Pawłowski, Jan Mačutek, Sheila Embleton and George Mikros
[Current Issues in Linguistic Theory 356] 2021
pp. 163–176

George Mikros | Hamad Bin Khalifa University

Rania Voskaki | Centre for the Greek Language

The aim of this paper is to develop an automatic readability analysis tool for Modern Greek as a foreign language. Building on previous work at the Centre for the Greek Language (CGL), we offer an enhanced methodology for predicting the readability of Modern Greek texts, matching each text to the appropriate level (A1 to C2) of the Common European Framework of Reference for Languages. The tool relies on several stylometric indices inspired by work in quantitative linguistics. The resulting feature vectors are used to train a Random Forest, a robust and accurate machine learning algorithm, which predicts readability on our testing dataset with an accuracy of 0.943, surpassing all previous readability tools for Modern Greek. Further analysis of the results with advanced visualization methods reveals the complex and fluid dynamics of the features used and their readability predictions.
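
As a rough illustration of how a text can be turned into a feature vector of stylometric indices, the sketch below computes a few common measures from quantitative linguistics. The function name `stylometric_features` and the choice of indices (type-token ratio, hapax ratio, mean word length, Yule's K) are assumptions made for this example; the abstract does not list the indices the tool actually uses.

```python
import re
import unicodedata
from collections import Counter


def stylometric_features(text: str) -> dict[str, float]:
    """Return a few illustrative stylometric indices for one text.

    Hypothetical helper: the indices are assumptions for this sketch,
    not the feature set described in the chapter.
    """
    # Compose accented Greek letters so each word stays a single \w+ run.
    text = unicodedata.normalize("NFC", text)
    # Lowercased word tokens; \w is Unicode-aware in Python 3.
    tokens = re.findall(r"\w+", text.lower())
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")

    freqs = Counter(tokens)              # word type -> frequency
    spectrum = Counter(freqs.values())   # frequency i -> number of types V_i

    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    s2 = sum(i * i * v_i for i, v_i in spectrum.items())

    return {
        "tokens": float(n),
        "type_token_ratio": len(freqs) / n,
        "hapax_ratio": spectrum.get(1, 0) / n,
        "mean_word_length": sum(len(t) for t in tokens) / n,
        "yule_k": 1e4 * (s2 - n) / (n * n),
    }


if __name__ == "__main__":
    sample = "Το σπίτι είναι μικρό και το σπίτι είναι παλιό."
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Vectors of this kind, one per text and labelled with a level from A1 to C2, could then be passed to any Random Forest implementation for training.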

Keywords: readability tool, corpora, annotation, evaluation methods

Article outline

1. Introduction
2. Readability analysis: A short literature review
3. Methodology
- 3.1 Corpus
- 3.2 Features
- 3.3 Machine learning algorithm: Random Forest
4. Results
5. Conclusion
Note
References

Published online: 22 December 2021

References (32)

Azpiazu, Ion Madrazo & Maria Soledad Pera. 2019. Multiattentive recurrent neural network architecture for multilingual readability assessment. Transactions of the Association for Computational Linguistics 7. 421–436.

Breiman, Leo. 2001. Random forests. Machine Learning 45(1). 5–32.

Collins-Thompson, Kevyn & James P. Callan. 2004. A language modeling approach to predicting reading difficulty. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, 193–200. Boston, MA: Association for Computational Linguistics.

Dale, Edgar & Jeanne S. Chall. 1948. A formula for predicting readability. Educational Research Bulletin 27(2). 37–54.

DuBay, William H. 2004. The principles of readability. Costa Mesa, CA: Impact Information.

Flesch, Rudolf. 1948. A new readability yardstick. Journal of Applied Psychology 32. 221–233.

François, Thomas & Cédrick Fairon. 2012. An “AI readability” formula for French as a foreign language. In Jun’ichi Tsujii, James Henderson & Marius Paşca (eds.), Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 466–477. Jeju Island, Korea: Association for Computational Linguistics.

Fry, Edward. 1968. A readability formula that saves time. Journal of Reading 11(7). 513–578.

Graesser, Arthur C., Danielle S. McNamara, Max M. Louwerse & Zhiqiang Cai. 2004. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers 36(2). 193–202.

Gunning, Robert. 1952. The technique of clear writing. New York: McGraw-Hill.

Hirsch, Jorge E. 2005. An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America 102(46). 16569–16572.

Kho, Julia. 2018 October 19. Why random forest is my favorite machine learning model. Towards Data Science. Retrieved 5 September 2020, from [URL]

Kincaid, Peter J., Robert P. Fishburne Jr., Richard L. Rogers & Brad S. Chissom. 1975. Derivation of new readability formulas (Automated Readability Index, Fog Count, and Flesch Reading Ease Formula) for Navy enlisted personnel. Millington, TN: Chief of Naval Technical Training Naval Air Station Memphis.

Koehrsen, Will. 2018 August 30. An implementation and explanation of the random forest in Python. Towards Data Science. Retrieved 5 September 2020, from [URL]

Kubát, Miroslav, Vladimír Matlach & Radek Čech. 2014. QUITA: Quantitative Index Text Analyzer. Lüdenscheid: RAM-Verlag.

Martinc, Matej, Senja Pollak & Marko Robnik Šikonja. 2018. Assessing readability with deep neural language models. Paper presented at the 2nd HBP Student Conference: Transdisciplinary Research Linking Neuroscience, Brain Medicine and Computer Science, Ljubljana, Slovenia, February 14–16.

McIntosh, Robert P. 1967. An index of diversity and the relation of certain concepts to diversity. Ecology 48(3). 392–404.

McLaughlin, G. Harry. 1969. SMOG Grading – a new readability formula. Journal of Reading 12(8). 639–646.

Milone, Michael. 2014. Development of the ATOS® Readability Formula. Wisconsin Rapids, WI: Renaissance Learning, Inc.

Mohammadi, Hamid & Seyed Hossein Khasteh. 2019. Text as environment: A deep reinforcement learning text readability assessment model. arXiv preprint arXiv:1912.05957.

Oakes, Michael P. 1998. Statistics for corpus linguistics. Edinburgh: Edinburgh University Press.

Pitler, Emily & Ani Nenkova. 2008. Revisiting readability: A unified framework for predicting text quality. In Mirella Lapata & Hwee Tou Ng (eds.), Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 186–195. Honolulu, HI: Association for Computational Linguistics.

Popescu, Ioan-Iovitz. 2007. The ranking by the weight of highly frequent words. In Peter Grzybek & Reinhard Köhler (eds.), Exact methods in the study of language and text, 555–565. Berlin: De Gruyter.

Popescu, Ioan-Iovitz & Gabriel Altmann. 2007. Writer’s view of text generation. Glottometrics 15. 71–81.

Popescu, Ioan-Iovitz, Karl-Heinz Best & Gabriel Altmann. 2007. On the dynamics of word classes in text. Glottometrics 14. 58–71.

Popescu, Ioan-Iovitz, Gabriel Altmann, Peter Grzybek, Bijapur D. Jayaram, Reinhard Köhler, Viktor Krupa, Ján Mačutek, Regina Pustet, Ludmila Uhlířová & Matummal N. Vidya. 2009a. Word frequency studies. Berlin: Mouton de Gruyter.

Popescu, Ioan-Iovitz, Ján Mačutek & Gabriel Altmann. 2009b. Aspects of word frequencies. Lüdenscheid: RAM-Verlag.

Popescu, Ioan-Iovitz, Ján Mačutek, Emmerich Kelih, Radek Čech, Karl-Heinz Best & Gabriel Altmann. 2010. Vectors and codes of text. Lüdenscheid: RAM-Verlag.

Schwarm, Sarah E. & Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Kevin Knight, Hwee Tou Ng & Kemal Oflazer (eds.), Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 523–530. Ann Arbor, MI: Association for Computational Linguistics.

Tweedie, Fiona J. & Harald R. Baayen. 1998. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities 32(5). 323–352.

Welling, Soeren H., Hanne H. F. Refsgaard, Per B. Brockhoff & Line H. Clemmensen. 2016. Forest floor visualizations of random forests. arXiv preprint arXiv:1605.09196.

Yule, George Udny. 1944. The statistical study of literary vocabulary. Cambridge: Cambridge University Press.