A quantitative text model for detecting random texts: From distinguishability to informativity

Konca, Maxim; Mehler, Alexander; Baumartz, Daniel; Hemati, Wahed

doi:10.1075/cilt.356.10kon

Part of

Language and Text: Data, models, information and applications
Edited by Adam Pawłowski, Jan Mačutek, Sheila Embleton and George Mikros
[Current Issues in Linguistic Theory 356] 2021
pp. 145–162

From distinguishability to informativity

A quantitative text model for detecting random texts

Maxim Konca | Goethe University Frankfurt

Alexander Mehler | Goethe University Frankfurt

Daniel Baumartz | Goethe University Frankfurt

Wahed Hemati | Goethe University Frankfurt

We present a study of the distinctiveness of random and non-random texts based on text characteristics of quantitative linguistics. We additionally experiment with text features that evaluate contiguity associations among sentences by means of BERT (Bidirectional Encoder Representations from Transformers). To this end, we experiment with generative models for random texts as currently discussed in the context of neural networks. The chapter contributes to the clarification of deficits of existing random text models and of the informativeness of quantitative text features.

Keywords: random text, quantitative text characteristics, text classification, BERT

Article outline

1. Introduction
2. Text corpora and their quantification
- 2.1 Quantification
- 2.2 Text corpora and their randomization
- 2.3 Classification and evaluation methods
3. Results
4. Discussion
5. Conclusion
Notes
References

Published online: 22 December 2021

