A method for the comparison of general sequences via type-token ratio

Matlach, Vladimír; Krivochen, Diego Gabriel; Milička, Jiří

doi:10.1075/cilt.356.03mat

Part of

Language and Text: Data, models, information and applications
Edited by Adam Pawłowski, Jan Mačutek, Sheila Embleton and George Mikros
[Current Issues in Linguistic Theory 356] 2021
pp. 37–54

Vladimír Matlach | Palacký University

Diego Gabriel Krivochen | University of Oxford

Jiří Milička | Charles University

This article proposes a new method for analyzing and comparing general linear sequences that requires minimal prior knowledge about the sequences. Sequence analysis is a broad problem studied in fields ranging from sociology and computer security to linguistics and biology. The method presented here applies the simplest quantitative-linguistic tools in order to keep the method transparent and its results easy to interpret. The results form a vector that describes each sequence; such vectors can be clustered, used in machine learning, and visualized simply with line charts or with multidimensional methods such as MDS or t-SNE. For completeness, the method's artifacts and several formal models are derived to describe its behavior in both common and extreme cases.

Keywords: sequence analysis, sequence clustering, randomness test, n-gram, type-token relation, type-token ratio, confidence intervals

Article outline

1. Introduction: Sequence analysis and its importance
2. Method: Quantitative linguistics and the most basic methodology
- Step 1: Normalization of the alphabet
- Step 2: Length normalization
- Step 3: Contextual data collection from n-grams
- Note: Choosing sequence length k and n-gram size bounding
- Step 4: Quantification of properties
  - Results: TTR vectors
- Step 5: Interpretation, visualizations, and beyond
3. Visualization methods
- 3.1 Basic line-chart
- 3.2 Classical Multidimensional Scaling (MDS)
- 3.3 t-Distributed Stochastic Neighbor Embedding
4. Models of sequence behavior in our method
- 4.1 Model of truly random sequences
- 4.2 Model of minimal TTR
- 4.3 Model of maximal TTR
5. Method artefacts and specific n intervals
- 5.1 Interval of exhausting vocabulary Q
- 5.2 Interval of vocabulary saturation C
- 5.3 Interval of maximum variance H
6. Comparison to other, similar purpose methods
7. Conclusions
References
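
The abstract and the steps listed in the outline suggest a pipeline of alphabet normalization, length normalization, n-gram collection, and type-token ratio (TTR) computation, ending in a TTR vector per sequence. The following Python sketch is one possible reading of that pipeline, not the chapter's actual implementation. The function names (`normalize_alphabet`, `ttr`, `ttr_vector`) are hypothetical, as are the specific choices made here: relabeling symbols by order of first appearance, truncating to the first `k` symbols, and using overlapping n-grams.

```python
from typing import Dict, Hashable, List, Sequence


def normalize_alphabet(seq: Sequence[Hashable]) -> List[int]:
    """Relabel symbols as 0, 1, 2, ... in order of first appearance.

    Assumption: this is only one plausible reading of "normalization
    of the alphabet"; the chapter may define it differently.
    """
    codes: Dict[Hashable, int] = {}
    return [codes.setdefault(symbol, len(codes)) for symbol in seq]


def ttr(tokens: Sequence[Hashable]) -> float:
    """Type-token ratio: number of distinct items divided by number of items."""
    return len(set(tokens)) / len(tokens)


def ttr_vector(seq: Sequence[Hashable], k: int, max_n: int) -> List[float]:
    """TTR of overlapping n-grams for n = 1..max_n over the first k symbols.

    Assumption: length normalization is done by truncation to k symbols,
    so that sequences of different lengths yield comparable vectors.
    """
    if not 1 <= max_n <= k <= len(seq):
        raise ValueError("need 1 <= max_n <= k <= len(seq)")
    prefix = normalize_alphabet(seq)[:k]
    vector = []
    for n in range(1, max_n + 1):
        # All overlapping windows of length n inside the prefix.
        ngrams = [tuple(prefix[i:i + n]) for i in range(k - n + 1)]
        vector.append(ttr(ngrams))
    return vector


if __name__ == "__main__":
    print(ttr_vector("abababab", k=8, max_n=3))  # strictly periodic sequence
    print(ttr_vector("abcdefgh", k=8, max_n=3))  # every symbol distinct
```

Under these assumptions a strictly periodic sequence yields low TTR values for every n, while a sequence with no repeated symbol yields 1.0 throughout. Vectors computed with the same `k` and `max_n` can then be plotted as line charts or passed to MDS or t-SNE, as the abstract describes.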

Published online: 22 December 2021

https://doi.org/10.1075/cilt.356.03mat

References (29)

Blei, David M., Andrew Y. Ng & Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(4–5). 993–1022.

Bolshoy, Alexander, Zeev (Vladimir) Volkovich, Valery Kirzhner & Zeev Barzily. 2010. Genome clustering from linguistic models to classification of genetic texts. Berlin: Springer.

Conroy, Matthew M. 2018. A collection of dice problems. [URL] (16 August, 2018.)

Cornwell, Benjamin. 2015. Social sequence analysis: Methods and applications (Structural analysis in the social sciences 37). Cambridge: Cambridge University Press.

d’Imperio, Mary E. 1978. The Voynich manuscript: An elegant enigma. Fort George G. Meade, MD: National Security Agency/Central Security Service.

Gastwirth, Joseph L. 1972. The estimation of the Lorenz curve and Gini index. The Review of Economics and Statistics 54(3). 306–316.

Govindan, Vidya, Rajat Subhra Chakraborty, Pranesh Santikellur & Aditya Kumar Chaudhary. 2018. A hardware Trojan attack on FPGA-based cryptographic key generation: Impact and detection. Journal of Hardware and Systems Security 2. 225–239.

Haahr, Mads. 2018. True random integer generator, RANDOM.ORG: True Random Number Service. Randomness and Integrity Services Ltd.

Hamano, Kenji & Hirosuke Yamamoto. 2010. Randomness test based on T-complexity. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E93-A(7). 1346–1354.

Hamid, Raffay, Amos Johnson, Samir Batta, Aaron Bobick, Charles Isbell & Graham Coleman. 2005. Detection and explanation of anomalous activities: representing activities as bags of event n-grams. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1. 1031–1038.

Huffman, David A. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40(9). 1098–1101.

Jain, Ashish & Narendra S. Chaudhari. 2015. A new heuristic based on the cuckoo search for cryptanalysis of substitution ciphers. In Sabri Arik, Tingwen Huang, Weng Kin Lai & Qingshan Liu (eds.), Neural Information Processing (Lecture Notes in Computer Science 9490). 206–215. Dordrecht: Springer.

Lasry, George. 2018. A methodology for the cryptanalysis of classical ciphers with search metaheuristics. Kassel: Kassel University Press.

Maaten, Laurens Van Der & Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(9). 2579–2605.

Matlach, Vladimír. 2018. Aplikace kvantitativní lingvistiky na analýzu sekvencí. Olomouc: Palacký University Olomouc PhD dissertation. [URL] (5 December, 2019.)

Mikros, George & Jan Mačutek (eds.). 2015. Sequences in language and text, Volume 69. Berlin: Walter de Gruyter GmbH & Co KG.

Mitzenmacher, Michael & Eli Upfal. 2005. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge: Cambridge University Press.

Pritchard, Jonathan K., Matthew Stephens & Peter Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics 155(2). 945–959.

Rao, Rajesh P. 2010. Probabilistic analysis of an ancient undeciphered script. Computer 43(4). 76–80.

Rao, Rajesh P., Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari & Iravatham Mahadevan. 2009. Entropic evidence for linguistic structure in the Indus script. Science 324. 1165.

Riedel, Marko. 2018. Probability of throwing exactly V distinct sides on N sided dice by K rolls. [URL] (20 June, 2018.)

Rukhin, Andrew, Juan Soto, James Nechvatal, Miles Smid & Elaine Barker. 2001. A statistical test suite for random and pseudorandom number generators for cryptographic applications. McLean, VA: Booz-Allen and Hamilton Inc.

Schenkel, Alain, Jun Zhang & Yi-Cheng Zhang. 1993. Long range correlations in human writings. Fractals 1(1). 47–55.

Shannon, Claude E. 1948. A mathematical theory of communication. Bell System Technical Journal 27(3). 623–656.

Sproat, Richard. 2010. Ancient symbols, computational linguistics, and the reviewing practices of the general science journals. Computational Linguistics 36(3). 585–594.

Sproat, Richard. 2014. A statistical comparison of written language and nonlinguistic symbol systems. Language 90(2). 457–481.

Stuttard, Dafydd & Marcus Pinto. 2011. The web application hacker’s handbook: Finding and exploiting security flaws. Indianapolis: Wiley.

Torgerson, Warren S. 1958. Theory and methods of scaling. New York: Wiley.

Ziv, Jacob & Abraham Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24(5). 530–536.