A method for the comparison of general sequences via type-token ratio
This article proposes a new method for analyzing and comparing general linear sequences that requires minimal prior knowledge of the sequences. Sequence analysis is a broad problem studied in fields ranging from sociology and computer security to linguistics and biology. The method applies the simplest tools of quantitative linguistics so that it remains transparent and its results are easy to interpret. The results form a vector describing each sequence, which allows clustering, machine learning, and simple visualization by line charts or by multidimensional methods such as MDS or t-SNE. For completeness, the method's artifacts and several formal models are derived to describe its behavior in both common and extreme cases.
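As a rough illustration of the idea named in the title, the sketch below computes a type-token ratio (distinct n-grams divided by total n-grams) for each n-gram size and collects the values into a vector. It is a minimal sketch under stated assumptions, not the article's implementation: the function names `ngrams`, `ttr`, and `ttr_vector`, the parameter `max_n`, and the choice to normalize length by truncating to the first `k` elements are all hypothetical, and the alphabet normalization step is omitted.

```python
def ngrams(seq, n):
    """Return all overlapping n-grams of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def ttr(tokens):
    """Type-token ratio: number of distinct tokens over number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ttr_vector(seq, k, max_n):
    """TTR of the n-grams for n = 1..max_n, computed on the first k elements.

    Truncation to k elements is an assumed form of length normalization.
    """
    prefix = list(seq)[:k]
    return [ttr(ngrams(prefix, n)) for n in range(1, max_n + 1)]


if __name__ == "__main__":
    # A highly repetitive sequence and a sequence with no repetition.
    print(ttr_vector("abababab", k=8, max_n=3))
    print(ttr_vector("abcdefgh", k=8, max_n=3))
```

Under these assumptions, a repetitive sequence yields low values across all n, while a sequence without repeated elements yields 1.0 everywhere; vectors of this kind are what could then be plotted or clustered.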
Article outline
- 1. Introduction: Sequence analysis and its importance
- 2. Method: Quantitative linguistics and the most basic methodology
- Step 1: Normalization of the alphabet
- Step 2: Length normalization
- Step 3: Contextual data collection from n-grams
- Note: Choosing sequence length k and n-gram size bounding
- Step 4: Quantification of properties
- Step 5: Interpretation, visualizations, and beyond
- 3. Visualization methods
- 3.1 Basic line-chart
- 3.2 Classical Multidimensional Scaling (MDS)
- 3.3 t-Distributed Stochastic Neighbor Embedding
- 4. Models of sequence behavior in our method
- 4.1 Model of truly random sequences
- 4.2 Model of minimal TTR
- 4.3 Model of maximal TTR
- 5. Method artefacts and specific n intervals
- 5.1 Interval of exhausting vocabulary Q
- 5.2 Interval of vocabulary saturation C
- 5.3 Interval of maximum variance H
- 6. Comparison to other, similar purpose methods
- 7. Conclusions
References