Vol. 47:4 (2023), pp. 789–829
Towards robust complexity indices in linguistic typology
A corpus-based assessment
Corpus-based approaches to language complexity are widely expected to help explain linguistic diversity. Several complexity indices have consequently been proposed to compare languages along different dimensions, especially in phonology and morphology. However, their robustness to changes in corpus size and content has not been systematically assessed, which impedes comparability between studies. Here, we systematically test the robustness of four complexity indices estimated from raw texts: two that are routinely used in crosslinguistic studies (Type-Token Ratio and word-level Entropy) and two that were proposed more recently (Word Information Density and Lexical Diversity). Our results on 47 languages strongly suggest that the traditional indices are more prone to fluctuation than the newer ones. Additionally, using Word Information Density, we confirm the existence of a crosslinguistic trade-off between word-internal and across-word distributions of information. Finally, we implement a proof of concept suggesting that modern deep-learning language models can improve comparability across languages when the datasets are non-parallel.
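To make the two traditional indices named above concrete, here is a minimal sketch of how Type-Token Ratio and word-level entropy might be computed from raw text. It is an illustration under stated assumptions, not the authors' pipeline: the whitespace tokenizer, the plug-in (maximum-likelihood) entropy estimator, and the function names `tokenize`, `type_token_ratio`, and `word_entropy` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace splitting stands in for a real tokenizer.
    return text.lower().split()


def type_token_ratio(tokens):
    # Number of distinct word forms (types) divided by number of tokens.
    if not tokens:
        raise ValueError("cannot compute TTR of an empty token list")
    return len(set(tokens)) / len(tokens)


def word_entropy(tokens):
    # Plug-in Shannon entropy (in bits) of the unigram word distribution.
    # Assumption: relative frequencies are used directly as probabilities.
    if not tokens:
        raise ValueError("cannot compute entropy of an empty token list")
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = tokenize("the cat sat on the mat")
    print("TTR:", type_token_ratio(sample))
    print("Entropy (bits):", word_entropy(sample))
    # Repeating the same text doubles the tokens but adds no new types,
    # so TTR drops even though the vocabulary is unchanged.
    print("TTR on doubled text:", type_token_ratio(sample * 2))
```

The last line of the usage example hints at the size sensitivity that the abstract refers to: the ratio changes with the amount of text even when the underlying vocabulary does not.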
Article outline
- 1. Introduction
- 2. Linguistic complexity across languages: A short overview
- 3. Corpus description and subsampling strategy
- 4. Morphological complexity
  - 4.1 Grammar-based morphological complexity indices
    - 4.1.1 Methods
    - 4.1.2 Crosslinguistic overview
  - 4.2 Towards robust indices of morphological complexity
    - 4.2.1 Methods
    - 4.2.2 Results: Type-Token ratio and entropy
    - 4.2.3 Results: Measure of textual lexical diversity and word information density
  - 4.3 Comparing corpus-based and grammar-based indices
- 5. Beyond word complexity
  - 5.1 Methods
  - 5.2 Results
- 6. Breaking the parallel corpus barrier: A proof of concept
  - 6.1 Experimental framework
  - 6.2 Evaluating information content
  - 6.3 Comparing information density estimations from parallel and non-parallel corpora
- 7. General discussion
- 8. Conclusions
- Acknowledgements
- Notes
- References
https://doi.org/10.1075/sl.22034.oh