Vol. 21:2 (2021) ► pp.217–249
Modelling crosslinguistic n‑gram correspondence in typologically different languages
N‑gram analysis (popularized e.g. by Biber et al., 1999) has become a widely used method for identifying recurrent language patterns. Although extracting n‑grams from a corpus may seem straightforward, it proves very challenging when applied cross-linguistically (cf. e.g. Ebeling and Ebeling, 2013; Granger and Lefer, 2013; Čermáková and Chlumská, 2017). The major issue is that the quantities of n‑grams of a given length do not correspond across typologically different languages. Consequently, n‑grams of the same length may function differently across languages, rendering a direct comparison inadequate. Our paper introduces a function capable of modelling the relation between the quantities of n‑grams in typologically distant languages, using Czech and English as the main example, alongside some other language pairs. Based on our model, we can suggest which n‑gram lengths should be contrasted to better reflect the size of n‑gram inventories in each language. The correspondence may not be intuitive (e.g. a Czech 2-gram may best correspond to an English 2.5-gram), but it still provides researchers with a general guide as to what might be useful to include in their analysis (in this case, 2-grams in Czech and both 2- and 3-grams in English).
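The notion of "quantities of n‑grams of a certain length" can be made concrete with a minimal sketch. The code below is only an illustration and not the extraction procedure or the model used in the paper: the function name `ngram_type_count`, the `min_freq` parameter, and the toy token sequence are all assumptions introduced here. It counts the distinct n‑grams (types) of each length, optionally keeping only those that reach a frequency threshold.

```python
from collections import Counter


def ngram_type_count(tokens, n, min_freq=1):
    """Count distinct n-grams (types) that occur at least min_freq times.

    Hypothetical helper for illustration; not the authors' extraction tool.
    """
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return sum(1 for freq in counts.values() if freq >= min_freq)


# Invented toy data, not taken from the corpus used in the study.
tokens = "the cat sat on the mat and the cat sat on the rug".split()

for n in (1, 2, 3):
    all_types = ngram_type_count(tokens, n)
    recurrent = ngram_type_count(tokens, n, min_freq=2)
    print(f"{n}-grams: {all_types} types, {recurrent} with frequency >= 2")
```

Running the same count on aligned texts in two languages would give, for each n, a pair of inventory sizes; comparing how those sizes grow with n is the kind of relation a cross-linguistic correspondence model would need to capture.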
Article outline
- 1. Introduction
- 1.1 N‑grams in corpus linguistics
- 1.2 Major issues in cross-linguistic n‑gram correspondence
- 1.2.1 N‑gram length
- 1.2.2 The number of n‑gram types
- 1.2.3 The frequency threshold
- 1.3 Our research questions
- 2. Data
- 2.1 Corpus material
- 2.2 N‑gram extraction
- 2.2.1 Word order and syntactic boundaries
- 2.2.2 N‑gram settings for this study
- 3. Searching for a model
- 3.1 From a basic formula to an adequate model
- 3.2 Fitting the model
- 4. Results
- 4.1 Czech-English texts
- 4.2 Czech-Spanish texts
- 4.3 Parameters for other language pairs
- 5. Conclusion
- Acknowledgements
- Notes
- References