Modelling crosslinguistic n‑gram correspondence in typologically different languages
Jiří Milička | Charles University, Czech Republic
Václav Cvrček | Charles University, Czech Republic
Lucie Lukešová | Charles University, Czech Republic
N‑gram analysis (popularized by, e.g., Biber et al., 1999) has
become a widely used method for identifying recurrent language patterns. Although the extraction of n‑grams from a corpus may seem
straightforward, it proves to be very challenging when applied cross-linguistically (cf. e.g. Ebeling
and Ebeling, 2013; Granger and Lefer, 2013; Čermáková and Chlumská, 2017). The major issue is that the quantities of n‑grams of a certain length in typologically different
languages do not correspond. Consequently, n‑grams of a given length may function differently across languages, rendering a direct
comparison inadequate. Our paper introduces a function capable of modelling the relation between the quantities of n‑grams in typologically
distant languages, using the example of Czech and English (and some other language pairs). Based on our model, we can suggest what n‑gram
lengths should be contrasted to better reflect the size of n‑gram inventories in each language. The correspondence may not be intuitive
(e.g. a Czech 2-gram may best correspond to an English 2.5-gram), but it still provides researchers with a general guide as to what might be
useful to include in their analysis (e.g. in this case 2-grams in Czech and 2- and 3-grams in English).
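The idea of matching n‑gram lengths by inventory size can be illustrated with a minimal sketch. It is not the function proposed in the paper: it assumes whitespace-tokenized input, measures inventory size as the number of distinct n‑grams, and uses plain log-linear interpolation between integer lengths as a stand-in for the authors' model. The names `ngram_type_counts` and `corresponding_length` are hypothetical.

```python
import math


def ngram_type_counts(tokens, max_n):
    """Return {n: number of distinct n-grams} for n = 1..max_n."""
    return {
        n: len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
        for n in range(1, max_n + 1)
    }


def corresponding_length(target_count, counts):
    """Estimate the fractional n at which `counts` reaches `target_count`.

    Assumption: inventory size grows roughly exponentially between two
    neighbouring integer lengths, so we interpolate linearly on log counts.
    Returns None when the target lies outside the observed range.
    """
    lengths = sorted(counts)
    for lo, hi in zip(lengths, lengths[1:]):
        c_lo, c_hi = counts[lo], counts[hi]
        if c_lo < c_hi and c_lo <= target_count <= c_hi:
            frac = (math.log(target_count) - math.log(c_lo)) / (
                math.log(c_hi) - math.log(c_lo)
            )
            return lo + frac
    return None


def match_lengths(tokens_a, tokens_b, max_n=5):
    """For each n in language A, find the fractional length in language B
    whose inventory of distinct n-grams is of comparable size."""
    counts_a = ngram_type_counts(tokens_a, max_n)
    counts_b = ngram_type_counts(tokens_b, max_n)
    return {n: corresponding_length(c, counts_b) for n, c in counts_a.items()}
```

Applied to the two sides of a parallel corpus, `match_lengths` returns, for each length in the first language, a possibly fractional length in the second. A fractional result such as 2.5 would be read as a suggestion to include both neighbouring integer lengths in the comparison.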
Keywords: n‑grams, parallel corpus, correspondence, Czech/English/Spanish
Published online: 12 January 2021
https://doi.org/10.1075/lic.19018.mil
References
Baker, M.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E.
Biber, D., Kim, Y. and Tracy-Ventura, N.
Cheng, W., Greaves, C. and Warren, M.
Cortes, V.
Cvrček, V.
2019 Calc: Corpus Calculator. Prague: Czech National Corpus. Available at http://www.korpus.cz/calc
Čermák, F. and Rosen, A.
Čermáková, A. and Chlumská, L.
Ebeling, J. and Ebeling, S. Oksefjell
2017 A Cross-Linguistic Comparison of Recurrent Word Combinations in a Comparable Corpus of English and Norwegian Fiction. In Contrasting English and other Languages through Corpora, M. Janebová, E. Lapshinova-Koltunski and M. Martínková (eds), 2–31. Newcastle upon Tyne: Cambridge Scholars Publishing.
Forchini, P. and Murphy, A. C.
Granger, S.
Granger, S. and Lefer, M.-A.
Hasselgård, H.
Milička, J.
Nebeský, L. and Novák, P.
Rosen, A., Vavřín, M. and Zasina, A. J.
2018 The InterCorp Corpus, Version 11 of 11 October 2018. Prague: Institute of the Czech National Corpus, FF UK. Available at www.korpus.cz
Sinclair, J.
Tracy-Ventura, N., Cortes, V. and Biber, D.