Modelling crosslinguistic n‑gram correspondence in typologically different languages
N‑gram analysis (popularized e.g. by Biber et al., 1999) has become a widely used method for identifying recurrent language patterns. Although extracting n‑grams from a corpus may seem straightforward, it proves very challenging when applied cross-linguistically (cf. e.g. Ebeling and Ebeling, 2013; Granger and Lefer, 2013; Čermáková and Chlumská, 2017). The major issue is that the quantities of n‑grams of a given length do not correspond across typologically different languages. Consequently, n‑grams of a given length may function differently across languages, rendering a direct comparison inadequate. Our paper introduces a function capable of modelling the relation between the quantities of n‑grams in typologically distant languages, using the example of Czech and English, along with some other language pairs. Based on our model, we can suggest which n‑gram lengths should be contrasted to better reflect the size of the n‑gram inventories in each language. The correspondence may not be intuitive (e.g. a Czech 2-gram may best correspond to an English 2.5-gram), but it still gives researchers a general guide as to what might be useful to include in their analysis (in this case, 2-grams in Czech and both 2- and 3-grams in English).
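To make the notion of "quantities of n‑grams of a certain length" concrete, the following minimal Python sketch counts n‑gram types in a toy token sequence and shows how a fractional correspondence such as 2.5 can be read as "include both neighbouring integer lengths". It is an illustration only: the function names (`ngram_counts`, `ngram_types`, `neighbouring_lengths`), the toy sentence and the frequency threshold are assumptions made for this example. It does not reproduce the model or the extraction settings used in the paper.

```python
from collections import Counter
import math


def ngram_counts(tokens, n):
    """Count every contiguous n-gram, represented as a tuple of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_types(tokens, n, min_freq=1):
    """Number of distinct n-grams that occur at least min_freq times."""
    return sum(1 for freq in ngram_counts(tokens, n).values() if freq >= min_freq)


def neighbouring_lengths(fractional_n):
    """Integer n-gram lengths that bracket a fractional correspondence value."""
    return sorted({math.floor(fractional_n), math.ceil(fractional_n)})


# Toy data (hypothetical, not taken from the study's corpus).
tokens = "the cat sat on the mat and the cat sat on the rug".split()

for n in (2, 3):
    print(n, ngram_types(tokens, n), ngram_types(tokens, n, min_freq=2))

# A fractional correspondence of 2.5 suggests contrasting 2- and 3-grams.
print(neighbouring_lengths(2.5))
```

On real data, the same counts would be computed separately for each language and each n‑gram length, so that the sizes of the resulting inventories can be compared.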
Article outline
- 1. Introduction
- 1.1 N‑grams in corpus linguistics
- 1.2 Major issues in cross-linguistic n‑gram correspondence
- 1.2.1 N‑gram length
- 1.2.2 The number of n‑gram types
- 1.2.3 The frequency threshold
- 1.3 Our research questions
- 2. Data
- 2.1 Corpus material
- 2.2 N‑gram extraction
- 2.2.1 Word order and syntactic boundaries
- 2.2.2 N‑gram settings for this study
- 3. Searching for a model
- 3.1 From a basic formula to an adequate model
- 3.2 Fitting the model
- 4. Results
- 4.1 Czech-English texts
- 4.2 Czech-Spanish texts
- 4.3 Parameters for other language pairs
- 5. Conclusion
- Acknowledgements
- Notes
- References
References (23)
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E.
1999 Longman Grammar of Spoken and Written English. Harlow: Longman.
Biber, D., Kim, Y. and Tracy-Ventura, N.
2010 A Corpus-Driven Approach to Comparative Phraseology: Lexical Bundles in English, Spanish, and Korean. In Japanese/Korean Linguistics, Volume 171, S. Iwasaki, H. Hoji, P. M. Clancy and S.-O. Sohn (eds), 75–94. Stanford: Center for the Study of Language and Information (CSLI).
Cheng, W., Greaves, C. and Warren, M.
Cortes, V.
2008 A Comparative Analysis of Lexical Bundles in Academic History Writing in English and Spanish. Corpora 3(1): 43–57.
Cvrček, V.
2019 Calc: Corpus Calculator. Prague: Czech National Corpus. Available at
[URL]
Čermáková, A. and Chlumská, L.
Ebeling, J. and Ebeling, S. Oksefjell
Ebeling, J. and Ebeling, S. Oksefjell
2017 A Cross-Linguistic Comparison of Recurrent Word Combinations in a Comparable Corpus of English and Norwegian Fiction. In Contrasting English and other Languages through Corpora, M. Janebová, E. Lapshinova-Koltunski and M. Martínková (eds), 2–31. Newcastle upon Tyne: Cambridge Scholars Publishing.
Forchini, P. and Murphy, A. C.
Granger, S. and Lefer, M.-A.
Hasselgård, H.
2017 Temporal Expression in English and Norwegian. In Contrasting English and other Languages through Corpora, M. Janebová, E. Lapshinova-Koltunski and M. Martínková (eds), 75–101. Newcastle upon Tyne: Cambridge Scholars Publishing.
Kim, Y.
2009 Korean Lexical Bundles in Conversations and Academic Texts. Corpora 4(2): 135–165.
Mahlberg, M.
2012 Corpus Stylistics and Dickens’s Fiction. London: Routledge.
Milička, J.
2013 Rank-Frequency Relation & Type-Token Relation: Two Sides of the Same Coin. In Methods and Applications of Quantitative Linguistics, M. Obradovič, E. Kelih and R. Köhler (eds), 163–172. Belgrade: University of Belgrade and Academic Mind.
Nebeský, L. and Novák, P.
1996 Větné faktory a jejich podíl na analýze věty. Slovo a Slovesnost 57(4): 282–295.
Rapoport, A.
1982 Zipf’s Law Re-Visited. Quantitative Linguistics 16(1): 1–28.
Rosen, A., Vavřín, M. and Zasina, A. J.
2018 The InterCorp Corpus, Version 11 of 11 October 2018. Praha: Institute of the Czech National Corpus. FF UK. Available at
[URL]
Sinclair, J.
2004 The Search for Units of Meaning. In Trust the Text: Language, Corpus and Discourse, R. Carter (ed.), 24–48. London: Routledge.
Tracy-Ventura, N., Cortes, V. and Biber, D.
2007 Lexical Bundles in Spanish Speech and Writing. In Working with Spanish Corpora, G. Parodi (ed.), 217–231. London: Continuum.
Zipf, G. K.
1949 Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Cambridge: Addison-Wesley Press.
Cited by (1)
Wang, Guanfang, Xianshan Chen, Geng Tian, Jiasheng Yang & Huiling Chen
2022. A Novel N-Gram-Based Image Classification Model and Its Applications in Diagnosing Thyroid Nodule and Retinal OCT Images. Computational and Mathematical Methods in Medicine 2022, pp. 1 ff.
This list is based on CrossRef data as of 5 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.