Distance Between Languages as Measured by the Minimal-Entropy Model; Plato’s Republic—Slovenian Versus 15 Other Translations

Jakopin, Primož

doi:10.1075/ijcl.6.si.05jak

Article published in:

Text Corpora and Multilingual Lexicography
Wolfgang Teubert
[International Journal of Corpus Linguistics 6:SI] 2001
pp. 43–53

In this paper, a language model based on the probabilities of text n-grams is used as a measure of distance between Slovenian and 15 other European languages. During the construction of the model, a Huffman tree is generated from all the n-grams (n = 1 to 32, frequency 2 or more) in a training corpus of Slovenian literary texts (2.7 million words), and the corresponding Huffman code is computed for every leaf of the tree. To apply the model to a new text sample, the sample is cut into n-grams (n = 1 to 32) in such a way that the sum of the model's Huffman code lengths over all the resulting n-grams is minimal.
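The two steps described above (deriving Huffman code lengths from n-gram frequencies, then choosing the cheapest segmentation of a new text) can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the toy corpus, and in particular the fixed `unknown_cost` charged for characters missing from the model are assumptions made for illustration, since the abstract does not say how unseen characters are handled.

```python
import heapq
from collections import Counter
from itertools import count


def ngram_counts(text, max_n=32, min_freq=2):
    """Count all character n-grams (n = 1..max_n) occurring at least min_freq times."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def huffman_code_lengths(freqs):
    """Return the Huffman code length (in bits) of every symbol in freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    tiebreak = count()  # unique counter so heap entries never compare the lists
    heap = [(f, next(tiebreak), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths


def min_code_length(text, lengths, max_n=32, unknown_cost=32):
    """Minimal total code length (bits) over all segmentations of text into
    n-grams known to the model. unknown_cost is an ASSUMED penalty for a
    single character the model has never seen."""
    best = [0.0] + [float("inf")] * len(text)
    for end in range(1, len(text) + 1):
        for n in range(1, min(max_n, end) + 1):
            gram = text[end - n:end]
            cost = lengths.get(gram)
            if cost is None:
                if n > 1:
                    continue  # unknown longer n-gram: not a valid piece
                cost = unknown_cost  # unknown single character (assumption)
            best[end] = min(best[end], best[end - n] + cost)
    return best[len(text)]


# Toy usage: train on a tiny string instead of a 2.7-million-word corpus.
model = huffman_code_lengths(ngram_counts("abababab", max_n=4))
in_domain = min_code_length("abab", model, max_n=4)
out_of_domain = min_code_length("zzzz", model, max_n=4)
```

The segmentation uses dynamic programming over text positions, so each position is examined at most `max_n` times. Text that resembles the training data can be covered by long, cheap n-grams, while unfamiliar text falls back to short or unknown pieces and costs more bits.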

The above model, applied to all 16 translations of Plato’s Republic from the TELRI CD-ROM, produced the following language order (average coding length in bits per character): Slovenian (2.37), Serbocroatian (3.77), Croatian (3.84), Bulgarian (3.96), Czech (4.10), Polish (4.32), Russian (4.46), Slovak (4.46), Latvian (4.74), Lithuanian (4.94), English (5.40), French (5.67), German (5.69), Romanian (5.76), Finnish (6.11), and Hungarian (6.47).
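One way to write down the quantity reported in parentheses is given below. The notation is an assumption introduced here, not taken from the paper: $T$ is a translation of length $|T|$ characters, $g_1 g_2 \cdots g_k$ is a segmentation of $T$ into n-grams, and $\ell(g)$ is the length in bits of the Huffman code assigned to n-gram $g$ by the Slovenian model.

```latex
\[
  \bar{\ell}(T) \;=\; \frac{1}{|T|}\,
  \min_{\substack{g_1 g_2 \cdots g_k \,=\, T \\ 1 \le |g_i| \le 32}}
  \;\sum_{i=1}^{k} \ell(g_i)
\]
```

Under this reading, a lower value of $\bar{\ell}(T)$ means the text is coded more compactly by the Slovenian model, which is why the Slovenian translation itself comes first in the ordering.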

Keywords: entropy, Huffman coding, Plato’s Republic, text statistics, TELRI, quantitative linguistics, European languages, language model

Published online: 17 December 2001

