Publication details [#50894]

Moisl, Hermann. 2009. Using electronic corpora to study language variation: The problem of data sparsity. In Karyolemou, Marilena, Pavlos Pavlou and Stavroula Tsiplakou, eds. Language Variation – European perspectives II. Selected papers from the 4th International Conference on Language Variation in Europe (ICLaVE 4), Nicosia, June 2007. (Studies in Language Variation 5). John Benjamins. pp. 169–178.

Publication type

Article in book

Publication language

English

Keywords

computational linguistics | language variation | statistics

Place, Publisher

John Benjamins

Annotation

As more and larger electronic corpora of natural language text appear, effective linguistic analysis of them will increasingly be tractable only with the computational interpretative methods developed by the statistics, information retrieval, and related communities. To use such methods effectively, however, the issues that arise in abstracting data from corpora must be understood. This paper addresses one issue that has a fundamental bearing on the validity of analytical results based on such data: sparsity. The discussion is in three main parts. The first shows how a particular class of computational methods, exploratory multivariate analysis, can be used in language variation research; the second explains why data sparsity can be a problem in such analysis; and the third outlines a solution.