Article published in: Exploring Newspaper Language: Using the web to create and investigate a large corpus of modern Norwegian
Edited by Gisle Andersen
[Studies in Corpus Linguistics 49] 2012
pp. 111–130
Automatic topic classification of a large newspaper corpus
This paper describes how machine learning methods, along with a limited amount of manual classification, were used in a fairly successful attempt to automatically classify, by topic, newspaper articles from The Norwegian Newspaper Corpus. The Norwegian Newspaper Corpus is challenging to classify automatically, since not all of it has been cleaned of boilerplate. Our automatic topic classifier achieved 54% accuracy, while our human annotators achieved only slightly more: 59% accuracy. We used several machine learning methods built into the programming language CRM114. We used six annotators, who manually classified 1,400 articles from the corpus. The machine learner trained only on errors it made when attempting to classify these articles.
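The train-on-error regime mentioned above can be illustrated with a minimal Python sketch. It is not the CRM114 code used in the paper: the classifier here is a small naive Bayes model swapped in for illustration, and the names `TinyNaiveBayes` and `train_on_error`, the topic labels, and the sample texts are all hypothetical. The point is only the control flow: the learner is updated with a manually labelled article when, and only when, it misclassifies that article.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Minimal multinomial naive Bayes over whitespace tokens.

    An illustrative stand-in, not one of CRM114's built-in classifiers.
    """

    def __init__(self, topics):
        self.topics = list(topics)
        self.word_counts = {t: Counter() for t in self.topics}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, topic):
        tokens = text.lower().split()
        self.word_counts[topic].update(tokens)
        self.doc_counts[topic] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab) + 1
        best_topic, best_score = self.topics[0], -math.inf
        for topic in self.topics:
            # Laplace-smoothed log prior plus log likelihood of each token.
            score = math.log(
                (self.doc_counts[topic] + 1) / (total_docs + len(self.topics))
            )
            total_words = sum(self.word_counts[topic].values())
            for tok in tokens:
                score += math.log(
                    (self.word_counts[topic][tok] + 1) / (total_words + vocab_size)
                )
            # Strict comparison: ties go to the topic listed first.
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


def train_on_error(classifier, labelled_articles):
    """Train only on articles the classifier currently gets wrong.

    Returns the number of training updates performed.
    """
    updates = 0
    for text, topic in labelled_articles:
        if classifier.classify(text) != topic:
            classifier.train(text, topic)
            updates += 1
    return updates


if __name__ == "__main__":
    # Hypothetical toy data; the real annotated set had 1,400 articles.
    articles = [
        ("fotball kamp mål", "sport"),
        ("regjering valg storting", "politics"),
        ("fotball mål seier", "sport"),
    ]
    clf = TinyNaiveBayes(["sport", "politics"])
    print("updates:", train_on_error(clf, articles))
    print("prediction:", clf.classify("regjering valg"))
```

Under this regime, articles that the model already classifies correctly contribute nothing to it, so the number of updates can be much smaller than the number of labelled articles.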
Published online: 23 March 2012