Automatic topic classification of a large newspaper corpus
This paper describes how machine learning methods, together with a limited amount of manual classification, were used in a fairly successful attempt to automatically classify newspaper articles from The Norwegian Newspaper Corpus by topic. The Norwegian Newspaper Corpus is challenging to classify automatically, since not all of it has been cleaned of boilerplate. Our automatic topic classifier achieved 54% accuracy, while our human annotators achieved only slightly more, 59% accuracy. We used several machine learning methods built into the programming language CRM114. Six annotators manually classified 1,400 articles from the corpus. The machine learner was trained only on the errors it made when attempting to classify these articles.
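The train-on-error regime mentioned above can be illustrated with a minimal sketch. It is written in Python rather than CRM114, and it uses a small multinomial naive Bayes model as a stand-in classifier; the class `NaiveBayes`, the function `train_on_error`, and the whitespace tokenisation are all assumptions made for illustration, not the classifiers or preprocessing used in this work. The point is only the control flow: an article is added to the model only when the model's current guess disagrees with the annotator's label.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self, topics):
        self.topics = list(topics)
        self.word_counts = {t: Counter() for t in self.topics}
        self.total_words = {t: 0 for t in self.topics}
        self.vocab = set()

    def learn(self, text, topic):
        # Hypothetical tokenisation: lowercase and split on whitespace.
        words = text.lower().split()
        self.word_counts[topic].update(words)
        self.total_words[topic] += len(words)
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        v = len(self.vocab) + 1  # +1 leaves room for unseen words

        def score(topic):
            counts = self.word_counts[topic]
            total = self.total_words[topic]
            return sum(math.log((counts[w] + 1) / (total + v)) for w in words)

        # Ties are resolved in favour of the first topic in self.topics.
        return max(self.topics, key=score)


def train_on_error(model, labelled_articles):
    """One pass over the labelled data; learn an article only if it is misclassified.

    Returns the number of errors, i.e. the number of articles actually learned.
    """
    errors = 0
    for text, topic in labelled_articles:
        if model.classify(text) != topic:
            model.learn(text, topic)
            errors += 1
    return errors
```

A typical use would be to call `train_on_error` repeatedly over the manually classified articles until a pass produces no errors, or until a fixed number of passes has been reached. Compared with training on every article, this keeps the model small and concentrates learning on the cases the classifier finds hard.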