Automatic topic classification of a large newspaper corpus

Hagen, Thomas M.

doi:10.1075/scl.49.06hag

Part of

Exploring Newspaper Language: Using the web to create and investigate a large corpus of modern Norwegian
Edited by Gisle Andersen
[Studies in Corpus Linguistics 49] 2012
pp. 111–130

Thomas M. Hagen | Uni Computing

This paper describes how machine learning methods, combined with a limited amount of manual classification, were used in a fairly successful attempt to classify newspaper articles from The Norwegian Newspaper Corpus automatically by topic. The corpus is challenging to classify automatically, since not all of it has been cleaned of boilerplate. Our automatic topic classifier achieved 54% accuracy, while our human annotators achieved only slightly more, 59%. We used several machine learning methods built into the programming language CRM114. Six annotators manually classified 1,400 articles from the corpus. The machine learner was trained only on the errors it made when attempting to classify these articles.
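
The train-on-error regime mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the CRM114 setup used in the chapter: the TinyNaiveBayes class, the train_on_error function, the whitespace tokenization, and the toy topic labels are all hypothetical stand-ins chosen for illustration. The only idea taken from the abstract is that the learner is updated solely on articles it misclassifies.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an assumed stand-in learner)."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.word_counts = {label: Counter() for label in self.labels}
        self.doc_counts = Counter()
        self.vocab = set()

    def learn(self, tokens, label):
        # Add one labelled article to the model's counts.
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab) or 1
        best_label, best_score = self.labels[0], -math.inf
        for label in self.labels:
            # Smoothed log prior, so an untrained model still returns a label.
            score = math.log(
                (self.doc_counts[label] + 1) / (total_docs + len(self.labels))
            )
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                score += math.log(
                    (self.word_counts[label][tok] + 1) / (total_words + vocab_size)
                )
            # Strict comparison: on a tie, the earlier label is kept.
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_on_error(model, labelled_articles):
    """Classify each article first; update the model only if the guess was wrong.

    Returns the number of updates, i.e. the number of errors trained on.
    """
    updates = 0
    for text, label in labelled_articles:
        tokens = text.lower().split()
        if model.classify(tokens) != label:
            model.learn(tokens, label)
            updates += 1
    return updates


if __name__ == "__main__":
    # Toy data, invented for this example only.
    articles = [
        ("goal match striker", "sport"),
        ("election vote parliament", "politics"),
    ]
    model = TinyNaiveBayes(["sport", "politics"])
    print("updates:", train_on_error(model, articles))
```

Because correctly classified articles are skipped, the model stores far fewer examples than it sees, at the cost of depending on the order in which the articles are presented.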

Published online: 23 March 2012

https://doi.org/10.1075/scl.49.06hag