A combined rule-based and statistical tagger: OBT+stat

Johannessen, Janne Bondi; Hagen, Kristin; Lynum, André; Nøklestad, Anders

doi:10.1075/scl.49.03joh

Part of

Exploring Newspaper Language: Using the web to create and investigate a large corpus of modern Norwegian
Edited by Gisle Andersen
[Studies in Corpus Linguistics 49] 2012
pp. 51–66

Janne Bondi Johannessen | The Text Laboratory, University of Oslo

Kristin Hagen | The Text Laboratory, University of Oslo

André Lynum | The Text Laboratory, University of Oslo

Anders Nøklestad | The Text Laboratory, University of Oslo

The paper describes how the rule-based Constraint Grammar (CG) Oslo-Bergen Tagger (OBT) was improved by adding a statistical module. CG taggers by nature leave some words ambiguous between different readings where the linguistically motivated rules lack coverage. Such residual ambiguities are often a problem for applications that use the tagger, among them the Norwegian Newspaper Corpus. Our statistical module removes part-of-speech (PoS) and morphological ambiguities and also disambiguates lemmas. We show how the new system, referred to as OBT+stat, straightforwardly combines the strengths of the linguistic knowledge-based CG approach with data-driven methods. The result is a high-performing, fully disambiguating PoS/morphological tagger and lemmatizer with very satisfactory evaluation results.

Published online: 23 March 2012

Cited by 2 other publications

Hagen, Kristin & Øystein A. Vangsnes

2023. LIA-korpusa – eldre talemålsopptak for norsk og samisk gjort tilgjengelege. Nordlyd 47:2, pp. 119 ff.

Haugen, Tor Arne

2021. When complementation gets specific: A study of collocational preferences in verb–object combinations in Norwegian. Nordic Journal of Linguistics 44:1, pp. 71 ff.

This list is based on CrossRef data as of 22 June 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.