Part-of-speech ratios in English corpora
Andrew Hardie | Lancaster University
Using part-of-speech (POS) tagged corpora, Hudson (1994) reports that approximately 37% of English tokens are nouns, where ‘noun’ is a superordinate category including nouns, pronouns and other word-classes. It is argued here that difficulties relating to the boundaries of Hudson’s ‘noun’ category demonstrate that there is no uncontroversial way to derive such a superordinate category from POS tagging. Decisions regarding the boundary of the ‘noun’ category have small but statistically significant effects on the ratio that emerges for ‘nouns’ as a whole. Tokenisation and categorisation differences between tagging schemes make it problematic to compare the ratio of ‘nouns’ across different tagsets. The precise figures for POS ratios are therefore effectively artefacts of the tagset. However, these objections to the use of POS ratios do not apply to their use as a metric of variation for comparing datasets tagged with the same tagging scheme.
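The sensitivity of the ‘noun’ ratio to the category boundary can be illustrated with a minimal sketch. It assumes a toy sample and a simplified, hypothetical tagset (`NOUN`, `PROPN`, `PRON` and so on). Neither the tags nor the two boundary definitions (`NARROW`, `BROAD`) reproduce Hudson’s (1994) category or any real tagging scheme, and the function name `category_ratio` is likewise illustrative.

```python
from collections import Counter

# Toy tagged sample: (token, tag) pairs. The tags are simplified and
# hypothetical; they do not correspond to any real tagset.
TAGGED = [
    ("the", "DET"), ("cat", "NOUN"), ("saw", "VERB"), ("her", "PRON"),
    ("in", "PREP"), ("London", "PROPN"), ("yesterday", "ADV"),
    ("and", "CONJ"), ("it", "PRON"), ("fled", "VERB"),
]


def category_ratio(tagged, member_tags):
    """Return the share of tokens whose tag belongs to member_tags."""
    if not tagged:
        raise ValueError("cannot compute a ratio over an empty sample")
    counts = Counter(tag for _, tag in tagged)
    # Counter returns 0 for tags absent from the sample.
    return sum(counts[tag] for tag in member_tags) / len(tagged)


# Two assumed boundaries for the superordinate 'noun' category.
NARROW = {"NOUN", "PROPN"}   # common and proper nouns only
BROAD = NARROW | {"PRON"}    # pronouns also counted as 'nouns'

narrow = category_ratio(TAGGED, NARROW)
broad = category_ratio(TAGGED, BROAD)

print(f"narrow boundary: {narrow:.1%}")
print(f"broad boundary:  {broad:.1%}")
```

The same tokens yield different ‘noun’ ratios depending only on which tags are assigned to the superordinate category. On a real corpus the gap would depend on the tagset and its tokenisation, which is why the comparison is only meaningful between datasets tagged with the same scheme.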
Keywords: part-of-speech tagging, word-class frequency, text type, tagset, tagging scheme, Brown, LOB, BNC Sampler
Published online: 06 April 2007
https://doi.org/10.1075/ijcl.12.1.05har