Part-of-speech ratios in English corpora

Hardie, Andrew

doi:10.1075/ijcl.12.1.05har

Article published in: International Journal of Corpus Linguistics, Vol. 12:1 (2007), pp. 55–81

Andrew Hardie | Lancaster University

Using part-of-speech (POS) tagged corpora, Hudson (1994) reports that approximately 37% of English tokens are nouns, where ‘noun’ is a superordinate category including nouns, pronouns and other word-classes. It is argued here that difficulties relating to the boundaries of Hudson’s ‘noun’ category demonstrate that there is no uncontroversial way to derive such a superordinate category from POS tagging. Decisions regarding the boundary of the ‘noun’ category have small but statistically significant effects on the ratio that emerges for ‘nouns’ as a whole. Tokenisation and categorisation differences between tagging schemes make it problematic to compare the ratio of ‘nouns’ across different tagsets. The precise figures for POS ratios are therefore effectively artefacts of the tagset. However, these objections to the use of POS ratios do not apply to their use as a metric of variation for comparing datasets tagged with the same tagging scheme.
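The dependence of a POS ratio on category boundaries can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the tag labels, the ten-token sample, the helper `pos_ratio`, and the two boundary definitions `NARROW` and `BROAD` are hypothetical, and none of them reproduces Hudson's category or any tagset used in the article.

```python
from typing import FrozenSet, Iterable, Tuple

TaggedToken = Tuple[str, str]  # (word form, POS tag)


def pos_ratio(
    tagged: Iterable[TaggedToken],
    category: FrozenSet[str],
    ignore: FrozenSet[str] = frozenset(),
) -> float:
    """Share of counted tokens whose tag belongs to `category`.

    Tokens whose tag is in `ignore` are left out of both the numerator
    and the denominator, which mimics a tokenisation decision such as
    not counting punctuation as tokens.
    """
    counted = [tag for _, tag in tagged if tag not in ignore]
    if not counted:
        raise ValueError("no tokens left to count")
    hits = sum(1 for tag in counted if tag in category)
    return hits / len(counted)


# Hypothetical toy sample with simplified, made-up tag labels.
SAMPLE = [
    ("she", "PRON"), ("read", "VERB"), ("the", "DET"), ("report", "NOUN"),
    ("and", "CONJ"), ("it", "PRON"), ("surprised", "VERB"),
    ("London", "PROPN"), ("readers", "NOUN"), (".", "PUNCT"),
]

# Two hypothetical boundaries for a superordinate 'noun' category.
NARROW = frozenset({"NOUN", "PROPN"})
BROAD = NARROW | frozenset({"PRON"})

print(f"narrow: {pos_ratio(SAMPLE, NARROW):.2f}")  # narrow: 0.30
print(f"broad:  {pos_ratio(SAMPLE, BROAD):.2f}")   # broad:  0.50
```

In this toy sample the same text gives a different 'noun' ratio under each boundary, and passing `ignore=frozenset({"PUNCT"})` shifts the figure again by shrinking the denominator. Comparing two datasets under one fixed `category` and one fixed `ignore` setting keeps those decisions constant, which corresponds to the same-tagset comparison the abstract treats as unproblematic.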

Keywords: part-of-speech tagging, word-class frequency, text type, tagset, tagging scheme, Brown, LOB, BNC Sampler

Published online: 6 April 2007
