Annotation uncertainty in the context of grammatical change
This paper elaborates on the notion of uncertainty in the context of annotation in large text corpora, specifically focusing on (but not limited to) historical languages. Such uncertainty might be due to inherent properties of the language, for example, linguistic ambiguity and overlapping categories of linguistic description, but could also be caused by a lack of annotation expertise. By examining annotation uncertainty in more detail, we identify its sources, deepen our understanding of the nature and different types of uncertainty encountered in daily annotation practice, and discuss practical implications of our theoretical findings. This paper can be seen as an attempt to reconcile the perspectives of the main scientific disciplines involved in corpus projects, linguistics and computer science, to develop a unified view and to highlight the potential synergies between these disciplines.
Article outline
- 1. Introduction
- 2. Current annotation practice and limitations
- 3. Uncertainty in historical (Corpus) linguistics
- 3.1 Project context and underlying corpus
- 3.2 Annotation uncertainties
- 3.2.1 Overlapping categories and the gradualness of change
- 3.2.2 Types of categorical gradience
- 3.2.3 The human annotator as a source of (subjective) uncertainty
- 4. Mathematical modeling of uncertainty
- 4.1 Frame of discernment and ground-truth
- 4.2 Uncertainty measures and calculi
- 4.3 Vagueness, fuzziness, and graded notions of truth
- 4.4 Granularity
- 5. A unified view of uncertainty
- 5.1 Fuzziness and ambiguity
- 5.2 Incompleteness and lack of knowledge
- 6. Practical implications
- 6.1 Types of annotation
- 6.2 Experience from the annotation practice: Tagging ambiguities and uncertainties
- 6.2.1 Human expert annotator
- 6.2.2 Machine annotator
- 7. Conclusions
- Notes
- References