Discourse annotation in the MULTINOT corpus
Issues and challenges
This chapter summarises and discusses recent work, carried out within the framework of the MULTINOT project, on the development of a bilingual (English-Spanish) corpus. The corpus consists of original comparable and parallel texts from a variety of genres, annotated with complex linguistic features such as modality and evidentiality, metadiscourse markers, and thematization. Annotating these features in bilingual parallel texts poses important challenges for the researcher at every stage of corpus development, from the preprocessing phases to the manual annotation phase. At the same time, it allows the investigation of complex linguistic research questions that could not be addressed on the basis of raw corpora, or even with the help of an automatic part-of-speech tagging system.
Article outline
- 1. Introduction
- 2. The MULTINOT corpus
- 3. Annotation procedure
- 3.1 Selecting the “training” corpus
- 3.2 Instantiating the theory
- 3.3 Designing annotation schemes and guidelines
- 3.4 Performing annotation experiments
- 3.5 Evaluating the annotations
- 3.6 Large-scale annotation of the whole corpus
- 4. Annotating thematization in English and Spanish
- 5. Annotating modality in English and Spanish
- 6. Annotating metadiscourse markers in English and Spanish
- 7. Summary and concluding remarks
- Acknowledgement
- Notes
- References