Vol. 3:1 (2021), pp. 1–32
Genre annotation for the Web
Text-external and text-internal perspectives
This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as enTenTen, ukWac or ruWac, by means of automatic text classification. First, the paper presents a Deep Learning model suitable for classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting. Second, it describes the results of applying the automatic classification model to these corpora and compares their composition. Finally, the paper introduces a framework for interpreting the results of automatic genre classification using linguistic features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across languages.
Article outline
- 1. Introduction
- 1.1 Text-external communicative functions
- 1.2 Text-internal linguistic features
- 2. Automatic genre identification
- 2.1 Text classification model
- 2.2 Datasets for training
- 2.3 Prediction accuracy
- 3. Comparing large Web corpora
- 4. Communicative functions vs linguistic features
- 4.1 Detection of linguistic features
- 4.2 Mapping linguistic features to functions
- 4.3 Linguistic features across languages
- 5. Related studies on computational analysis of genres
- 6. Conclusions and further work
- Notes
- References
https://doi.org/10.1075/rs.19015.sha