Genre annotation for the Web
Text-external and text-internal perspectives
Serge Sharoff | University of Leeds
This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as
enTenTen, ukWaC or ruWac, by means of automatic text classification. First, the paper presents a deep learning model suitable for
classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting.
Second, it describes the results of applying the automatic classification model to these corpora and compares their composition.
Finally, the paper introduces a framework for interpreting the results of automatic genre classification using linguistic
features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across
languages.
Article outline
- 1. Introduction
- 1.1 Text-external communicative functions
- 1.2 Text-internal linguistic features
- 2. Automatic genre identification
- 2.1 Text classification model
- 2.2 Datasets for training
- 2.3 Prediction accuracy
- 3. Comparing large Web corpora
- 4. Communicative functions vs linguistic features
- 4.1 Detection of linguistic features
- 4.2 Mapping linguistic features to functions
- 4.3 Linguistic features across languages
- 5. Related studies on computational analysis of genres
- 6. Conclusions and further work
- Notes
- References
Published online: 03 June 2021
https://doi.org/10.1075/rs.19015.sha