Genre annotation for the Web: Text-external and text-internal perspectives

Sharoff, Serge

doi:10.1075/rs.19015.sha

Article published in: Register Studies, Vol. 3:1 (2021), pp. 1–32

Serge Sharoff | University of Leeds

This paper describes a digital curation study that compares the composition of large Web corpora, such as enTenTen, ukWaC and ruWaC, by means of automatic text classification. First, the paper presents a deep learning model for classifying texts from large Web corpora into a small number of communicative functions, such as Argumentation or Reporting. Second, it describes the results of applying this classification model to the corpora and compares their composition. Finally, the paper introduces a framework for interpreting the results of automatic genre classification in terms of linguistic features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across languages.
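As a minimal illustration of what comparing corpus composition can mean once each document carries a predicted communicative function, the sketch below computes the share of each label in two corpora and the difference between those shares. It is not the paper's model or evaluation procedure: the function names `composition` and `compare` are hypothetical, and the label lists are invented toy data standing in for classifier output.

```python
from collections import Counter


def composition(labels):
    """Return the share of documents per predicted communicative function."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def compare(labels_a, labels_b):
    """Return share differences (corpus A minus corpus B) for every label seen."""
    share_a = composition(labels_a)
    share_b = composition(labels_b)
    return {
        label: share_a.get(label, 0.0) - share_b.get(label, 0.0)
        for label in sorted(set(share_a) | set(share_b))
    }


# Hypothetical classifier output, one label per document (toy data, not from the paper).
corpus_a = ["Argumentation", "Reporting", "Reporting", "Argumentation"]
corpus_b = ["Reporting", "Reporting", "Reporting", "Argumentation"]

differences = compare(corpus_a, corpus_b)
for label, diff in differences.items():
    print(f"{label}: {diff:+.2f}")
```

A positive value means the function is relatively more frequent in the first corpus. Real Web corpora would need the per-document predictions of an actual classifier in place of the toy lists.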

Keywords: automatic genre identification, deep learning, interpreting neural networks

Article outline

1. Introduction
- 1.1 Text-external communicative functions
- 1.2 Text-internal linguistic features
2. Automatic genre identification
- 2.1 Text classification model
- 2.2 Datasets for training
- 2.3 Prediction accuracy
3. Comparing large Web corpora
4. Communicative functions vs linguistic features
- 4.1 Detection of linguistic features
- 4.2 Mapping linguistic features to functions
- 4.3 Linguistic features across languages
5. Related studies on computational analysis of genres
6. Conclusions and further work
Notes
References

Published online: 3 June 2021

