Book genre and author’s gender recognition based on titles: The example of the bibliographic corpus of microtexts

Pawłowski, Adam; Herden, Elżbieta; Walkowiak, Tomasz

doi:10.1075/cilt.356.15paw

Part of

Language and Text: Data, models, information and applications
Edited by Adam Pawłowski, Jan Mačutek, Sheila Embleton and George Mikros
[Current Issues in Linguistic Theory 356] 2021
pp. 225–238

Book genre and author’s gender recognition based on titles

The example of the bibliographic corpus of microtexts

Adam Pawłowski | University of Wrocław

Elżbieta Herden | University of Wrocław

Tomasz Walkowiak | Wrocław University of Science and Technology

This chapter applies automatic taxonomy methods to a corpus of microtexts consisting of book titles. We test two hypotheses. The first holds that a book’s genre (writing species) can be recognized automatically from its title alone. The second holds that the author’s gender can likewise be recognized from the title. The fastText and word2vec methods were applied. The analyses give a positive (and rather astonishing) result: with properly chosen n-grams, more than 70% of titles were assigned the correct genre, and the author’s gender was recognized with an accuracy of almost 80%. Both values significantly exceed chance level. The research was conducted on a corpus of titles drawn from the Polish national bibliography.
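To make the idea of classifying titles by n-grams concrete, the following is a minimal sketch rather than the authors' pipeline. It assumes character n-grams taken inside word-boundary markers (`<` and `>`), in the spirit of fastText subword features, and it swaps in a plain multinomial naive Bayes classifier so that it runs with the Python standard library alone. The abstract does not say which n-gram type or range was used, so `n_min` and `n_max` are assumptions, and the titles and genre labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(title, n_min=2, n_max=4):
    """Return character n-grams of each word, padded with < and > markers."""
    grams = []
    for word in title.lower().split():
        padded = f"<{word}>"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams (a stand-in, not fastText)."""

    def fit(self, titles, labels):
        self.label_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for title, label in zip(titles, labels):
            self.gram_counts[label].update(char_ngrams(title))
        # Vocabulary size is needed for add-one (Laplace) smoothing.
        self.vocab = {g for c in self.gram_counts.values() for g in c}
        return self

    def predict(self, title):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior of the class
            denom = sum(self.gram_counts[label].values()) + len(self.vocab)
            for gram in char_ngrams(title):
                score += math.log((self.gram_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy titles and genre labels, invented purely for illustration.
train_titles = [
    "Introduction to organic chemistry",
    "Handbook of statistical methods",
    "Principles of clinical medicine",
    "The secret of the old garden",
    "A summer of love and shadows",
    "The last night in the city",
]
train_labels = ["textbook"] * 3 + ["fiction"] * 3

model = NgramNaiveBayes().fit(train_titles, train_labels)
prediction = model.predict("Handbook of organic methods")
print(prediction)
```

The same scheme would apply to gender recognition by replacing the genre labels with author gender labels. With a real corpus, accuracy should be measured on held-out titles and compared with the chance level implied by the class distribution.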

Keywords: corpus linguistics, automatic taxonomy, gender recognition, book genre, fastText, word2vec, bibliography, Polish

Article outline

1. The problem
2. Data and research hypotheses
3. Methodology
4. Experiments and results
- 4.1 Recognizing the literary genre of the text
- 4.2 Automatic recognition of the author’s gender
5. Conclusions
Notes
References

Published online: 22 December 2021


