Book genre and author’s gender recognition based on titles
The example of the bibliographic corpus of microtexts
This chapter applies automatic taxonomy methods to a corpus of microtexts consisting of book titles. We test two hypotheses. The first is that a book's genre (type of writing) can be recognized automatically from its title alone. The second is that the author's gender can likewise be recognized from the title. The fastText and word2vec methods were applied. The analyses give a positive and rather surprising result: with properly chosen n-grams, more than 70% of titles were assigned the correct genre, and the author's gender was recognized with almost 80% accuracy. Both values significantly exceed chance level. The research was conducted on a corpus of titles derived from the Polish national bibliography.
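To make the idea of classifying titles by their n-grams concrete, the following is a minimal sketch and not the chapter's actual pipeline. It assumes character n-grams with fastText-style `<` and `>` word-boundary marks, and it swaps in a small multinomial naive Bayes classifier as a stand-in for fastText. The function and class names, the n-gram range, and the toy English titles and labels in the usage note are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
import math


def char_ngrams(title, n_min=2, n_max=4):
    """Return character n-grams of each word, with < and > as boundary marks.

    The n-gram range is an assumption; the abstract only says that the
    n-grams must be "properly chosen".
    """
    grams = []
    for word in title.lower().split():
        marked = f"<{word}>"
        for n in range(n_min, n_max + 1):
            grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


class NgramNaiveBayes:
    """Multinomial naive Bayes over n-gram counts (a stand-in, not fastText)."""

    def fit(self, titles, labels):
        self.label_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for title, label in zip(titles, labels):
            self.gram_counts[label].update(char_ngrams(title))
        self.vocab = {g for counts in self.gram_counts.values() for g in counts}
        return self

    def predict(self, title):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            grams = self.gram_counts[label]
            denom = sum(grams.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior of the class
            for g in char_ngrams(title):
                score += math.log((grams[g] + 1) / denom)  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)
```

A possible usage, with invented data: `NgramNaiveBayes().fit(["history of poland", "history of europe", "the dragon's song", "song of the night"], ["nonfiction", "nonfiction", "fiction", "fiction"]).predict("song of the dragon")`. Comparing a classifier's held-out accuracy with `majority_baseline` is one simple way to check that a reported figure exceeds what trivial guessing would achieve.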
Article outline
- 1. The problem
- 2. Data and research hypotheses
- 3. Methodology
- 4. Experiments and results
- 4.1 Recognizing the literary genre of the text
- 4.2 Automatic recognition of the author’s gender
- 5. Conclusions
- Notes
- References