Chapter published in:
Language and Text: Data, models, information and applications
Edited by Adam Pawłowski, Jan Mačutek, Sheila Embleton and George Mikros
[Current Issues in Linguistic Theory 356] 2021
pp. 225–238
Book genre and author’s gender recognition based on titles
The example of the bibliographic corpus of microtexts
Adam Pawłowski | University of Wrocław
Elżbieta Herden | University of Wrocław
Tomasz Walkowiak | Wrocław University of Science and Technology
The subject of this chapter is the application of automatic taxonomy methods to the corpus of microtexts, consisting of book titles. We test two hypotheses. The first one claims that simply on the basis of a book title one can automatically recognize its genre (writing species). The second assumes the possibility of recognizing the author’s gender on the basis of the book’s title. FastText and word2vec methods were applied. The analyses give a positive (and rather astonishing) result: with properly chosen n-grams more than 70% of titles could be correctly assigned a writing species, while the accuracy of the gender recognition of the author was almost 80%. Both values significantly exceed the levels of random recognition. The research was conducted on the corpus of titles derived from the Polish national bibliography.
Keywords: corpus linguistics, automatic taxonomy, gender recognition, book genre, fastText, word2vec, bibliography, Polish
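The idea of classifying a title from its n-grams can be illustrated with a minimal sketch. This is not the authors' implementation: the toy English titles and labels are invented, the n-gram range (2 to 4 characters) is an assumption, and a multinomial naive Bayes classifier is swapped in for fastText's linear model so that the example runs on the standard library alone. The `<` and `>` word-boundary markers imitate the way fastText builds subword n-grams; whether the chapter relied on character or word n-grams is not stated in the abstract.

```python
from collections import Counter, defaultdict
import math


def char_ngrams(title, n_min=2, n_max=4):
    """Return character n-grams of each word, with < and > marking word boundaries."""
    grams = []
    for word in title.lower().split():
        marked = f"<{word}>"
        for n in range(n_min, n_max + 1):
            grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams (stand-in for fastText)."""

    def fit(self, titles, labels):
        self.label_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for title, label in zip(titles, labels):
            self.gram_counts[label].update(char_ngrams(title))
        self.vocab = {g for counts in self.gram_counts.values() for g in counts}
        return self

    def predict(self, title):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.gram_counts[label]
            # Add-one smoothing so unseen n-grams do not zero out a class.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total)
            for gram in char_ngrams(title):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data; the chapter itself uses Polish national bibliography titles.
train_titles = [
    "Poems of the night",
    "Selected poems",
    "Handbook of chemistry",
    "Handbook of physics",
]
train_labels = ["poetry", "poetry", "handbook", "handbook"]

model = NgramNaiveBayes().fit(train_titles, train_labels)
print(model.predict("Handbook of biology"))
```

The same pipeline would apply to the author's gender task by replacing the genre labels with gender labels; only the target changes, not the feature extraction.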
Article outline
- 1. The problem
- 2. Data and research hypotheses
- 3. Methodology
- 4. Experiments and results
- 4.1 Recognizing the literary genre of the text
- 4.2 Automatic recognition of the author’s gender
- 5. Conclusions
- Notes
- References
Published online: 22 December 2021
https://doi.org/10.1075/cilt.356.15paw