Article published in: Directions in Empirical Literary Studies: In honor of Willie van Peer
Edited by Sonia Zyngier, Marisa Bortolussi, Anna Chesnokova and Jan Auracher
[Linguistic Approaches to Literature 5] 2008
pp. 175–191
Computationally Discriminating Literary from Non-Literary Texts
Three computational linguistic methods are presented for discriminating literary from non-literary texts. In the first study, hierarchical clustering of results obtained from Latent Semantic Analysis grouped literary texts separately from non-literary texts. The second study used the frequencies of shared bigrams across the text, resulting in a 100% correct classification of literary versus non-literary texts. The third study used unigrams, yielding a 94% correct classification of literary versus non-literary texts. The final two studies, which used a larger sample of texts, showed that the high classification performance cannot be attributed to specific texts. These findings provide evidence that literature can be distinguished from non-literature with high accuracy using relatively simple computational linguistic techniques.
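To make the idea of classifying by shared bigrams concrete, here is a minimal sketch in Python. It is not the procedure used in the chapter, whose details the abstract does not give. The scoring rule (fraction of a text's bigram tokens that also occur in a class's pooled bigram profile), the function names `bigrams`, `shared_bigram_score` and `classify`, and the two toy training sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def bigrams(text):
    """Return a Counter of adjacent word pairs in the lowercased text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))


def shared_bigram_score(a, b):
    """Fraction of the bigram tokens in `a` that are also present in `b`.

    Assumed scoring rule for this sketch: multiset overlap divided by
    the number of bigram tokens in `a`.
    """
    total = sum(a.values())
    if total == 0:
        return 0.0
    overlap = sum((a & b).values())  # Counter intersection keeps min counts
    return overlap / total


def classify(text, labelled_texts):
    """Return the label whose pooled bigram profile overlaps most with `text`."""
    profile = bigrams(text)
    pooled = {}
    for label, sample in labelled_texts:
        pooled.setdefault(label, Counter()).update(bigrams(sample))
    return max(pooled, key=lambda label: shared_bigram_score(profile, pooled[label]))


# Invented toy data, not taken from the study's corpus.
training = [
    ("literary", "the old man looked at the sea and the sea looked back"),
    ("non-literary", "the report shows that the interest rate rose in the second quarter"),
]
label = classify("the old man walked to the sea", training)
```

With real corpora, each class would be pooled from many texts, and accuracy would be estimated on held-out texts rather than on a single sentence.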
Keywords: bigram analysis, classification techniques, computational linguistics, genre, latent semantic analysis, stylistics
Published online: 15 May 2008