Edited by Giovanni Parodi †
[Studies in Corpus Linguistics 40] 2010
pp. 121–141
Chapter 7. Automatic text classification of disciplinary texts
This study classifies the academic texts in the PUCV-2006 Corpus of Spanish using two automatic classification methods and compares their performance. Both methods, Multinomial Naive Bayes and Support Vector Machine, rely on lexical-semantic content words shared across the corpus of academic texts. Each identifies a small group of shared words that, according to their statistical weights, help assign a new text to one of the four disciplinary areas represented in the corpus. The results show that Support Vector Machine classifies academic texts efficiently. Using this method, we were able to automatically identify the disciplinary domain of an academic text from a reduced number of shared content lexemes, with high performance even in highly refined disciplines such as Psychology and Social Work.
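To make the idea of classifying by weighted shared words concrete, the following is a minimal sketch of a Multinomial Naive Bayes classifier with add-one smoothing. It is not the chapter's implementation: the function names, the smoothing parameter `alpha`, and the tiny English toy documents are all hypothetical and are not drawn from the PUCV-2006 Corpus. Only two of the four disciplinary areas are illustrated.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, alpha=1.0):
    """Fit a multinomial Naive Bayes model.

    docs  : list of (tokens, label) pairs (hypothetical input format).
    alpha : additive smoothing constant (assumed; 1.0 is add-one smoothing).
    """
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = sum(class_doc_counts.values())
    model = {"log_prior": {}, "log_likelihood": {}}
    for label, n in class_doc_counts.items():
        # Class prior: share of training documents carrying this label.
        model["log_prior"][label] = math.log(n / n_docs)
        # Smoothed per-word weights: log P(word | class).
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            w: math.log((word_counts[label][w] + alpha) / total) for w in vocab
        }
    return model


def classify(model, tokens):
    """Return the label with the highest log-posterior score."""
    scores = {}
    for label, prior in model["log_prior"].items():
        weights = model["log_likelihood"][label]
        # Words outside the training vocabulary are ignored.
        scores[label] = prior + sum(weights[w] for w in tokens if w in weights)
    return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
TOY_DOCS = [
    ("therapy cognition behaviour patient".split(), "psychology"),
    ("cognition memory patient test".split(), "psychology"),
    ("community family welfare intervention".split(), "social_work"),
    ("welfare policy family community".split(), "social_work"),
]

model = train_multinomial_nb(TOY_DOCS)
print(classify(model, "memory therapy patient".split()))
```

The per-class word weights stored in `log_likelihood` play the role of the statistical weights mentioned above: a word that is frequent in one discipline's texts and rare in another's shifts the score toward the first. A Support Vector Machine would instead learn a separating boundary over the same word-count features; that method is not sketched here.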