Be positive: Combining DocuScope with non-negative matrix factorization for topic discovery

Omizo, Ryan M.

doi:10.1075/scl.109.08omi

Part of

Corpora and Rhetorically Informed Text Analysis: The diverse applications of DocuScope
Edited by David West Brown and Danielle Zawodny Wetzel
[Studies in Corpus Linguistics 109] 2023
pp. 167–189

Ryan M. Omizo | Temple University

This chapter proposes a novel method that uses non-negative matrix factorization to extract topic models from texts. This topic modeling process reveals how terms and DocuScope Language Action Types (LATs) align, showing both what texts are about and how they are organized rhetorically. Moreover, because the topics are non-negative, each derived topic can be read as a sum of topical features, which greatly eases interpretation. To illustrate and benchmark the method, I apply it to the well-known 20 Newsgroups dataset and examine a sample of the results.
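To make the point about additive, non-negative topics concrete, the following is a minimal sketch and not the chapter's implementation. It assumes a toy document-by-feature count matrix whose columns mix terms with placeholder LAT columns (`LAT_A` and `LAT_B` are hypothetical labels, not actual DocuScope categories), and it uses the standard multiplicative-update rule for the squared-error objective in plain Python. The feature weighting, solver, and number of topics used in the chapter are not stated in this summary, so every such choice below is an assumption.

```python
import random

# Hypothetical feature columns: four terms plus two placeholder LAT labels.
FEATURES = ["engine", "car", "LAT_A", "orbit", "launch", "LAT_B"]

# Toy document-by-feature counts (rows are documents); invented for illustration.
V = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 2, 3, 2],
]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(v, k, iters=500, seed=0, eps=1e-9):
    """Factor v (n x m) into non-negative w (n x k) and h (k x m)
    using multiplicative updates for the squared Frobenius error."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive starting values keep every later update non-negative.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

def frobenius_error(v, w, h):
    """Frobenius norm of the residual v - w h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2
               for rv, ra in zip(v, approx)
               for x, y in zip(rv, ra)) ** 0.5

W, H = nmf(V, k=2)

# Each row of H is a topic: a non-negative weight for every feature.
for t, row in enumerate(H):
    top = sorted(zip(row, FEATURES), reverse=True)[:3]
    print(f"topic {t}:", [(name, round(weight, 2)) for weight, name in top])

print("reconstruction error:", round(frobenius_error(V, W, H), 3))
```

Because no entry of `W` or `H` can go negative, each document row of `V` is approximated by adding weighted topics together, never by subtracting one from another. This is the property that lets a topic be read as a sum of its term and LAT features.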

Keywords: topic modeling, non-negative matrix factorization, latent Dirichlet allocation, principal component analysis, factor analysis, computational text analysis, digital humanities, rhetoric, text mining, rhetorical primers

Article outline

1. Introduction
2. Non-negative matrix factorization for topic modeling
3. Methodology
- 3.1 Data and methods
4. Results and discussion
5. Conclusions
Notes
References

Published online: 29 June 2023
