Be positive
Combining DocuScope with non-negative matrix factorization for
topic discovery
This chapter proposes a novel method that combines DocuScope with
non-negative matrix factorization (NMF) to extract topic models from
texts. The resulting topics reveal how terms and DocuScope
Language Action Types (LATs) align, providing robust
information on what texts are about and how they are organized
rhetorically. Moreover, because the topics are non-negative,
each derived topic can be viewed as a sum of topical features,
which can greatly ease the interpretive process. To elucidate and
benchmark this method, I apply it to the well-known 20
Newsgroups dataset and sample the results.
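To make the additive reading of topics concrete, the following is a minimal sketch of non-negative matrix factorization using the multiplicative update rules of Lee and Seung. It is not the chapter's own implementation. The toy document-feature matrix, the feature names, and the two LAT placeholders are assumptions made up for illustration, and the helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`) are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of the residual V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x features) into W (docs x topics) and H (topics x features)."""
    rng = random.Random(seed)
    n_docs, n_feats = len(v), len(v[0])
    # Strictly positive random start, so every update stays non-negative.
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_feats)] for _ in range(n_topics)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_feats)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors

# Hypothetical features: five terms plus two placeholder LAT columns.
features = ["space", "orbit", "nasa", "hockey", "team",
            "LAT_placeholder_1", "LAT_placeholder_2"]
# Invented counts for four toy documents (rows) over the features above.
v = [
    [3, 2, 2, 0, 0, 2, 0],
    [2, 3, 1, 0, 0, 3, 0],
    [0, 0, 0, 3, 2, 0, 2],
    [0, 0, 0, 2, 3, 0, 3],
]

w, h, errors = nmf(v, n_topics=2)

# Each topic is a non-negative weighting over terms and LAT columns alike.
for k, row in enumerate(h):
    ranked = sorted(zip(features, row), key=lambda pair: pair[1], reverse=True)
    print(f"topic {k}:", [name for name, _ in ranked[:3]])
```

Because every entry of `w` and `h` is non-negative, each document is approximated as a sum of topics and each topic as a sum of weighted features, with no cancellation between positive and negative loadings. Placing terms and LAT columns in the same matrix is one plausible way to let a topic show both kinds of feature side by side; the chapter's actual feature construction may differ.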
Article outline
- 1. Introduction
- 2. Non-negative matrix factorization for topic modeling
- 3. Methodology
- 4. Results and discussion
- 5. Conclusions
- Notes
- References