Chapter 3. Medical topics and style from 1500 to 2018: A corpus-driven exploration

Schneider, Gerold

doi:10.1075/pbns.330.03sch

Part of

Corpus Pragmatic Studies on the History of Medical Discourse
Edited by Turo Hiltunen and Irma Taavitsainen
[Pragmatics & Beyond New Series 330] 2022
pp. 49–78


Gerold Schneider | University of Zurich

This chapter investigates changes in medical topics, style and language across 500 years, from 1500 to 2018. To do so, we employ data-driven methods of Computational Linguistics and Digital Humanities: document classification, topic modelling, and automatically constructed conceptual maps. We trace changes from traditional thinking in the scholastic period to empirical methods, professionalised medicine, and finally the increasing importance of data, statistics and clinical studies, away from symptom-centred medicine. We conclude that medical discourse has undergone radical changes and that data-driven methods reflect these changes and offer an unprecedented overview. We also critically discuss shortcomings of our data and methods.

Keywords: data-driven approaches, machine learning, collocations, Topic Modelling, history of medicine, Digital Humanities, conceptual maps, Kernel Density Estimation, automated content analysis, English medical discourse, language and health, culturomics

Article outline

1. Introduction
2. Motivation
- 2.1 Systematic comparison of all lexical features
- 2.2 Advanced computational methods
- 2.3 Sampling and representativeness
3. Materials
- 3.1 CEEM
- 3.2 ARCHER Medical
- 3.3 HIMERA
- 3.4 PubMed Excerpt
- 3.5 Overview of the complete data of our investigation
- 3.6 Limitations of the data
4. Methods
- 4.1 Data preparation
- 4.2 Supervised document classification
- 4.3 Unsupervised topic modelling
- 4.4 Unsupervised Conceptual Maps with Kernel Density Estimation
5. Results
- 5.1 Results of supervised document classification
- 5.2 Results of unsupervised topic modelling
- 5.3 Results of Unsupervised Conceptual Maps with Kernel Density Estimation
6. Conclusion and future prospects
Acknowledgements
Notes
References

Published online: 1 July 2022

https://doi.org/10.1075/pbns.330.03sch

