A data-driven approach to finding significant changes in language use through time series analysis

Kehoe, Andrew; Gee, Matt; Renouf, Antoinette

doi:10.1075/scl.105.10keh

Part of

Broadening the Spectrum of Corpus Linguistics: New approaches to variability and change
Edited by Susanne Flach and Martin Hilpert
[Studies in Corpus Linguistics 105] 2022
pp. 284–317

Andrew Kehoe | Birmingham City University

Matt Gee | Birmingham City University

Antoinette Renouf | Birmingham City University

This paper conducts a diachronic study of language change in a corpus covering almost 30 years of mainstream UK news text. In our previous studies, several databases were compiled from the corpus, including diachronic records of word frequency, collocation and morphological analysis. Upon user enquiry, our WebCorp Linguist’s Search Engine produced tailored output from these resources. The system was therefore passive, requiring a word or phrase to be specified before querying the databases. The aim now is to extend the data-driven functionality to track the frequency of words in the corpus across time automatically and alert users to statistically significant change patterns. Three tests are employed to find upward and downward trends, sudden jumps in frequency, and seasonal variation.

Keywords: diachrony, language change, variation, statistics, methods

Article outline

1. Introduction
2. Data and methods
- 2.1 Trends
- 2.2 Seasonality
- 2.3 Sudden jumps (for otherwise rare words)
- 2.4 Seasonal and trend decomposition using Loess (STL)
3. Results and discussion
- 3.1 Seasonality test
- 3.2 Trend test
- 3.3 Sudden jumps test
4. Conclusions and future work
Notes
References

Published online: 10 November 2022

https://doi.org/10.1075/scl.105.10keh
