A data-driven approach to finding significant changes in language use through time series analysis
This paper conducts a diachronic study of language change in a corpus covering almost 30 years of mainstream UK news text. In our previous studies, several databases were compiled from the corpus, including diachronic records of word frequency, collocation and morphological analysis. Upon user enquiry, our WebCorp Linguist’s Search Engine produced tailored output from these resources. The system was therefore passive, requiring a word or phrase to be specified before querying the databases. The aim now is to extend the data-driven functionality to track the frequency of words in the corpus across time automatically and alert users to statistically significant change patterns. Three tests are employed to find upward and downward trends, sudden jumps in frequency, and seasonal variation.
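As an illustration of what a seasonal-variation test on word frequencies can look like, the sketch below applies Roger's (1977) significance test for cyclic trends to the monthly counts of a single word pooled across years. This is an assumption made for illustration only: the abstract does not state which statistic the system uses, the function names `roger_statistic` and `seasonal_p_value` are hypothetical, and the sketch ignores differences in corpus size between months, which a real analysis would need to normalise for.

```python
import math


def roger_statistic(counts):
    """Roger's R for counts in k equal time sectors (e.g. 12 months).

    Under the null hypothesis of no cyclic pattern, R is approximately
    chi-square distributed with 2 degrees of freedom.
    """
    k = len(counts)
    total = sum(counts)
    if k < 3 or total <= 0:
        raise ValueError("need at least 3 sectors and a positive total count")
    # Project the counts onto one sine and one cosine cycle over the period.
    sin_sum = sum(n * math.sin(2 * math.pi * i / k) for i, n in enumerate(counts))
    cos_sum = sum(n * math.cos(2 * math.pi * i / k) for i, n in enumerate(counts))
    return 2.0 * (sin_sum ** 2 + cos_sum ** 2) / total


def seasonal_p_value(counts):
    """Upper-tail p-value of R; for chi-square with 2 d.f. this is exp(-R/2)."""
    return math.exp(-roger_statistic(counts) / 2.0)


# Hypothetical monthly counts (January..December) for a winter-peaking word.
winter_word = [50, 40, 20, 10, 5, 5, 5, 5, 10, 20, 40, 50]
# Hypothetical word with no seasonal pattern.
flat_word = [10] * 12

print(seasonal_p_value(winter_word))  # very small: strong evidence of a cycle
print(seasonal_p_value(flat_word))    # close to 1: no evidence of a cycle
```

A word would be flagged as seasonal when its p-value falls below a chosen threshold. Because every word in the corpus is tested, that threshold would normally be corrected for multiple comparisons.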
Article outline
- 1. Introduction
- 2. Data and methods
- 2.1 Trends
- 2.2 Seasonality
- 2.3 Sudden jumps (for otherwise rare words)
- 2.4 Seasonal and trend decomposition using Loess (STL)
- 3. Results and discussion
- 3.1 Seasonality test
- 3.2 Trend test
- 3.3 Sudden jumps test
- 4. Conclusions and future work
- Notes
- References