A data-driven approach to finding significant changes in language use through time series analysis
This paper conducts a diachronic study of language change in a corpus covering almost 30 years of mainstream UK news text. In our previous studies, several databases were compiled from the corpus, including diachronic records of word frequency, collocation and morphological analysis. Upon user enquiry, our WebCorp Linguist’s Search Engine produced tailored output from these resources. The system was therefore passive, requiring a word or phrase to be specified before querying the databases. The aim now is to extend the data-driven functionality to track the frequency of words in the corpus across time automatically and alert users to statistically significant change patterns. Three tests are employed to find upward and downward trends, sudden jumps in frequency, and seasonal variation.
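As a rough illustration of what an automatic trend alert could look like, the sketch below normalises raw counts to frequency per million tokens and applies a Mann-Kendall test for a monotonic trend. The abstract does not name the tests used in the paper, so Mann-Kendall is a stand-in chosen for illustration; the function names, the per-million normalisation and the `alpha` threshold are all assumptions.

```python
import math
from collections import Counter


def per_million(counts, totals):
    """Relative frequency per million tokens for each time slice."""
    return [1e6 * c / t for c, t in zip(counts, totals)]


def mann_kendall(series):
    """Return (S, z, two-sided p) for a monotonic trend in `series`.

    Illustrative stand-in only; not necessarily the test used in the paper.
    """
    n = len(series)
    # S sums the signs of all pairwise differences with i < j.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    # Variance of S under the no-trend null, corrected for tied values.
    ties = Counter(series).values()
    var = (n * (n - 1) * (2 * n + 5)
           - sum(t * (t - 1) * (2 * t + 5) for t in ties)) / 18
    if var == 0:
        return s, 0.0, 1.0
    # Continuity-corrected normal approximation.
    if s > 0:
        z = (s - 1) / math.sqrt(var)
    elif s < 0:
        z = (s + 1) / math.sqrt(var)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p


def classify_trend(counts, totals, alpha=0.01):
    """Label a word's frequency series as upward, downward or neither."""
    s, z, p = mann_kendall(per_million(counts, totals))
    if p >= alpha:
        return "no significant trend"
    return "upward" if z > 0 else "downward"
```

Here `counts` would hold a word's occurrences per time slice (for example, per month) and `totals` the corpus size for the same slices. The test assumes independent observations, so a series with strong seasonality or autocorrelation would need separate handling before a result is reported as significant.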
Article outline
- 1. Introduction
- 2. Data and methods
  - 2.1 Trends
  - 2.2 Seasonality
  - 2.3 Sudden jumps (for otherwise rare words)
  - 2.4 Seasonal and trend decomposition using Loess (STL)
- 3. Results and discussion
  - 3.1 Seasonality test
  - 3.2 Trend test
  - 3.3 Sudden jumps test
- 4. Conclusions and future work
- Notes
- References