Key words when text forms the unit of study
Sizing up the effects of different measures
Throughout the social sciences, there has been growing pressure to report effect sizes when publishing empirical data (see American Psychological Association, 2001; Parsons & Nelson, 2004). While it seems indisputable that effect size is an essential element of statistical analysis for the majority of quantitative research foci, this paper argues that, for key word analysis in corpus linguistics specifically, the means of reporting effect size must depend on the unit of study of each investigation (single text, collection or large corpus). After exploring some of the main criticisms of the log-likelihood measure, this paper unpacks the parameters of different keyness measures and how they might address the underlying concerns. It maintains that, for the exploration of foregrounded/deviant/salient/marked features in text, ranking results by log-likelihood score is still fit for purpose and, coupled with Bayes Factors, is a solid approach for key word analyses.
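As a rough illustration of the measures named in the abstract, the sketch below computes a log-likelihood (LL) score, %DIFF and a BIC-style approximate Bayes Factor for a single candidate key word. It follows the formulas as they are commonly stated in the literature (Rayson & Garside, 2000; Gabrielatos & Marchi, 2012; Wilson, 2013); the article's own formulations may differ. The function names, the argument order and the handling of a zero reference frequency are assumptions made for this example only.

```python
import math


def log_likelihood(a, b, c, d):
    """Two-corpus log-likelihood for one word.

    a, b: frequency of the word in the study and reference corpus
    c, d: total token counts of the study and reference corpus
    """
    # Expected frequencies if the word were spread evenly across both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def percent_diff(a, b, c, d):
    """%DIFF: percentage difference between normalised frequencies."""
    nf_study = a / c
    nf_ref = b / d
    if nf_ref == 0:
        # Assumption of this sketch: report infinity when the word is
        # absent from the reference corpus
        return math.inf
    return (nf_study - nf_ref) * 100 / nf_ref


def bic_approx(ll, c, d, df=1):
    """BIC-style approximate Bayes Factor: LL minus df * ln(N)."""
    return ll - df * math.log(c + d)


if __name__ == "__main__":
    # Hypothetical counts: 20 hits in a 1,000-token text,
    # 10 hits in a 1,000-token reference corpus
    ll = log_likelihood(20, 10, 1000, 1000)
    print("LL:", round(ll, 3))
    print("%DIFF:", round(percent_diff(20, 10, 1000, 1000), 1))
    print("BIC:", round(bic_approx(ll, 1000, 1000), 3))
```

With counts this small, the BIC value comes out negative even though the word is twice as frequent in the study text, which hints at why the choice of measure may depend on the size of the unit of study.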
Article outline
- 1. Introduction
- 2. Analysis
- 2.1 Defining keyness
- 2.2 Two measures of keyness: LL and %DIFF
- 2.3 Determining appropriate measures for keyness
- 2.4 Parameters used in different measures
- 2.5 Rank frequency distributions of Candidate KWs
- 3. Implications
- 4. Conclusion
- Acknowledgements
- Notes
- References
References
American Psychological Association. (2001). Publication Manual of the American Psychological Association (5th ed.). American Psychological Association.
Anthony, L. (2019). AntConc (Version 3.5.8) [Computer software]. Waseda University. [URL]
Baker, P. (2004). Querying keywords: Questions of difference, frequency, and sense in keywords analysis. Journal of English Linguistics, 32(4), 346–359.
Baker, P., Gabrielatos, C., Khosravinik, M., Krzyżanowski, M., McEnery, T., & Wodak, R. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society, 19(3), 273–306.
Bradley, J. V. (1960). Distribution-free Statistical Tests. Air Research and Development Command.
Brezina, V., McEnery, T., & Wattam, S.
Cobb, T. (2000). The Compleat Lexical Tutor (Version 8.3) [Computer software]. Retrieved November, 2019, from [URL]
Croft, W. B., Metzler, D., & Strohman, T. (2010). Search Engines: Information Retrieval in Practice. Addison-Wesley.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
Egbert, J., & Biber, D. (2019). Incorporating text dispersion into keyword analyses. Corpora, 14(1), 77–104.
Gabrielatos, C. (2018). Keyness analysis: Nature, metrics and techniques. In C. Taylor & A. Marchi (Eds.), Corpus Approaches to Discourse: A Critical Review. Routledge.
Gabrielatos, C., & Marchi, A. (2012). Keyness: Appropriate metrics and practical issues [Paper presentation]. CADS International Conference 2012, University of Bologna, Italy. [URL]
Gabrielatos, C., Torgersen, E. N., Hoffmann, S., & Fox, S. (2010). A corpus-based sociolinguistic study of indefinite article forms in London English. Journal of English Linguistics, 38(4), 297–334.
Grissom, R. J., & Kim, J. J. (2012). Effect Sizes for Research: Univariate and Multivariate Applications. Routledge.
Hardie, A. (2014a). Log Ratio – an informal introduction. ESRC Centre for Corpus Approaches to Social Science (CASS). [URL]
Hardie, A. (2014b). Statistical identification of keywords, lockwords and collocations as a two-step procedure [Paper presentation]. ICAME 35 Conference, University of Nottingham, Nottingham, UK.
Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. Routledge.
Johnston, J. E., Berry, K. J., & Mielke Jr, P. W. (2006). Measures of effect size for chi-squared and likelihood-ratio goodness-of-fit tests. Perceptual and Motor Skills, 103(2), 412–414.
Kass, R. E., & Raftery, A. E. (1995). Bayes Factors. Journal of the American Statistical Association, 90(430), 773.
Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The Sketch Engine [Paper presentation]. The 2003 International Conference on Natural Language Processing and Knowledge Engineering, Beijing, China.
Lee, D. Y. W. (2001). Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology, 5(3), 37–72.
Leech, G. N., Hundt, M., Mair, C., & Smith, N. (2009). Change in Contemporary English: A Grammatical Study. Cambridge University Press.
Leech, G. N., & Short, M. H. (2007). Style in Fiction: A Linguistic Introduction to English Fictional Prose (2nd ed.). Pearson Longman. (Original work published 1981)
Lexical Computing Ltd. (2014). Statistics used in the Sketch Engine. [URL]
Mahlberg, M. (2013). Corpus Stylistics and Dickens’s Fiction. Routledge.
Mahlberg, M., Stockwell, P., de Joode, J., Smith, C., & O’Donnell, M. B. (2016). CLiC Dickens: Novel uses of concordances for the integration of corpus stylistics and cognitive poetics. Corpora, 11(3), 433–463.
Oakes, M. P. (1998). Statistics for Corpus Linguistics. Edinburgh University Press.
Parsons, T. D., & Nelson, N. W. (2004). Paradigm shift in social science research: A significance testing and effect size estimation rapprochement? PsycCRITIQUES, 491(Suppl 3).
Partington, A. (2010). Modern Diachronic Corpus-Assisted Discourse Studies (MD-CADS) on UK newspapers: An overview of the project. Corpora, 5(2), 83–108.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912.
Raftery, A. E. (1986). A note on Bayes Factors for Log-Linear contingency table models with vague prior information. Journal of the Royal Statistical Society. Series B (Methodological), 48(2), 249–250.
Rayson, P. (n.d.). UCREL Log-likelihood and effect size calculator. Retrieved November, 2019, from [URL]
Rayson, P., Berridge, D., & Francis, B. (2004). Extending the Cochran rule for the comparison of word frequencies between corpora [Paper presentation]. The 7th International Conference on Statistical Analysis of Textual Data, Louvain-la-Neuve, Belgium. [URL]
Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling [Paper presentation]. The Workshop on Comparing Corpora, Hong Kong University of Science and Technology, Hong Kong. [URL]
Rayson, P., Leech, G., & Hodges, M.
Read, T. R. C., & Cressie, N. A. C. (1988). Goodness-of-fit Statistics for Discrete Multivariate Data. Springer.
Scott, M. (1997). PC analysis of key words – and key key words. System, 25(2), 233–245.
Scott, M. (2016). WordSmith Tools (Version 7.0) [Computer software]. Stroud: Lexical Analysis Software.
Scott, M. (2019a). WordSmith Tools online manual “KeyWords: Calculation”. Retrieved November, 2019, from [URL]
Scott, M. (2019b). WordSmith Tools online manual “KeyWords”. Retrieved November, 2019, from [URL]
Scott, M. (2019c). WordSmith Tools online manual “KeyWords: Thinking about keyness”. Retrieved November, 2019, from [URL]
Scott, M. (2019d). WordSmith Tools online manual “KeyWords: Keyness definition”. Retrieved November, 2019, from [URL]
Wilson, A. (2013). Embracing Bayes Factors for key item analysis in corpus linguistics. In M. Bieswanger & A. Koll-Stobbe (Eds.), New Approaches to the Study of Linguistic Variability (pp. 3–12). Peter Lang.
Zipf, G. K. (1935). The Psycho-Biology of Language: An Introduction to Dynamic Philology. Houghton Mifflin.