Key words when text forms the unit of study: Sizing up the effects of different measures

Jeaco, Stephen

doi:10.1075/ijcl.18053.jea

Article published in:

International Journal of Corpus Linguistics
Vol. 25:2 (2020) ► pp. 125–155

Stephen Jeaco | Xi’an Jiaotong-Liverpool University

Throughout the social sciences, there has been growing pressure to present effect sizes when publishing empirical data (see American Psychological Association, 2001; Parsons & Nelson, 2004). While it seems indisputable that for the majority of quantitative research foci effect size is an essential element of statistical analysis, this paper argues that, specifically for key word analysis in corpus linguistics, the means of reporting effect size must depend on the level of the unit of study of each investigation (single text, collection or large corpus). After exploring some main criticisms of the log-likelihood measure, this paper unpacks the parameters of different measures for keyness and how they might address underlying concerns. It maintains that, for the exploration of foregrounded/deviant/salient/marked features in text, the use of log-likelihood scores to rank the results is still fit for purpose and, coupled with Bayes Factors, is a solid approach for key word analyses.

Keywords: keyness, effect size, key word analysis, log-likelihood, ranking
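
The measures named in the abstract can be illustrated with a short sketch. The code below is not taken from the article; it assumes the commonly cited forms of the two-cell log-likelihood statistic (as in Rayson & Garside, 2000), %DIFF (as in Gabrielatos & Marchi, 2012) and the BIC approximation to the Bayes Factor (as in Wilson, 2013). The function names and the example counts are hypothetical.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-cell log-likelihood for one word in a study vs. a reference corpus."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected frequencies if the word were equally likely in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def percent_diff(freq_study, size_study, freq_ref, size_ref):
    """%DIFF: percentage difference between normalised frequencies."""
    nf_study = freq_study / size_study
    nf_ref = freq_ref / size_ref
    if nf_ref == 0:
        # Assumption: report infinity rather than substituting a small constant.
        return math.inf
    return (nf_study - nf_ref) * 100 / nf_ref


def bic_bayes_factor(ll, size_study, size_ref, df=1):
    """BIC approximation to the Bayes Factor: LL - df * ln(N)."""
    return ll - df * math.log(size_study + size_ref)


# Hypothetical counts: a word occurring 20 times in a 1,000-token text
# and 100 times in a 10,000-token reference corpus.
ll = log_likelihood(20, 1000, 100, 10000)
diff = percent_diff(20, 1000, 100, 10000)
bic = bic_bayes_factor(ll, 1000, 10000)
print(ll, diff, bic)
```

In this sketch the log-likelihood score and the BIC value both depend on the absolute counts and corpus sizes, whereas %DIFF depends only on the ratio of the normalised frequencies, so the two kinds of measure can rank the same candidate key words differently.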

Article outline

1. Introduction
2. Analysis
- 2.1 Defining keyness
- 2.2 Two measures of keyness: LL and %DIFF
- 2.3 Determining appropriate measures for keyness
- 2.4 Parameters used in different measures
- 2.5 Rank frequency distributions of Candidate KWs
3. Implications
4. Conclusion
Acknowledgements
Notes
References

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

For any use beyond this license, please contact the publisher.

Published online: 28 August 2020

https://doi.org/10.1075/ijcl.18053.jea

References (48)

Anthony, L. (2019). AntConc (Version 3.5.8) [Computer software]. Waseda University. [URL]

American Psychological Association. (2001). Publication Manual of the American Psychological Association (5th ed.). American Psychological Association.

Baker, P. (2004). Querying keywords: Questions of difference, frequency, and sense in keywords analysis. Journal of English Linguistics, 32(4), 346–359.

Baker, P., Gabrielatos, C., Khosravinik, M., Krzyżanowski, M., McEnery, T., & Wodak, R. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society, 19(3), 273–306.

Bradley, J. V. (1960). Distribution-free Statistical Tests. Air Research and Development Command.

Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139–173.

Cobb, T. (2000). The Compleat Lexical Tutor (Version 8.3) [Computer software]. Retrieved November, 2019, from [URL]

Croft, W. B., Metzler, D., & Strohman, T. (2010). Search Engines: Information Retrieval in Practice. Addison-Wesley.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.

Egbert, J., & Biber, D. (2019). Incorporating text dispersion into keyword analyses. Corpora, 14(1), 77–104.

Gabrielatos, C. (2018). Keyness analysis: Nature, metrics and techniques. In C. Taylor & A. Marchi (Eds.), Corpus Approaches to Discourse: A Critical Review. Routledge.

Gabrielatos, C., & Marchi, A. (2012). Keyness: Appropriate metrics and practical issues [Paper presentation]. CADS International Conference 2012, University of Bologna, Italy. [URL]

Gabrielatos, C., Torgersen, E. N., Hoffmann, S., & Fox, S. (2010). A corpus-based sociolinguistic study of indefinite article forms in London English. Journal of English Linguistics, 38(4), 297–334.

Grissom, R. J., & Kim, J. J. (2012). Effect Sizes for Research: Univariate and Multivariate Applications. Routledge.

Hardie, A. (2012). CQPweb: Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409.

Hardie, A. (2014a). Log Ratio – an informal introduction. ESRC Centre for Corpus Approaches to Social Science (CASS). [URL]

Hardie, A. (2014b). Statistical identification of keywords, lockwords and collocations as a two-step procedure [Paper presentation]. ICAME 35 Conference, University of Nottingham, Nottingham, UK.

Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. Routledge.

Jeaco, S. (2017). Concordancing lexical primings: The rationale and design of a user-friendly corpus tool for English language teaching and self-tutoring based on the Lexical Priming theory of language. In M. Pace-Sigge & K. J. Patterson (Eds.), Lexical Priming: Applications and Advances. John Benjamins.

Johnston, J. E., Berry, K. J., & Mielke Jr, P. W. (2006). Measures of effect size for chi-squared and likelihood-ratio goodness-of-fit tests. Perceptual and Motor Skills, 103(2), 412–414.

Kass, R. E., & Raftery, A. E. (1995). Bayes Factors. Journal of the American Statistical Association, 90(430), 773.

Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The Sketch Engine [Paper presentation]. The 2003 International Conference on Natural Language Processing and Knowledge Engineering, Beijing, China.

Lee, D. Y. W. (2001). Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology, 5(3), 37–72.

Leech, G. N., Hundt, M., Mair, C., & Smith, N. (2009). Change in Contemporary English: A Grammatical Study. Cambridge University Press.

Leech, G. N., & Short, M. H. (2007). Style in Fiction: A Linguistic Introduction to English Fictional Prose (2nd ed.). Pearson Longman. (Original work published 1981)

Lexical Computing Ltd. (2014). Statistics used in the Sketch Engine. [URL]

Mahlberg, M. (2013). Corpus Stylistics and Dickens’s Fiction. Routledge.

Mahlberg, M., Stockwell, P., de Joode, J., Smith, C., & O’Donnell, M. B. (2016). CLiC Dickens: Novel uses of concordances for the integration of corpus stylistics and cognitive poetics. Corpora, 11(3), 433–463.

Oakes, M. P. (1998). Statistics for Corpus Linguistics. Edinburgh University Press.

Parsons, T. D., & Nelson, N. W. (2004). Paradigm shift in social science research: A significance testing and effect size estimation rapprochement? PsycCRITIQUES, 491(Suppl 3).

Partington, A. (2010). Modern Diachronic Corpus-Assisted Discourse Studies (MD-CADS) on UK newspapers: An overview of the project. Corpora, 5(2), 83–108.

Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912.

Raftery, A. E. (1986). A note on Bayes Factors for Log-Linear contingency table models with vague prior information. Journal of the Royal Statistical Society. Series B (Methodological), 48(2), 249–250.

Rayson, P. (n.d.). UCREL Log-likelihood and effect size calculator. Retrieved November, 2019, from [URL]

Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics, 13(4), 519–549.

Rayson, P., Berridge, D., & Francis, B. (2004). Extending the Cochran rule for the comparison of word frequencies between corpora [Paper presentation]. The 7th International Conference on Statistical Analysis of Textual Data, Louvain-la-Neuve, Belgium. [URL]

Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling [Paper presentation]. The Workshop on Comparing Corpora, Hong Kong University of Science and Technology, Hong Kong. [URL]

Rayson, P., Leech, G., & Hodges, M. (1997). Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics, 2(1), 133–152.

Read, T. R. C., & Cressie, N. A. C. (1988). Goodness-of-fit Statistics for Discrete Multivariate Data. Springer.

Scott, M. (1997). PC analysis of key words – and key key words. System, 25(2), 233–245.

Scott, M. (2016). WordSmith Tools (Version 7.0) [Computer software]. Stroud: Lexical Analysis Software.

Scott, M. (2019a). WordSmith Tools online manual “KeyWords: Calculation”. Retrieved November, 2019, from [URL]

Scott, M. (2019b). WordSmith Tools online manual “KeyWords”. Retrieved November, 2019, from [URL]

Scott, M. (2019c). WordSmith Tools online manual “KeyWords: Thinking about keyness”. Retrieved November, 2019, from [URL]

Scott, M. (2019d). WordSmith Tools online manual “KeyWords: Keyness definition”. Retrieved November, 2019, from [URL]

Scott, M., & Tribble, C. (2006). Textual Patterns: Key Words and Corpus Analysis in Language Education. John Benjamins.

Wilson, A. (2013). Embracing Bayes Factors for key item analysis in corpus linguistics. In M. Bieswanger & A. Koll-Stobbe (Eds.), New Approaches to the Study of Linguistic Variability (pp. 3–12). Peter Lang.

Zipf, G. K. (1935). The Psycho-Biology of Language: An Introduction to Dynamic Philology. Houghton Mifflin.

Cited by (6)

Ballance, Oliver J. & Averil Coxhead

2024. Corpus Analysis of Vocabulary. In The Encyclopedia of Applied Linguistics, ► pp. 1 ff.

Gillings, Mathew, Gerlinde Mautner & Paul Baker

2023. Corpus-Assisted Discourse Studies.

Malory, Beth

2023. Locating the ‘Age of Prescriptivism’ in Late Modern periodical reviews: a corpus-assisted discourse analytic approach. Journal of Historical Sociolinguistics 9:2 ► pp. 263 ff.

Jeaco, Stephen

2020. DIY Needs Analysis and Specific Text Types: Using The Prime Machine to Explore Vocabulary in Readymade and Homemade English Corpora. In Vocabulary in Curriculum Planning, ► pp. 199 ff.

Jeaco, Stephen

2023. How can we communicate (visually) what we (usually) mean by collocation and keyness?. Journal of Second Language Studies 6:1 ► pp. 29 ff.

[no author supplied]

2023. Language and Characterisation in Television Series [Studies in Corpus Linguistics, 106].

This list is based on CrossRef data as of 5 August 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.