Spurious effects in variational corpus linguistics: Identification and implications of confounding

Tummers, José; Speelman, Dirk; Geeraerts, Dirk

doi:10.1075/ijcl.19.4.02tum

Article published in:

International Journal of Corpus Linguistics
Vol. 19:4 (2014), pp. 478–504

José Tummers | Leuven University College

Dirk Speelman | KU Leuven

Dirk Geeraerts

As repositories of spontaneously realized language, corpora generally have an uncontrolled and unbalanced structure in which all variables operate simultaneously. Consequently, a variable’s real effect can be concealed when it is studied in isolation, because the impact of other, potentially confounding variables is then excluded. Using a variational case study, the alternation between inflected and uninflected attributive adjectives in Dutch, we demonstrate how confounding variables alter the impact of explanatory variables on the response variable, resulting in spurious effects in the bivariate analyses. Multiple Correspondence Analysis is used as a heuristic tool to reveal the association patterns between explanatory variables in the data matrix that induce these spurious effects. Based on these findings, we argue for a thorough analysis of the patterns in the database, to gain insight into the underlying associations between explanatory variables, before their real impact on the response variable is modeled in a multivariate model.
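The kind of spurious bivariate effect described above can be illustrated with a small sketch. The counts below are invented purely for illustration and are not taken from the article; the variable names (`x1`/`x0` for two levels of an explanatory variable, `z_a`/`z_b` for two levels of a confounder) are likewise hypothetical. The sketch compares the odds ratio for the inflected variant in the pooled (bivariate) table with the odds ratios inside each level of the confounder.

```python
from collections import defaultdict

# Hypothetical counts, invented for illustration (NOT data from the article).
# Key: (confounder level, explanatory level) -> (inflected, uninflected)
counts = {
    ("z_a", "x1"): (80, 20),
    ("z_a", "x0"): (8, 2),
    ("z_b", "x1"): (2, 8),
    ("z_b", "x0"): (20, 80),
}


def odds_ratio(x1, x0):
    """Odds ratio of 'inflected' for x1 versus x0, given (inflected, uninflected) pairs."""
    (a, b), (c, d) = x1, x0
    return (a * d) / (b * c)


def pool_over_confounder(table):
    """Collapse the table over the confounder, as a bivariate analysis would."""
    totals = defaultdict(lambda: [0, 0])
    for (_, x_level), (inflected, uninflected) in table.items():
        totals[x_level][0] += inflected
        totals[x_level][1] += uninflected
    return {x_level: tuple(pair) for x_level, pair in totals.items()}


def odds_ratios_by_stratum(table):
    """Odds ratio of x1 versus x0 within each level of the confounder."""
    strata = sorted({z_level for z_level, _ in table})
    return {z: odds_ratio(table[(z, "x1")], table[(z, "x0")]) for z in strata}


marginal = pool_over_confounder(counts)
pooled_or = odds_ratio(marginal["x1"], marginal["x0"])
stratum_ors = odds_ratios_by_stratum(counts)

print(f"pooled odds ratio: {pooled_or:.2f}")
for z_level, value in stratum_ors.items():
    print(f"odds ratio within {z_level}: {value:.2f}")
```

In this constructed example the pooled table suggests a strong effect of the explanatory variable, while within each level of the confounder the odds ratio is 1. The apparent effect arises only because the explanatory variable and the confounder are strongly associated in the hypothetical data, which is the kind of association pattern an exploratory analysis of the data matrix is meant to expose.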

Keywords: confounding, Multiple Correspondence Analysis, variational linguistics, spurious effects

Published online: 25 October 2014

Cited by (1)

Tummers, José, Dirk Speelman, Kris Heylen & Dirk Geeraerts

2015. Lectal constraining of lexical collocations. Constructions and Frames 7:1, pp. 1 ff.

This list is based on CrossRef data as of 5 August 2024 and may not be complete.