Ch. 12 | Exercise 1

Case study
‘Are you a nerd or a geek?’

What do you think is the difference between the nouns nerd and geek in contemporary English? What connotations do the nouns have (positive, negative, neutral)? Do you think the meaning and use of the words have changed in the past decades? In particular, has the connotation stayed the same? In which registers would you expect to find the words? Which group, nerds or geeks, would you be more inclined to speak about as a social group (in the plural)?

Examine the data frame nerd in Rling. If one needs to fit a regression model with “Noun” as the response variable, and the other five variables as predictors, would the data be sufficient?

Change the reference level of the Num variable (grammatical number) from “pl” (assigned alphabetically) to “sg”.

Fit a full model with all predictors without interactions. Using the table of coefficients of the model, answer the following questions:

Why is the academic register not displayed in the table of coefficients?

Which register is different with regard to the distribution of nerd and geek the most from the reference level? Is the difference statistically significant at the 0.05 level? How can you interpret the effect?

Is there a significant change in the use of the words between the 20th and the 21st centuries? What is the direction of this change?

Is there a significant difference in connotation between nerd and geek? Are the odds of nerd vs. geek in positive evaluation contexts greater or smaller than the odds of nerd vs. geek in negative evaluation contexts?

Are the odds of nerd vs. geek greater or smaller in the plural than in the singular? Is this difference statistically significant?

Fit a GLM model with an interaction term between Eval and Century. Is it statistically significant? How can you interpret the interaction effect?

Does the model with the interaction have sufficient predictive power?

Is there evidence of overfitting? Run a bootstrap validation.

Have your expectations (see Question 1) been borne out? Which additional factors would you add to the model to tell nerds from geeks?

Key

Yes. The proportion of the less frequent response is 0.49:

> library(Rling)
> data(nerd)
> summary(nerd$Noun)
geek nerd 
 670  646

According to the rule of thumb, the maximum number of parameters in the model is the frequency of the less frequent outcome, nerd, divided by 10, that is, 646/10 ≈ 65. The actual number of parameters in the model is much smaller, because the predictors have only two to four levels each.
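The arithmetic behind this rule of thumb can be checked in base R. This is only a sketch: the counts are typed in by hand from the summary() output above, and the number of coefficients (eight, including the intercept) is read off the table of coefficients shown later in this key.

```r
# Outcome frequencies copied by hand from summary(nerd$Noun)
counts <- c(geek = 670, nerd = 646)

# Proportion of the less frequent outcome
prop_minor <- min(counts) / sum(counts)

# Rule of thumb: at most one parameter per 10 cases of the rarer outcome
max_params <- min(counts) / 10

# Rows in the table of coefficients, intercept included (counted by hand)
n_coef <- 8

round(prop_minor, 2)
max_params
n_coef < max_params
```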

You can use the following command:

> nerd$Num <- relevel(nerd$Num, ref = "sg")
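The effect of relevel() can be seen on a small factor without loading the data. The vector toy_num below is a made-up illustration, not part of the nerd data set.

```r
# A toy factor; levels are ordered alphabetically by default
toy_num <- factor(c("sg", "pl", "pl", "sg"))
levels(toy_num)              # "pl" comes first, so it is the reference level

# relevel() moves the chosen level to the first position
toy_num_sg <- relevel(toy_num, ref = "sg")
levels(toy_num_sg)           # "sg" is now the reference level
```

The values of the factor do not change; only the order of the levels does, and with it the level that the regression treats as the baseline.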

You can use the lrm() function from the package rms:

> library(rms)
> m <- lrm(Noun ~ Num + Century + Register + Eval, data = nerd)

In general, the model is significant. Next, see the table of coefficients:

[output is omitted]
              Coef    S.E.   Wald Z P     
Intercept      1.2314 0.3454  3.56  0.0004
Num=pl         0.2724 0.1291  2.11  0.0348
Century=XXI   -0.8063 0.1220 -6.61  0.0000
Register=MAG  -0.7457 0.3208 -2.32  0.0201
Register=NEWS -0.5962 0.3301 -1.81  0.0709
Register=SPOK -0.5729 0.3310 -1.73  0.0835
Eval=Neutral   0.0991 0.1942  0.51  0.6098
Eval=Positive -1.5084 0.2375 -6.35  0.0000

The Academic register is not displayed because it is the reference level of the variable Register.

Magazines, with the coefficient –0.7457. The difference is significant, p = 0.0201. The negative estimate shows that the odds of nerd vs. geek in the magazines are lower than in the academic texts. Or the other way round, the odds of nerd vs. geek in the academic texts are higher than in the magazines.

Yes, p < 0.0001. The negative estimate shows that the odds of nerd vs. geek have decreased in the 21st century in comparison with the 20th century. In other words, geek has become more popular in comparison with nerd.

Yes, there is a statistically significant difference in the odds of nerd vs. geek between the levels “Positive” and “Negative” (reference). The odds of nerd vs. geek are lower in the positive evaluation contexts.

Since the estimate of Num = “pl” is positive, the odds of nerd vs. geek are higher in the plural than in the singular. This difference is statistically significant.
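Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which may be easier to interpret. The following sketch uses a few estimates and standard errors typed in by hand from the table of coefficients above, not extracted from the fitted model object. The Wald confidence intervals are an approximation added here for illustration.

```r
# Estimates and standard errors copied by hand from the table of coefficients
coefs <- c("Num=pl" = 0.2724, "Century=XXI" = -0.8063,
           "Register=MAG" = -0.7457, "Eval=Positive" = -1.5084)
se    <- c("Num=pl" = 0.1291, "Century=XXI" = 0.1220,
           "Register=MAG" = 0.3208, "Eval=Positive" = 0.2375)

# Odds ratios of nerd vs. geek relative to the reference level
odds_ratio <- exp(coefs)

# Approximate 95% Wald confidence intervals on the odds-ratio scale
z <- qnorm(0.975)
ci_low  <- exp(coefs - z * se)
ci_high <- exp(coefs + z * se)

round(cbind(odds_ratio, ci_low, ci_high), 3)
```

An odds ratio below 1 corresponds to a negative coefficient (lower odds of nerd), and an interval that excludes 1 corresponds to a difference that is significant at the 0.05 level.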

Yes, it is significant:

> m.glm <- glm(Noun ~ Num + Century + Register + Eval, data = nerd, family = binomial)
> m.glm1 <- glm(Noun ~ Num + Century*Eval + Register, data = nerd, family = binomial)
> anova(m.glm, m.glm1, test = "Chisq")
Analysis of Deviance Table

Model 1: Noun ~ Num + Century + Register + Eval
Model 2: Noun ~ Num + Century * Eval + Register
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1      1308     1643.6                          
2      1306     1626.3  2   17.283 0.0001766 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> library(visreg)
> visreg(m.glm1, "Eval", by = "Century")

In both centuries, negative evaluation contexts increase the chances of nerd, but this effect has become stronger in the 21st century.
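The p-value in the analysis of deviance table can be reproduced by hand, which shows what the likelihood ratio test does: the drop in residual deviance is compared with a chi-squared distribution whose degrees of freedom equal the number of added parameters. The numbers below are copied from the anova() output above.

```r
# Drop in deviance and added parameters, copied from the anova() output
deviance_drop <- 17.283
df_added      <- 2      # two interaction coefficients (Century x Eval)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(deviance_drop, df = df_added, lower.tail = FALSE)
signif(p_value, 4)
```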

For convenience, refit the model using the function lrm():

> m1 <- lrm(Noun ~ Num + Century*Eval + Register, data = nerd)
> m1
Logistic Regression Model

lrm(formula = Noun ~ Num + Century * Eval + Register, data = nerd)

                      Model Likelihood     Discrimination    Rank Discrim.    
                         Ratio Test            Indexes          Indexes       
Obs          1316    LR chi2     197.60    R2       0.186    C       0.689    
 geek         670    d.f.             9    g        0.936    Dxy     0.377    
 nerd         646    Pr(> chi2) <0.0001    gr       2.549    gamma   0.399    
max |deriv| 1e-09                          gp       0.190    tau-a   0.189    
                                           Brier    0.216
[output omitted]

The C index (0.689) is slightly below 0.7, the conventional threshold for acceptable discrimination. This suggests that the model is missing important predictors.
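The C index and Somers' Dxy in the output above carry the same information, since Dxy = 2 × (C − 0.5). A quick check with the printed value of C, typed in by hand:

```r
# C index copied from the lrm() output
c_index <- 0.689

# Somers' Dxy rescales C so that 0 means no discrimination and 1 means perfect
dxy <- 2 * (c_index - 0.5)
dxy
```

The result agrees with the printed Dxy of 0.377 up to rounding of the displayed C value.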

Use the function validate() from rms:

> m2 <- lrm(Noun ~ Num + Century*Eval + Register, data = nerd, x = TRUE, y = TRUE)
> validate(m2, B = 200)
[output omitted]

The optimism of the slope will fluctuate around 0.05. This suggests some mild overfitting.
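To see where the optimism of the slope comes from, the bootstrap procedure can be sketched in base R. This is a simplified illustration on simulated data, not the nerd data and not the actual code of validate(): the data frame d, the predictors x1 and x2, and the effect sizes are all invented for the example.

```r
set.seed(1)

# Simulated binary outcome with two made-up numeric predictors
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.8 * x1 - 0.5 * x2))
d  <- data.frame(y, x1, x2)

B <- 200                      # number of bootstrap samples
test_slope <- numeric(B)

for (b in seq_len(B)) {
  # Refit the model on a sample drawn with replacement
  idx   <- sample(n, replace = TRUE)
  fit_b <- glm(y ~ x1 + x2, data = d[idx, ], family = binomial)

  # Apply the bootstrap model to the original data and regress the
  # observed outcome on its linear predictor (calibration slope)
  lp <- predict(fit_b, newdata = d, type = "link")
  test_slope[b] <- coef(glm(d$y ~ lp, family = binomial))[2]
}

# On its own training sample the slope is 1 by construction, so the
# optimism is the average shortfall on the original data
optimism        <- 1 - mean(test_slope)
corrected_slope <- 1 - optimism
c(optimism = optimism, corrected_slope = corrected_slope)
```

A corrected slope noticeably below 1 indicates that the coefficients are too extreme for new data, which is what overfitting means here. Because the bootstrap samples are random, the result changes slightly from run to run unless a seed is set.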
