Ch. 12 | Exercise 1

Case study
‘Are you a nerd or a geek?’

1.

What do you think is the difference between the nouns nerd and geek in contemporary English? What connotations do the nouns have (positive, negative, neutral)? Do you think the meaning and use of the words have changed in the past decades? In particular, has the connotation stayed the same? In which registers would you expect to find the words? Which of the two, nerds or geeks, would you be more inclined to speak about as a social group (in the plural)?

2.

Examine the data frame nerd in Rling. If one needs to fit a regression model with “Noun” as the response variable and the other four variables as predictors, would the data be sufficient?

3.

Change the reference level of the Num variable (grammatical number) from “pl” (assigned alphabetically) to “sg”.

4.

Fit a full model with all predictors without interactions. Using the table of coefficients of the model, answer the following questions:

a.

Why is the academic register not displayed in the table of coefficients?

b.

Which register differs most from the reference level with regard to the distribution of nerd and geek? Is the difference statistically significant at the 0.05 level? How can you interpret the effect?

c.

Is there a significant change in the use of the words between the 20th and the 21st centuries? What is the direction of this change?

d.

Is there a significant difference in connotation between nerd and geek? Are the odds of nerd vs. geek in positive evaluation contexts greater or smaller than the odds of nerd vs. geek in negative evaluation contexts?

e.

Are the odds of nerd vs. geek greater or smaller in the plural than in the singular? Is this difference statistically significant?

5.

Fit a GLM with an interaction term between Eval and Century. Is the interaction statistically significant? How can you interpret the interaction effect?

6.

Does the model with the interaction have sufficient predictive power?

7.

Is there evidence of overfitting? Run a bootstrap validation.

8.

Have your expectations (see Question 1) been borne out? Which additional factors would you add to the model to tell nerds from geeks?

2.

Yes. The proportion of the less frequent response is 0.49:

> library(Rling)
> data(nerd)
> summary(nerd$Noun)
geek nerd
 670  646

According to the rule of thumb, the maximum number of parameters in the model is the frequency of the less frequent outcome, nerd, divided by 10, that is, 646/10 ≈ 65. The actual number of parameters is much smaller: the four predictors, which have between two and four levels each, require only seven coefficients plus the intercept.
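
The rule-of-thumb check above can be reproduced in base R. This is a minimal sketch that hard-codes the two frequencies printed by summary(nerd$Noun); the object names (counts, prop_min, max_params, n_coefs) are illustrative only, and the count of seven coefficients is read off the coefficient table shown in answer 4.

```r
# Frequencies of the two outcomes, copied from summary(nerd$Noun)
counts <- c(geek = 670, nerd = 646)

# Proportion of the less frequent outcome
prop_min <- min(counts) / sum(counts)
round(prop_min, 2)

# Rule of thumb: at most one parameter per 10 observations
# of the less frequent outcome
max_params <- min(counts) / 10

# Coefficients in the main-effects model, excluding the intercept:
# Num (1) + Century (1) + Register (3) + Eval (2)
n_coefs <- 1 + 1 + 3 + 2
n_coefs <= max_params
```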

3.

You can use the following command:

> nerd$Num <- relevel(nerd$Num, ref = "sg")
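
To see what relevel() does without loading the data, here is a small sketch on a toy factor. The vector num and its values are hypothetical stand-ins for nerd$Num.

```r
# Toy factor standing in for nerd$Num (hypothetical values)
num <- factor(c("sg", "pl", "pl", "sg"))
levels(num)   # "pl" "sg": alphabetical order, so "pl" is the reference

# Move "sg" to the first position, which makes it the reference level
num <- relevel(num, ref = "sg")
levels(num)   # "sg" "pl"
```

Only the order of the levels changes; the observations themselves stay the same.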

4.

You can use the lrm() function from the package rms:

> library(rms)
> m <- lrm(Noun ~ Num + Century + Register + Eval, data = nerd)

In general, the model is significant. Next, see the table of coefficients:

[output omitted]
              Coef    S.E.   Wald Z P
Intercept      1.2314 0.3454  3.56  0.0004
Num=pl         0.2724 0.1291  2.11  0.0348
Century=XXI   -0.8063 0.1220 -6.61  0.0000
Register=MAG  -0.7457 0.3208 -2.32  0.0201
Register=NEWS -0.5962 0.3301 -1.81  0.0709
Register=SPOK -0.5729 0.3310 -1.73  0.0835
Eval=Neutral   0.0991 0.1942  0.51  0.6098
Eval=Positive -1.5084 0.2375 -6.35  0.0000

a.

The Academic register is not displayed because it is the reference level of the variable Register.

b.

Magazines, with the coefficient -0.7457. The difference is significant, p = 0.0201. The negative estimate shows that the odds of nerd vs. geek in magazines are lower than in academic texts. Put the other way round, the odds of nerd vs. geek in academic texts are higher than in magazines.

c.

Yes, p < 0.0001. The negative estimate shows that the odds of nerd vs. geek have decreased in the 21st century in comparison with the 20th century. In other words, geek has become more popular in comparison with nerd.

d.

Yes, there is a statistically significant difference in the odds of nerd vs. geek between the levels “Positive” and “Negative” (reference), p < 0.0001. The odds of nerd vs. geek are lower in positive evaluation contexts.

e.

Since the estimate of Num = “pl” is positive, the odds of nerd vs. geek are higher in the plural than in the singular. This difference is statistically significant (p = 0.0348).
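
The interpretations above rest on the sign of each log-odds estimate. Exponentiating the estimates gives odds ratios, which are often easier to read. The sketch below copies four estimates and their standard errors from the coefficient table by hand; the object names (coefs, ses, odds_ratios, z, p) are illustrative and not part of the rms output.

```r
# Estimates and standard errors copied from the coefficient table above
coefs <- c("Num=pl" = 0.2724, "Century=XXI" = -0.8063,
           "Register=MAG" = -0.7457, "Eval=Positive" = -1.5084)
ses <- c(0.1291, 0.1220, 0.3208, 0.2375)

# Odds ratios: values below 1 mean lower odds of nerd vs. geek
# than at the reference level, values above 1 mean higher odds
odds_ratios <- exp(coefs)
round(odds_ratios, 2)

# Wald z statistics and two-sided p-values, as reported in the table
z <- coefs / ses
p <- 2 * pnorm(-abs(z))
round(cbind(z, p), 4)
```

For example, exp(-0.8063) is about 0.45, so the odds of nerd vs. geek in the 21st century are less than half the odds in the 20th century, with the other predictors held constant.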

5.

Yes, it is significant:

> m.glm <- glm(Noun ~ Num + Century + Register + Eval, data = nerd, family = binomial)
> m.glm1 <- glm(Noun ~ Num + Century*Eval + Register, data = nerd, family = binomial)
> anova(m.glm, m.glm1, test = "Chisq")
Analysis of Deviance Table

Model 1: Noun ~ Num + Century + Register + Eval
Model 2: Noun ~ Num + Century * Eval + Register
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1      1308     1643.6
2      1306     1626.3  2   17.283 0.0001766 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> library(visreg)
> visreg(m.glm1, "Eval", by = "Century")

In both centuries, negative evaluation contexts increase the chances of nerd, but this difference has become more pronounced in the 21st century.
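
The p-value in the anova() table is a likelihood ratio test: the drop in residual deviance is compared with a chi-squared distribution whose degrees of freedom equal the number of added parameters. A minimal sketch, using the rounded deviances printed in the table (so the result differs slightly from the reported 17.283 and 0.0001766); the object names are illustrative.

```r
# Residual deviances and degrees of freedom from the anova() table above
dev_main <- 1643.6   # model without the interaction
df_main  <- 1308
dev_int  <- 1626.3   # model with Century * Eval
df_int   <- 1306

# Likelihood ratio statistic and its degrees of freedom
lr_stat <- dev_main - dev_int
lr_df   <- df_main - df_int

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p = p_value)
```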

6.

For convenience, refit the model using the function lrm():

> m1 <- lrm(Noun ~ Num + Century*Eval + Register, data = nerd)
> m1
Logistic Regression Model

lrm(formula = Noun ~ Num + Century * Eval + Register, data = nerd)

                      Model Likelihood     Discrimination    Rank Discrim.
                         Ratio Test            Indexes          Indexes
Obs          1316    LR chi2     197.60    R2       0.186    C       0.689
 geek         670    d.f.             9    g        0.936    Dxy     0.377
 nerd         646    Pr(> chi2) <0.0001    gr       2.549    gamma   0.399
max |deriv| 1e-09                          gp       0.190    tau-a   0.189
                                           Brier    0.216
[output omitted]

The C index (0.689) is slightly below 0.7, the conventional threshold for acceptable discrimination, which suggests that important predictors are missing from the model.
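
The C index ranges from 0.5 (chance-level discrimination) to 1 (perfect discrimination), and Somers' Dxy in the same output is a rescaling of it: Dxy = 2(C - 0.5). The short check below uses the two values printed above; the object names are illustrative.

```r
# Values copied from the lrm() output above
c_index      <- 0.689
dxy_reported <- 0.377

# Somers' Dxy rescales C so that 0 means chance and 1 means perfect
dxy_from_c <- 2 * (c_index - 0.5)
dxy_from_c   # agrees with the reported Dxy up to rounding
```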

7.

Use the function validate() from rms:

> m2 <- lrm(Noun ~ Num + Century*Eval + Register, data = nerd, x = TRUE, y = TRUE)
> validate(m2, B = 200)
[output omitted]

Because the bootstrap resamples the data at random, the exact figures vary from run to run. The optimism of the slope fluctuates around 0.05, which suggests mild overfitting.
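
To make the idea of slope optimism concrete, the following base R sketch mimics the logic on simulated data. It is not the rms implementation and does not use the nerd data: the data frame d, its variables, and the effect sizes are invented for illustration. A model is fitted to each bootstrap sample, its linear predictor is applied to the original data, and the calibration slope there is compared with the slope of 1 that the model has, by construction, on the sample it was fitted to.

```r
set.seed(1)

# Simulated toy data (hypothetical, NOT the nerd data set)
n <- 300
d <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
d$y <- rbinom(n, 1, plogis(0.8 * d$x1 - 0.5 * (d$x2 == "b")))

B <- 200                      # number of bootstrap samples
test_slope <- numeric(B)

for (b in seq_len(B)) {
  # Draw a bootstrap sample and fit the model to it
  boot <- d[sample(n, replace = TRUE), ]
  fit_b <- glm(y ~ x1 + x2, data = boot, family = binomial)

  # Linear predictor of the bootstrap model on the original data
  lp <- predict(fit_b, newdata = d, type = "link")

  # Calibration slope: regress the original outcome on that predictor
  test_slope[b] <- coef(glm(d$y ~ lp, family = binomial))[["lp"]]
}

# Optimism = apparent slope (1) minus the average test slope
optimism <- mean(1 - test_slope)
corrected_slope <- 1 - optimism
c(optimism = optimism, corrected = corrected_slope)
```

A corrected slope clearly below 1 indicates that the fitted coefficients are too extreme for new data, which is what overfitting means in this context.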