# Chapter 12 | Exercise 1

### ‘Are you a nerd or a geek?’

What do you think is the difference between the nouns *nerd* and *geek* in contemporary English? What connotations do the nouns have (positive, negative, neutral)? Do you think the meaning and use of the words have changed in the past decades? In particular, has the connotation stayed the same? In which registers would you expect to find the words? Which of the two, *nerds* or *geeks*, would you be more inclined to speak about as a social group (in the plural)?

Examine the data frame `nerd` in `Rling`. If one needs to fit a regression model with “Noun” as the response variable, and the other five variables as predictors, would the data be sufficient?

Change the reference level of the *Num* variable (grammatical number) from “pl” (assigned alphabetically) to “sg”.

Fit a full model with all predictors without interactions. Using the table of coefficients of the model, answer the following questions:

Why is the academic register not displayed in the table of coefficients?

Which register differs most from the reference level with regard to the distribution of *nerd* and *geek*? Is the difference statistically significant at the 0.05 level? How can you interpret the effect?

Is there a significant change in the use of the words between the 20^{th} and the 21^{st} centuries? What is the direction of this change?

Is there a significant difference in connotation between *nerd* and *geek*? Are the odds of *nerd* vs. *geek* in positive evaluation contexts greater or smaller than the odds of *nerd* vs. *geek* in negative evaluation contexts?

Are the odds of *nerd* vs. *geek* greater or smaller in the plural than in the singular? Is this difference statistically significant?

Fit a GLM model with an interaction term between *Eval* and *Century*. Is it statistically significant? How can you interpret the interaction effect?

Does the model with the interaction have sufficient predictive power?

Is there evidence of overfitting? Run a bootstrap validation.

Have your expectations (see Question 1) been borne out? Which additional factors would you add to the model to tell nerds from geeks?

Yes. The proportion of the less frequent response is 0.49:

```
> library(Rling)
> data(nerd)
> summary(nerd$Noun)
geek nerd
670 646
```

According to the rule of thumb, the maximum number of parameters in the model is the frequency of the least frequent outcome, *nerd*, divided by 10, that is, 646/10 ≈ 65. The actual number of parameters is much smaller: a model with all predictors and no interactions has only seven coefficients besides the intercept.
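The arithmetic behind this rule of thumb can be checked in a few lines of base R. This is only a sketch: the counts are copied from the summary above, and the parameter tally assumes a main-effects model in which each predictor contributes its number of levels minus one (one for *Num*, one for *Century*, three for *Register*, two for *Eval*).

```r
# Counts copied from summary(nerd$Noun) above
counts <- c(geek = 670, nerd = 646)

# Proportion of the less frequent response
prop_min <- min(counts) / sum(counts)
round(prop_min, 2)

# Rule of thumb: at most one parameter per 10 observations of the rarer outcome
max_params <- min(counts) / 10

# Non-intercept coefficients in the main-effects model (assumed tally:
# number of levels minus one for each predictor)
n_params <- c(Num = 1, Century = 1, Register = 3, Eval = 2)
sum(n_params) <= max_params
```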

You can use the following command:

```
> nerd$Num <- relevel(nerd$Num, ref = "sg")
```
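To see what `relevel()` does without loading the data, here is a minimal sketch on a toy factor (the values are made up for illustration and are not taken from `nerd`).

```r
# Toy factor (hypothetical values, not the nerd data)
num <- factor(c("pl", "sg", "sg", "pl"))
levels(num)                      # default order is alphabetical: "pl" "sg"

num <- relevel(num, ref = "sg")  # move "sg" to the first position
levels(num)                      # now "sg" "pl"
```

The first level of a factor is the one that regression functions treat as the reference level, so running `levels(nerd$Num)` after the command above is a quick way to confirm the change.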

You can use the `lrm()` function from the package `rms`:

```
> library(rms)
> m <- lrm(Noun ~ Num + Century + Register + Eval, data = nerd)
```

In general, the model is significant. Next, see the table of coefficients:

```
[output is omitted]
              Coef    S.E.   Wald Z P
Intercept      1.2314 0.3454  3.56  0.0004
Num=pl         0.2724 0.1291  2.11  0.0348
Century=XXI   -0.8063 0.1220 -6.61  0.0000
Register=MAG  -0.7457 0.3208 -2.32  0.0201
Register=NEWS -0.5962 0.3301 -1.81  0.0709
Register=SPOK -0.5729 0.3310 -1.73  0.0835
Eval=Neutral   0.0991 0.1942  0.51  0.6098
Eval=Positive -1.5084 0.2375 -6.35  0.0000
```

The Academic register is not displayed because it is the reference level of the variable *Register*.

Magazines, with the coefficient –0.7457. The difference is significant, *p* = 0.0201. The negative estimate shows that the odds of *nerd* vs. *geek* in the magazines are lower than in the academic texts. Or the other way round, the odds of *nerd* vs. *geek* in the academic texts are higher than in the magazines.

Yes, *p* < 0.0001. The negative estimate shows that the odds of *nerd* vs. *geek* have decreased in the 21^{st} century in comparison with the 20^{th} century. In other words, *geek* has become more popular in comparison with *nerd*.

Yes, there is a statistically significant difference in the odds of *nerd* vs. *geek* between the levels “Positive” and “Negative” (reference). The odds of *nerd* vs. *geek* are lower in the positive evaluation contexts.

Since the estimate of Num = “pl” is positive, the odds of *nerd* vs. *geek* are higher in the plural than in the singular. This difference is statistically significant.
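The coefficients discussed above are log odds ratios, so exponentiating them gives odds ratios that are easier to interpret. The following sketch recomputes the Wald statistics, two-sided *p*-values, odds ratios and Wald 95% confidence intervals from the rounded estimates and standard errors in the table. The vector names are chosen here for illustration, and the results may differ from the `lrm()` output in the last digit because the inputs are rounded.

```r
# Estimates and standard errors copied from the coefficient table above
coefs <- c(Num_pl = 0.2724, Century_XXI = -0.8063,
           Register_MAG = -0.7457, Eval_Positive = -1.5084)
ses   <- c(Num_pl = 0.1291, Century_XXI = 0.1220,
           Register_MAG = 0.3208, Eval_Positive = 0.2375)

z  <- coefs / ses          # Wald z statistics
p  <- 2 * pnorm(-abs(z))   # two-sided p-values
or <- exp(coefs)           # odds ratios relative to the reference level

# Wald 95% confidence intervals for the odds ratios
ci <- exp(cbind(lower = coefs - qnorm(0.975) * ses,
                upper = coefs + qnorm(0.975) * ses))

round(cbind(z, p, OR = or, ci), 3)
```

For example, the odds ratio for magazines is about 0.47, so the odds of *nerd* vs. *geek* in magazines are roughly half the odds in academic texts. The odds ratio for the plural is about 1.31.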

Yes, it is significant:

```
> m.glm <- glm(Noun ~ Num + Century + Register + Eval, data = nerd, family = binomial)
> m.glm1 <- glm(Noun ~ Num + Century*Eval + Register, data = nerd, family = binomial)
> anova(m.glm, m.glm1, test = "Chisq")
Analysis of Deviance Table

Model 1: Noun ~ Num + Century + Register + Eval
Model 2: Noun ~ Num + Century * Eval + Register
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1      1308     1643.6
2      1306     1626.3  2   17.283 0.0001766 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> library(visreg)
> visreg(m.glm1, "Eval", by = "Century")
```

In both centuries, the negative evaluation contexts increase the chances of *nerd*, but this difference has become more dramatic in the XXI century.
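The *p*-value in the analysis of deviance table can be reproduced by hand, which makes the logic of the likelihood-ratio test explicit: the drop in residual deviance is compared with a chi-squared distribution whose degrees of freedom equal the number of added parameters. The sketch below uses the rounded deviances from the table, so the result matches the reported value only approximately.

```r
# Residual deviances and degrees of freedom from the analysis of deviance table
dev0 <- 1643.6; df0 <- 1308   # model without the interaction
dev1 <- 1626.3; df1 <- 1306   # model with Century * Eval

lr_stat <- dev0 - dev1        # likelihood-ratio statistic
lr_df   <- df0 - df1          # two extra parameters for the interaction

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_value
```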

For convenience, refit the model using the function `lrm()`:

```
> m1 <- lrm(Noun ~ Num + Century*Eval + Register, data = nerd)
> m1
Logistic Regression Model

lrm(formula = Noun ~ Num + Century * Eval + Register, data = nerd)

                      Model Likelihood     Discrimination    Rank Discrim.
                         Ratio Test           Indexes           Indexes
Obs          1316    LR chi2     197.60    R2       0.186    C       0.689
 geek         670    d.f.             9    g        0.936    Dxy     0.377
 nerd         646    Pr(> chi2) <0.0001    gr       2.549    gamma   0.399
max |deriv| 1e-09                          gp       0.190    tau-a   0.189
                                           Brier    0.216
[output omitted]
```

The *C* index (0.689) is somewhat lower than 0.7, the level when the model becomes acceptable, which means that we are missing important predictors.
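As a reminder of what the *C* index measures, the following sketch computes it on a toy example as the proportion of concordant pairs: pairs of observations with different outcomes in which the observation with outcome 1 receives the higher predicted probability. The probabilities and outcomes are made up for illustration and have nothing to do with the `nerd` data.

```r
# Toy example (hypothetical probabilities and outcomes, not the nerd data)
p_hat <- c(0.9, 0.8, 0.4, 0.3)   # predicted probabilities of the outcome coded 1
y     <- c(1,   0,   1,   0)     # observed outcomes

# Compare every observation with y = 1 against every observation with y = 0
cmp_gt <- outer(p_hat[y == 1], p_hat[y == 0], ">")
cmp_eq <- outer(p_hat[y == 1], p_hat[y == 0], "==")

c_index <- mean(cmp_gt) + 0.5 * mean(cmp_eq)  # ties count as half-concordant
dxy     <- 2 * (c_index - 0.5)                # Somers' Dxy

c(C = c_index, Dxy = dxy)                     # C = 0.75, Dxy = 0.5
```

The same relation holds in the output above: 2 × (0.689 − 0.5) = 0.378, which agrees with the reported Dxy of 0.377 up to rounding. A *C* of 0.5 corresponds to chance-level discrimination and 1 to perfect discrimination.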

Use the function `validate()` from `rms`:

```
> m2 <- lrm(Noun ~ Num + Century*Eval + Register, data = nerd, x = TRUE, y = TRUE)
> validate(m2, B = 200)
[output omitted]
```

The optimism of the slope will fluctuate around 0.05. This suggests some mild overfitting.
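To make the idea of slope optimism more concrete, here is a rough base-R sketch of an optimism bootstrap for the calibration slope. It is not the `rms` implementation and it does not use the `nerd` data: the data are simulated, and the names `d`, `slope_on` and `B` are introduced here purely for illustration. In each bootstrap replicate a model is fitted to a resample, and its calibration slope on the resample (always 1 for a maximum likelihood fit) is compared with its slope on the original data.

```r
set.seed(1)

# Simulated (hypothetical) data with two real predictors and one noise predictor
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.8 * d$x1 - 0.5 * d$x2))

# Calibration slope: regress the outcome on the model's linear predictor
slope_on <- function(fit, newdata) {
  lp <- predict(fit, newdata = newdata, type = "link")
  unname(coef(glm(newdata$y ~ lp, family = binomial))[2])
}

B <- 100  # number of bootstrap resamples
optimism <- replicate(B, {
  boot  <- d[sample(n, replace = TRUE), ]
  fit_b <- glm(y ~ x1 + x2 + x3, data = boot, family = binomial)
  # slope on the training resample minus slope on the original data
  slope_on(fit_b, boot) - slope_on(fit_b, d)
})

mean(optimism)      # estimated optimism of the slope
1 - mean(optimism)  # optimism-corrected slope
```

Read in the same way, an optimism of about 0.05 corresponds to a corrected slope of about 0.95: the coefficients would need to be shrunk by roughly 5% to calibrate well on new data.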