# Chapter 7 | Exercise 1

For this case study, you will need the data set `ldt` from the `Rling` package, which was discussed in Chapters 3 and 6.

Fit a linear regression model with the mean reaction times as the dependent variable and log-transformed corpus word frequency *Freq* (add 1 to avoid -Inf) and word length *Length* as predictors. Is the model significant? What is the predictive power of the model? What is the effect of the predictors on the response? Are these effects statistically significant?

Compute the 95% and 99% confidence intervals of the regression estimates in the model.

According to the fitted model, how much time on average would it take to recognize a word with 5 letters and a corpus frequency of 100? Compute the fitted value manually.

Which variables survive backward, forward and bidirectional stepwise selection?

Check if the linearity assumption is met with the help of the component-residual plot.

Check if the residuals are distributed homoscedastically.

Test the model for multicollinearity between the predictors. Is it acceptable?

Are the residuals distributed normally?

Find two dangerous outliers and fit a new model without them. Can you see the difference in the new model? What about the distribution of residuals and heteroscedasticity?

Check if the new model overfits the data by using 200 bootstrap samples.

Test whether there is significant interaction between the explanatory variables in the new model.

Load the data and fit a linear regression model:

```
> library(Rling)
> data(ldt)
> m <- lm(Mean_RT ~ Length + log1p(Freq), data = ldt)
> summary(m)
Call:
lm(formula = Mean_RT ~ Length + log1p(Freq), data = ldt)

Residuals:
    Min      1Q  Median      3Q     Max
-237.14  -72.58  -13.03   46.35  565.58

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  714.282     63.473  11.253  < 2e-16 ***
Length        26.132      5.237   4.990 2.66e-06 ***
log1p(Freq)  -21.313      4.971  -4.287 4.27e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 111.9 on 97 degrees of freedom
Multiple R-squared:  0.477,	Adjusted R-squared:  0.4662
F-statistic: 44.24 on 2 and 97 DF,  p-value: 2.22e-14
```

The *p*-value based on the *F*-statistic is very small, which means that the model as a whole is significant. The *R*² statistic is 0.477, which suggests that the model has some explanatory power, although probably not all relevant factors are taken into account. The estimated coefficient of *Length* is 26.132. This means that with every additional letter of a stimulus, the reaction time increases by 26.132 ms. The coefficient of log-transformed *Freq* is −21.313. This means that with every unit increase in log-transformed frequency (plus 1), the reaction time decreases by 21.313 ms. Both effects are statistically significant.

Compute the 95% confidence intervals (the default):

```
> confint(m)
                 2.5 %    97.5 %
(Intercept) 588.30431 840.25871
Length       15.73760  36.52571
log1p(Freq) -31.17832 -11.44674
```

Compute the 99% confidence intervals:

```
> confint(m, level = 0.99)
                 0.5 %     99.5 %
(Intercept) 547.50710 881.055913
Length       12.37153  39.891786
log1p(Freq) -34.37332  -8.251747
```

The fitted value, computed manually from the rounded coefficient estimates, is as follows:

```
> 714.28 + 5*26.13 - 21.31*(log1p(100))
[1] 746.5818
```
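As a cross-check (a sketch, assuming the model `m` fitted above), `predict()` returns the same fitted value without rounding the coefficients. Note that the raw frequency is supplied, since the model formula applies `log1p()` itself:

```
# Cross-check with predict(); assumes m is the model fitted above.
# The result should agree with the manual computation up to rounding
# of the coefficient estimates.
predict(m, newdata = data.frame(Length = 5, Freq = 100))
```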

Both variables survive all three variable selection procedures. The code is as follows:

```
> m0 <- lm(Mean_RT ~ 1, data = ldt)
> step(m0, scope = ~ Length + log1p(Freq), direction = "forward") # forward selection
> step(m, direction = "backward") # backward selection
> step(m0, scope = ~ Length + log1p(Freq)) # bidirectional selection
```

All three methods converge: both explanatory variables contribute to the model substantially.
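Under the hood, `step()` compares candidate models by AIC. A quick manual check (a sketch, assuming `m0` and `m` fitted as above) shows why both predictors are retained:

```
# Manual AIC comparison; assumes m0 (intercept-only) and m were fitted above.
# step() uses extractAIC(), which differs from AIC() by a constant for a
# given data set, so the ranking of the models is the same.
AIC(m0)                               # intercept only
AIC(update(m0, . ~ . + Length))       # Length only
AIC(update(m0, . ~ . + log1p(Freq)))  # log frequency only
AIC(m)                                # both predictors: the lowest AIC
```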

The component-residual plots do not reveal marked deviations from linearity:

```
> library(car) # if you haven’t loaded it yet
> crPlot(m, var = "Length")
> crPlot(m, var = "log1p(Freq)")
```

To check homoscedasticity, inspect the residuals-versus-fitted-values plot and run the non-constant variance score test:

```
> library(car) # if you haven’t loaded it yet
> plot(m, which = 1)
> ncvTest(m)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 11.85373 Df = 1 p = 0.0005754606
> ncvTest(m, ~ Length)
Non-constant Variance Score Test
Variance formula: ~ Length
Chisquare = 16.11164 Df = 1 p = 5.971584e-05
> ncvTest(m, ~ log1p(Freq))
Non-constant Variance Score Test
Variance formula: ~ log1p(Freq)
Chisquare = 3.289005 Df = 1 p = 0.06974526
```

The diagnostic plot and the non-constant variance tests suggest that there is some heteroscedasticity, particularly in the relationship between the response and word length.

```
> library(car) # if you haven’t done so yet
> vif(m)
     Length log1p(Freq)
   1.356621    1.356621
```

The VIF scores are low, so there is no reason to suspect multicollinearity.
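With only two predictors, the two VIF scores are necessarily equal, since each equals 1/(1 − r²), where r is the correlation between the predictors. One can recover the implied correlation from the reported score (a self-contained arithmetic check):

```
# For two predictors, VIF = 1 / (1 - r^2); solve for |r| from the reported VIF.
vif.score <- 1.356621
r <- sqrt(1 - 1/vif.score)
r  # about 0.51: a moderate correlation between Length and log frequency
```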

```
> shapiro.test(m$residuals)
Shapiro-Wilk normality test
data: m$residuals
W = 0.89934, p-value = 1.317e-06
```

The residuals are not normally distributed.
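The departure from normality can also be inspected visually with a normal Q–Q plot (a sketch using the standard `lm` diagnostic, assuming `m` from above):

```
# Normal Q-Q plot of the standardized residuals; assumes m was fitted above.
# Points curving away from the line in the upper tail reflect the positive
# skew visible in the residual summary (maximum residual 565.58).
plot(m, which = 2)
```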

```
> library(car) # if you haven’t done so yet
> influencePlot(m, id.method = "identify")
```

The word *diacritical* has a very high residual and Cook’s score, followed by *dessertspoon*.

```
> m1 <- lm(Mean_RT ~ Length + log1p(Freq), data = ldt[-c(29, 100),])
> summary(m1)
[output omitted]
```

From the summary one can see that the coefficient of *Length* has become smaller, and both *R*² measures have improved. Moreover, the non-constant variance tests no longer show significant evidence of heteroscedasticity, and the residuals are now normally distributed:

```
> ncvTest(m1)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 3.002491 Df = 1 p = 0.08313662
> ncvTest(m1, ~ Length)
Non-constant Variance Score Test
Variance formula: ~ Length
Chisquare = 3.720857 Df = 1 p = 0.05373677
> shapiro.test(m1$residuals)
Shapiro-Wilk normality test
data: m1$residuals
W = 0.98278, p-value = 0.2288
```

```
> library(rms)
> m.ols <- ols(Mean_RT ~ Length + log1p(Freq), data = ldt[-c(29, 100),], x = TRUE, y = TRUE)
> validate(m.ols, B = 200)
[output omitted]
```

Since the algorithm involves random sampling from the original data set, the results vary from run to run. The optimism of the slope should be around 0.01 or smaller, so there is no evidence of overfitting.
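Because `validate()` resamples at random, a common convention (not required by the exercise) is to fix the random seed so that a run can be reproduced:

```
# Fixing the seed makes the bootstrap validation reproducible;
# assumes m.ols was fitted above. The seed value 7 is arbitrary.
set.seed(7)
validate(m.ols, B = 200)
```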

```
> m.int <- lm(Mean_RT ~ Length*log1p(Freq), data = ldt[-c(29, 100),])
> anova(m1, m.int)
Analysis of Variance Table

Model 1: Mean_RT ~ Length + log1p(Freq)
Model 2: Mean_RT ~ Length * log1p(Freq)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     95 757650
2     94 753181  1    4469.3 0.5578  0.457
```

From the large *p*-value we can infer that the interaction is not statistically significant.
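The same conclusion follows from the t-test of the interaction coefficient: since the two models differ by a single term, its *p*-value equals the one from the *F*-test above (a sketch, assuming `m.int` from above):

```
# The row for Length:log1p(Freq) should show the same p-value (0.457)
# as the anova() comparison; assumes m.int was fitted above.
summary(m.int)
```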