Chapter 7 | Exercise 1

For this case study, you will need the data set ldt from the Rling package, which was discussed in Chapters 3 and 6.

1. Fit a linear regression model with the mean reaction time Mean_RT as the dependent variable and log-transformed corpus word frequency Freq (add 1 to avoid -Inf) and word length Length as predictors. Is the model significant? What is the predictive power of the model? What is the effect of the predictors on the response? Are these effects statistically significant?

2. Compute the 95% and 99% confidence intervals of the regression estimates in the model.

3. According to the fitted model, how much time on average would it take to recognize a word with 5 letters and a corpus frequency of 100? Compute the fitted value manually.

4. Which variables survive backward, forward and bidirectional stepwise selection?

5. Check if the linearity assumption is met with the help of the component-residual plot.

6. Check if the residuals are homoscedastic.

7. Test the model for multicollinearity between the predictors. Is it acceptable?

8. Are the residuals distributed normally?

9. Find two dangerous outliers and fit a new model without them. Can you see the difference in the new model? What about the distribution of residuals and heteroscedasticity?

10. Check if the new model overfits the data by using 200 bootstrap samples.

11. Test whether there is a significant interaction between the explanatory variables in the new model.

Key

Load the data and fit a linear regression model:

> library(Rling)
> data(ldt)
> m <- lm(Mean_RT ~ Length + log1p(Freq), data = ldt)
> summary(m)

Call:
lm(formula = Mean_RT ~ Length + log1p(Freq), data = ldt)

Residuals:
    Min      1Q  Median      3Q     Max 
-237.14  -72.58  -13.03   46.35  565.58 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  714.282     63.473  11.253  < 2e-16 ***
Length        26.132      5.237   4.990 2.66e-06 ***
log1p(Freq)  -21.313      4.971  -4.287 4.27e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 111.9 on 97 degrees of freedom
Multiple R-squared:  0.477,     Adjusted R-squared:  0.4662 
F-statistic: 44.24 on 2 and 97 DF,  p-value: 2.22e-14

The p-value based on the F-statistic is very small (2.22e-14), which means that the model as a whole is significant. The R² statistic is 0.477 (adjusted R² = 0.4662), so the model explains slightly less than half of the variance in the reaction times; probably not all relevant factors are taken into account. The estimated coefficient of Length is 26.132: with every additional letter in a stimulus, the mean reaction time increases by 26.132 ms on average, other things being equal. The coefficient of log-transformed Freq is -21.313: with every one-unit increase in log(Freq + 1), the mean reaction time decreases by 21.313 ms on average. Both effects are statistically significant (p < 0.001).
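The overall statistics in the summary are linked by simple formulas, which can be checked by hand. The sketch below uses only the rounded values printed above (R² = 0.477, 100 observations, 2 predictors), so it reproduces the output only up to rounding; the variable names are illustrative and not part of the original key.

```r
r2 <- 0.477   # multiple R-squared copied from summary(m)
n  <- 100     # number of words: 97 residual df + 3 estimated coefficients
k  <- 2       # number of predictors

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# F-statistic of the whole model, expressed through R-squared
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# p-value: upper tail of the F distribution with k and n - k - 1 df
p_val <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

adj_r2
f_stat
p_val
```

The results should be close to the adjusted R² of 0.4662 and the F-statistic of 44.24 reported in the summary.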

Compute the 95% confidence intervals (the default):

> confint(m)
                2.5 %    97.5 %
(Intercept) 588.30431 840.25871
Length       15.73760  36.52571
log1p(Freq) -31.17832 -11.44674

Compute the 99% confidence intervals:

> confint(m, level = 0.99)
                0.5 %     99.5 %
(Intercept) 547.50710 881.055913
Length       12.37153  39.891786
log1p(Freq) -34.37332  -8.251747
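
For a linear model, confint() computes each interval as the estimate plus or minus a quantile of the t distribution times the standard error. The following sketch rebuilds the intervals for the two slopes from the rounded estimates and standard errors in the summary; the helper name ci_manual is hypothetical, and small discrepancies in the last digits are due to rounding.

```r
# Estimates and standard errors copied from summary(m)
est <- c(Length = 26.132, `log1p(Freq)` = -21.313)
se  <- c(5.237, 4.971)
df  <- 97   # residual degrees of freedom

# Hypothetical helper: two-sided t-based interval at a given level
ci_manual <- function(level) {
  q <- qt(1 - (1 - level) / 2, df)
  cbind(lower = est - q * se, upper = est + q * se)
}

ci95 <- ci_manual(0.95)
ci99 <- ci_manual(0.99)
ci95
ci99
```

The 99% intervals are wider than the 95% intervals, and neither set includes zero, which agrees with the significance tests in the summary.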

The fitted value, computed manually from the rounded coefficients, is about 746.6 ms:

> 714.28 + 5*26.13 - 21.31*(log1p(100))
[1] 746.5818
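
The same arithmetic can be wrapped in a small function, which makes it easy to try other word lengths and frequencies. This is only a sketch: the function name fitted_rt is hypothetical, and the coefficients are copied from the summary at printed precision, so the result may differ slightly from the value above.

```r
# Coefficients copied from summary(m)
b <- c(intercept = 714.282, length = 26.132, logfreq = -21.313)

# Hypothetical helper: fitted mean reaction time in ms
fitted_rt <- function(length, freq) {
  unname(b["intercept"] + b["length"] * length + b["logfreq"] * log1p(freq))
}

rt_5_100 <- fitted_rt(5, 100)
rt_5_100

# With the model object m in memory, predict() applies log1p() to Freq itself:
# predict(m, newdata = data.frame(Length = 5, Freq = 100))
```

Because the formula contains log1p(Freq), the raw frequency (100) is supplied to predict(), not its logarithm.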

Both variables survive all three variable selection procedures. The code is as follows:

> m0 <- lm(Mean_RT ~ 1, data = ldt)
> step(m0, scope = ~ Length + log1p(Freq), direction = "forward") # forward selection
> step(m, direction = "backward") # backward selection
> step(m0, scope = ~ Length + log1p(Freq)) # bidirectional selection

All three methods converge: both explanatory variables contribute to the model substantially.

The component-residual plots do not reveal marked deviations from linearity:

> library(car) #if you haven’t loaded it yet
> crPlot(m, var = "Length")
> crPlot(m, var = "log1p(Freq)")

> library(car) #if you haven’t loaded it yet
> plot(m, which = 1)
> ncvTest(m)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 11.85373    Df = 1     p = 0.0005754606 
> ncvTest(m, ~ Length)
Non-constant Variance Score Test 
Variance formula: ~ Length 
Chisquare = 16.11164    Df = 1     p = 5.971584e-05
> ncvTest(m, ~ log1p(Freq))
Non-constant Variance Score Test 
Variance formula: ~ log1p(Freq) 
Chisquare = 3.289005    Df = 1     p = 0.06974526

The diagnostic plot and the non-constant variance tests suggest that there is some heteroscedasticity, in particular with respect to word length (p < 0.001). The test for log-transformed frequency is not significant at the 0.05 level (p = 0.07).

> library(car) # if you haven’t done so yet
> vif(m)
     Length log1p(Freq) 
   1.356621    1.356621

The VIF scores (about 1.36) are far below the commonly used thresholds of 5 or 10, so there is no reason to suspect problematic multicollinearity.
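
With exactly two predictors, both have the same VIF, and it equals 1 / (1 - r²), where r is the correlation between the predictors. The sketch below inverts this formula to see how strongly Length and log1p(Freq) are correlated; it uses only the VIF value printed above, and the sign of the correlation cannot be recovered from the VIF.

```r
vif_value <- 1.356621   # copied from vif(m)

# Squared correlation between the two predictors implied by the VIF
r_squared <- 1 - 1 / vif_value

# Absolute value of the correlation
abs_r <- sqrt(r_squared)

r_squared
abs_r
```

The predictors share roughly a quarter of their variance, which is a moderate correlation and not a cause for concern.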

> shapiro.test(m$residuals)

        Shapiro-Wilk normality test

data:  m$residuals
W = 0.89934, p-value = 1.317e-06

The Shapiro-Wilk test rejects the null hypothesis of normality (p < 0.001): the residuals are not normally distributed.

> library(car) # if you haven’t done so yet
> influencePlot(m, id.method = "identify") # in car 3.0 and later: id = list(method = "identify")

The word diacritical has a very large residual and a very high Cook’s distance, followed by dessertspoon.

> m1 <- lm(Mean_RT ~ Length + log1p(Freq), data = ldt[-c(29, 100),])
> summary(m1)
[output omitted]

From the summary one can see that the coefficient of Length has become smaller, and both R² measures have improved. Moreover, the non-constant variance tests no longer show significant evidence of heteroscedasticity at the 0.05 level (although the test for Length is borderline, p = 0.054), and the Shapiro-Wilk test no longer rejects normality of the residuals:

> ncvTest(m1)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 3.002491    Df = 1     p = 0.08313662
> ncvTest(m1, ~ Length)
Non-constant Variance Score Test 
Variance formula: ~ Length 
Chisquare = 3.720857    Df = 1     p = 0.05373677

> shapiro.test(m1$residuals)

        Shapiro-Wilk normality test

data:  m1$residuals
W = 0.98278, p-value = 0.2288

10.

> library(rms)
> m.ols <- ols(Mean_RT ~ Length + log1p(Freq), data = ldt[-c(29, 100),], x = TRUE, y = TRUE)
> validate(m.ols, B = 200)
[output omitted]

Since the algorithm involves random sampling from the original data set, the results will vary from one run to another. The slope optimism should be about 0.01 or smaller, so there should be no evidence of overfitting.

11.

> m.int <- lm(Mean_RT ~ Length*log1p(Freq), data = ldt[-c(29, 100),])
> anova(m1, m.int)
Analysis of Variance Table

Model 1: Mean_RT ~ Length + log1p(Freq)
Model 2: Mean_RT ~ Length * log1p(Freq)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     95 757650                           
2     94 753181  1    4469.3 0.5578  0.457

From the large p-value we can infer that the interaction is not statistically significant.
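
The F-test in the table can be reproduced from the residual sums of squares of the two nested models. The sketch below uses the rounded values printed in the ANOVA table, so the results match only up to rounding; the variable names are illustrative.

```r
rss_add <- 757650   # RSS of the additive model m1 (95 residual df)
rss_int <- 753181   # RSS of the interaction model m.int (94 residual df)
df_add  <- 95
df_int  <- 94

# Reduction in RSS per extra parameter, relative to the residual
# variance of the larger model
f_int <- ((rss_add - rss_int) / (df_add - df_int)) / (rss_int / df_int)

# Upper-tail p-value of the F distribution with 1 and 94 df
p_int <- pf(f_int, df1 = df_add - df_int, df2 = df_int, lower.tail = FALSE)

f_int
p_int
```

An F-statistic below 1 means that the interaction term reduces the residual sum of squares by less than one would expect from an average noise term.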