Ch. 3 | Exercise 2

Use the late Adam Kilgarriff’s web page http://www.kilgarriff.co.uk/BNClists/lemma.num to access the top twenty most frequent words in the British National Corpus. Save the selection as a text file.

1. Open the data in R.

2. Create a histogram, a density plot and a Q-Q plot for logarithmically transformed frequencies.

3. Use the Shapiro-Wilk normality test to see whether the log-transformed frequencies come from a normally distributed population.

Create a .txt file and open the data in R:

> bnc <- read.table("C:\\Your\\Directory\\bnc.txt")
> head(bnc)
  V1      V2  V3   V4
1  1 6187267 the  det
2  2 4239632  be    v
3  3 3093444  of prep
4  4 2687863 and conj
5  5 2186369   a  det
6  6 1924315  in prep

Note that the columns have no descriptive names: R assigns the defaults V1 to V4. For convenience, you may want to create a vector freq with the frequencies shown in column 2:

> freq <- bnc[, 2]
> hist(log(freq), main = "Histogram of log-transformed BNC frequencies")
> plot(density(log(freq)), main = "Density plot of log-transformed BNC frequencies")
> qqnorm(log(freq))
> qqline(log(freq))
> shapiro.test(log(freq))

	Shapiro-Wilk normality test

data:  log(freq)
W = 0.9104, p-value = 0.06474

Because the p-value (0.065) is greater than 0.05, the Shapiro-Wilk test does not allow us to reject the null hypothesis that the log-transformed frequencies come from a normally distributed population.
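The decision rule can also be checked in code rather than read off the printed output. The following is a minimal sketch, not part of the original solution: it uses only the six rows visible in the head(bnc) output, so its W and p-value will differ from the twenty-word result reported above. The column names (rank, freq, lemma, pos) and the object names bnc6, log_freq, sw and alpha are hypothetical labels chosen for illustration.

```r
# First six rows of the lemma list, copied from the head(bnc) output.
# The column names are hypothetical labels chosen for this sketch.
bnc6 <- read.table(
  text = "1 6187267 the det
2 4239632 be v
3 3093444 of prep
4 2687863 and conj
5 2186369 a det
6 1924315 in prep",
  col.names = c("rank", "freq", "lemma", "pos")
)

# Natural logarithm of the frequencies, as in the solution.
log_freq <- log(bnc6$freq)

# shapiro.test() returns an object of class "htest";
# the W statistic and the p-value can be extracted by name.
sw <- shapiro.test(log_freq)

# Conventional significance level (an assumption of this sketch).
alpha <- 0.05
reject_normality <- sw$p.value < alpha

cat("W =", round(unname(sw$statistic), 4), "\n")
cat("p =", round(sw$p.value, 4), "\n")
cat("Reject normality at alpha = 0.05?", reject_normality, "\n")
```

With the full twenty-word file, the same steps apply to bnc[, 2]; passing col.names to read.table() when reading the file would make the columns accessible by name rather than by position.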