Ch. 3 | Exercise 2

Use the late Adam Kilgarriff’s web page http://www.kilgarriff.co.uk/BNClists/lemma.num to access the top twenty most frequent words in the British National Corpus. Save the selection as a text file.

Open the data in R.

Create a histogram, a density plot and a Q-Q plot for logarithmically transformed frequencies.

Use the Shapiro-Wilk normality test to see whether the log-transformed frequencies come from a normally distributed population.

Key

Create a .txt file and open the data in R:

> bnc <- read.table("C:\\Your\\Directory\\bnc.txt")
> head(bnc)
  V1      V2  V3   V4
1  1 6187267 the  det
2  2 4239632  be    v
3  3 3093444  of prep
4  4 2687863 and conj
5  5 2186369   a  det
6  6 1924315  in prep

Note that the columns have no names. For convenience, you may want to create a vector freq with the frequencies shown in column 2:

> freq <- bnc[, 2]
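As an alternative sketch, the columns can be given descriptive names so that the frequencies are accessed by name rather than by position. The names rank, freq, lemma and pos are assumptions chosen for illustration, since the file itself supplies none. So that the snippet runs without the file, the data frame is rebuilt by hand from the six rows printed by head(bnc).

```r
# Rebuild the six rows shown by head(bnc) so the example is self-contained.
bnc_head <- data.frame(
  V1 = 1:6,
  V2 = c(6187267, 4239632, 3093444, 2687863, 2186369, 1924315),
  V3 = c("the", "be", "of", "and", "a", "in"),
  V4 = c("det", "v", "prep", "conj", "det", "prep"),
  stringsAsFactors = FALSE
)

# Hypothetical column names; the original file has no header line.
names(bnc_head) <- c("rank", "freq", "lemma", "pos")

# Access the frequencies by name instead of by column index.
freq_head <- bnc_head$freq

# log() is the natural logarithm, the same transformation as log(freq).
log_freq_head <- log(freq_head)
```

With the full twenty-row data frame, the same names() call would apply unchanged, and bnc$freq would then be equivalent to bnc[, 2].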

> hist(log(freq), main = "Histogram of log-transformed BNC frequencies")
> plot(density(log(freq)), main = "Density plot of log-transformed BNC frequencies")
> qqnorm(log(freq))
> qqline(log(freq))

> shapiro.test(log(freq))

	Shapiro-Wilk normality test

data:  log(freq)
W = 0.9104, p-value = 0.06474

Because the p-value (0.06474) is greater than 0.05, the Shapiro-Wilk test does not allow us to reject the null hypothesis that the log-transformed data come from a normally distributed population.
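The decision rule can also be written out in code. The sketch below is only illustrative: it uses the six frequencies printed by head(bnc) rather than the full top twenty, so its W and p-value will differ from those reported in the key. The names freq_head, alpha and decision are hypothetical, and the 0.05 threshold is the conventional significance level assumed here.

```r
# Frequencies of the six lemmas shown by head(bnc); NOT the full top twenty.
freq_head <- c(6187267, 4239632, 3093444, 2687863, 2186369, 1924315)

# Conventional significance level (an assumption, not fixed by the test).
alpha <- 0.05

# shapiro.test() returns an object of class "htest".
res <- shapiro.test(log(freq_head))

# The test statistic and the p-value are stored as list components.
w_stat <- unname(res$statistic)
p_val <- res$p.value

# Reject the null hypothesis of normality only if p falls below alpha.
if (p_val < alpha) {
  decision <- "reject the null hypothesis of normality"
} else {
  decision <- "fail to reject the null hypothesis of normality"
}

cat("W =", round(w_stat, 4), " p-value =", round(p_val, 4), "\n")
cat("Decision:", decision, "\n")
```

Note that failing to reject the null hypothesis does not prove that the data are normally distributed, especially with a sample this small.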