Chapter 3 | Exercise 2
Use the late Adam Kilgarriff’s web page http://www.kilgarriff.co.uk/BNClists/lemma.num to access the top twenty most frequent words in the British National Corpus. Save the selection as a text file.
Open the data in R.
Create a histogram, a density plot and a Q-Q plot for logarithmically transformed frequencies.
Use the Shapiro-Wilk normality test to see whether the log-transformed frequencies come from a normally distributed population.
Create a .txt file and open the data in R:
> bnc <- read.table("C:\\Your\\Directory\\bnc.txt")
> head(bnc)
V1 V2 V3 V4
1 1 6187267 the det
2 2 4239632 be v
3 3 3093444 of prep
4 4 2687863 and conj
5 5 2186369 a det
6 6 1924315 in prep
Note that the columns have no meaningful names: R assigns the defaults V1 to V4. For convenience, you may want to create a vector freq with the frequencies shown in column 2:
> freq <- bnc[, 2]
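As an alternative to indexing by position, the columns can be given names when the file is read. The sketch below is only an illustration: it embeds the six rows shown by head(bnc) as text so that it runs without the file, and the column names rank, freq, lemma and pos, as well as the object names bnc_named and freq_named, are hypothetical labels that do not appear in the original data.

```r
# The six rows printed by head(bnc) above, embedded as text so the
# example runs without bnc.txt (the file used in the exercise has twenty rows).
bnc_text <- "1 6187267 the det
2 4239632 be v
3 3093444 of prep
4 2687863 and conj
5 2186369 a det
6 1924315 in prep"

# The column names are hypothetical labels chosen for readability.
bnc_named <- read.table(text = bnc_text,
                        col.names = c("rank", "freq", "lemma", "pos"),
                        stringsAsFactors = FALSE)

# The frequencies can now be extracted by name instead of by position.
freq_named <- bnc_named$freq
log(freq_named)
```

With the real file, the same col.names argument could be passed to the read.table() call that reads bnc.txt.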
> hist(log(freq), main = "Histogram of log-transformed BNC frequencies")
> plot(density(log(freq)), main = "Density plot of log-transformed BNC frequencies")
> qqnorm(log(freq))
> qqline(log(freq))
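To see what the Q-Q plot compares, the coordinates can be computed without drawing anything. The following is a minimal sketch, assuming only the six frequencies shown by head(bnc) rather than the full set of twenty, so the values differ from those in the exercise; the object names freq_six, log_freq and qq are hypothetical.

```r
# Only the six frequencies shown by head(bnc); the exercise uses twenty.
freq_six <- c(6187267, 4239632, 3093444, 2687863, 2186369, 1924315)
log_freq <- log(freq_six)

# plot.it = FALSE returns the coordinates instead of drawing the plot:
# x holds the theoretical normal quantiles, y the observed values.
qq <- qqnorm(log_freq, plot.it = FALSE)

# The theoretical quantiles are standard normal quantiles evaluated at
# evenly spread probability points, one per observation.
theoretical <- qnorm(ppoints(length(log_freq)))

# Sorted, the two sets of quantiles coincide.
all.equal(sort(qq$x), theoretical)
```

If the data were normally distributed, the points (qq$x, qq$y) would fall close to the straight line that qqline() adds.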
> shapiro.test(log(freq))
Shapiro-Wilk normality test
data: log(freq)
W = 0.9104, p-value = 0.06474
Because the p-value (0.065) is greater than 0.05, the Shapiro-Wilk test does not allow us to reject the null hypothesis that the log-transformed data come from a normally distributed population.
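The decision can also be read off the test object rather than the printed output. The sketch below is an illustration under stated assumptions: it uses only the six frequencies shown by head(bnc), so W and the p-value will not match the values reported above for twenty words, and the names freq_six, alpha and sw are hypothetical.

```r
# Only the six frequencies shown by head(bnc); the exercise uses twenty,
# so the statistic and p-value here differ from the reported ones.
freq_six <- c(6187267, 4239632, 3093444, 2687863, 2186369, 1924315)

# Conventional significance level, as in the interpretation above.
alpha <- 0.05

# shapiro.test() returns a list whose components include
# the W statistic and the p-value.
sw <- shapiro.test(log(freq_six))
sw$statistic
sw$p.value

# Compare the p-value with alpha to state the decision.
if (sw$p.value > alpha) {
  message("Do not reject the null hypothesis of normality")
} else {
  message("Reject the null hypothesis of normality")
}
```

Failing to reject the null hypothesis does not demonstrate normality; it only means that the sample provides no significant evidence against it.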