Chapter 3 | Exercise 2
Use the late Adam Kilgarriff’s web page http://www.kilgarriff.co.uk/BNClists/lemma.num to access the top twenty most frequent words in the British National Corpus. Save the selection as a text file.
Open the data in R.
Create a histogram, a density plot and a Q-Q plot for logarithmically transformed frequencies.
Use the Shapiro-Wilk normality test to see whether the log-transformed frequencies come from a normally distributed population.
Create a .txt file and open the data in R:
> bnc <- read.table("C:\\Your\\Directory\\bnc.txt")
> head(bnc)
  V1      V2  V3   V4
1  1 6187267 the  det
2  2 4239632  be    v
3  3 3093444  of prep
4  4 2687863 and conj
5  5 2186369   a  det
6  6 1924315  in prep
Note that the columns have no names. For convenience, you may want to create a vector freq with the frequencies shown in column 2:
> freq <- bnc[, 2]
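An alternative to indexing by position is to name the columns first. The sketch below rebuilds only the six rows printed by head(bnc), so that it runs on its own. The column names rank, freq, lemma and pos are assumptions chosen for illustration; the source file itself supplies no names.

```r
# First six rows as printed by head(bnc); the full selection has twenty rows.
bnc <- data.frame(
  V1 = 1:6,
  V2 = c(6187267, 4239632, 3093444, 2687863, 2186369, 1924315),
  V3 = c("the", "be", "of", "and", "a", "in"),
  V4 = c("det", "v", "prep", "conj", "det", "prep")
)

# Hypothetical column names (not provided by the source file).
names(bnc) <- c("rank", "freq", "lemma", "pos")

# Same vector as bnc[, 2], now selected by name.
freq <- bnc$freq
```

Selecting by name keeps the later calls readable and does not depend on the column order of the file.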
> hist(log(freq), main = "Histogram of log-transformed BNC frequencies")
> plot(density(log(freq)), main = "Density plot of log-transformed BNC frequencies")
> qqnorm(log(freq))
> qqline(log(freq))
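To compare the three plots at a glance, they can be drawn side by side in a single graphics window. The following is a minimal sketch, assuming only the six frequencies printed by head(bnc); the exercise itself uses all twenty. The names log_freq and old_par are introduced here for illustration.

```r
# First six frequencies as printed by head(bnc); the exercise uses all twenty.
freq <- c(6187267, 4239632, 3093444, 2687863, 2186369, 1924315)
log_freq <- log(freq)

# Split the device into one row of three panels, keeping the old settings.
old_par <- par(mfrow = c(1, 3))

hist(log_freq, main = "Histogram")
plot(density(log_freq), main = "Density plot")
qqnorm(log_freq)   # Q-Q plot against the normal distribution
qqline(log_freq)   # reference line added to the Q-Q plot

# Restore the previous layout.
par(old_par)
```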
> shapiro.test(log(freq))

	Shapiro-Wilk normality test

data:  log(freq)
W = 0.9104, p-value = 0.06474
The p-value (0.06474) is greater than 0.05, so the Shapiro-Wilk test does not allow us to reject the null hypothesis that the log-transformed frequencies come from a normally distributed population.
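The decision rule used above can be written out explicitly. This is only a sketch: normality_decision is a hypothetical helper, not part of R, and the significance level of 0.05 is the conventional threshold assumed in the text.

```r
# Hypothetical helper: compares a p-value with the significance level alpha.
normality_decision <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

# p-value reported by shapiro.test(log(freq)) above.
decision <- normality_decision(0.06474)
print(decision)  # [1] "fail to reject H0"
```

With the test object at hand, the p-value can be read directly from its p.value component, as in shapiro.test(log(freq))$p.value, rather than copied from the printed output.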