Chapter 10 | Exercise 4
Using the frequencies in Exercise 3, compute the log-likelihood scores for each collexeme. Which adjective has the highest log-likelihood score, and which has the lowest? The total number of words in COCA at the time of the query was 464 020 256.
Create the vectors with the frequencies b, c, d and expected a:
> b <- total - a
> c <- 28636 - a
> d <- 464020256 - a - b - c
> aExp <- (a + b)*(a + c)/(a + b + c + d)
Compute the log-likelihood ratio and add the information about the direction of the relationship:
> library(Rling)
> loglik <- LL.collostr(a, b, c, d)
> loglik1 <- ifelse(a<aExp, -loglik, loglik)
> names(loglik1) <- adj
> sort(loglik1, decreasing = TRUE)
       crazy        wrong      haywire        blank   unpunished
22406.529538  7498.978088  4056.346901  3560.834797  3214.194939
  undetected   stir-crazy        batty     hog-wild         sick
 3060.021742   231.245176   210.705804   207.630165     4.910525
The highest score belongs to crazy, and the lowest to sick.
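To see what goes into a single score, the computation can be sketched by hand in base R for one adjective. The two word frequencies below (o11 and wordTotal) are hypothetical values chosen for illustration, not taken from Exercise 3; only the construction frequency and the corpus size come from the solution above. The sketch assumes that LL.collostr returns the standard log-likelihood ratio (G2) of the 2 x 2 table, which is not verified here.

```r
# Hypothetical frequencies for one adjective (illustration only)
o11 <- 100                    # cell a: adjective in the construction (assumed)
wordTotal <- 5000             # adjective anywhere in the corpus (assumed)
cxTotal <- 28636              # construction frequency, as in the solution above
corpusSize <- 464020256       # COCA size, as given in the exercise

o12 <- wordTotal - o11                 # cell b: adjective outside the construction
o21 <- cxTotal - o11                   # cell c: other words in the construction
o22 <- corpusSize - o11 - o12 - o21    # cell d: everything else

# Observed 2 x 2 table and expected frequencies under independence
obs <- matrix(c(o11, o12, o21, o22), nrow = 2, byrow = TRUE)
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# G2 = 2 * sum(O * log(O / E)); empty cells contribute 0
G2 <- 2 * sum(ifelse(obs > 0, obs * log(obs / expd), 0))

# Negative sign for repulsion (observed a below expected a), as with loglik1
signedG2 <- if (obs[1, 1] < expd[1, 1]) -G2 else G2
signedG2
```

With these made-up numbers the observed frequency in the construction is far above the expected one (about 0.31), so the score is positive, which corresponds to attraction.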