Ch. 16 | Exercise 1

Case study
‘Oh baby, you’re so special!’

An important question in studies based on vector space models is whether it is preferable to use words (lemmas) or word forms. Sometimes different forms of the same word have very distinct distributional and semantic properties. This idea is expressed in the notion of inflectional islands used by Newman and Rice (2006): speakers need to learn the distributional properties of each form, and the higher the frequency and entrenchment of a particular form, the greater the chances that it will form an island.

This case study illustrates the idea. The data set babyinfant is a matrix for a bag-of-words vector space model. It contains nouns that co-occur with the target word forms baby, babies, infant and infants. The context window was five words to the left and five words to the right of the target word form. The data were taken from COCA (year 2012 only). Create a vector space model of the target word forms and interpret it, following the procedure below.

1. Explore the data set. How many rows (feature words) does it contain?

2. Compute the expected frequencies, Pointwise Mutual Information and Positive Pointwise Mutual Information for each of the four target word forms.

3. Combine the vectors represented by the PPMIs as rows in one table and compute the distances between the target word forms using the cosine similarity measure. Which word forms are the most similar distributionally? The least similar?

4. Create a hierarchical clustering model using Ward’s method and plot the tree.

5. Are the results surprising? Why (not)?

1.
> library(Rling)
> data(babyinfant)
> nrow(babyinfant)
[1] 1233

or

> dim(babyinfant)
[1] 1233    4

The data set contains 1233 feature words (rows) and four columns, one per target word form.

2.

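For each target word form, compute the expected frequencies first, then the PMI scores, and finally the PPMI scores by replacing all negative PMI values with 0. For baby: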

> exp.baby <- sum(babyinfant$Baby)*rowSums(babyinfant)/sum(babyinfant)
> PMI.baby <- log2(babyinfant$Baby/exp.baby)
> PPMI.baby <- ifelse(PMI.baby < 0, 0, PMI.baby)

Repeat for the remaining word forms.
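The same three steps can be applied to all columns at once rather than one word form at a time. The following is a minimal sketch in base R, assuming a small toy count matrix with invented numbers (it is not the babyinfant data, and the object names counts, expected, pmi and ppmi are introduced here for illustration only).

```r
# Toy co-occurrence counts (invented for illustration; NOT the babyinfant data).
# Rows are feature words, columns are target word forms.
counts <- matrix(c(10, 0, 2,
                    3, 6, 1,
                    0, 4, 8),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("w1", "w2", "w3"), c("a", "b", "c")))

# Expected frequencies: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# PMI = log2(observed / expected); cells with a zero count give -Inf
pmi <- log2(counts / expected)

# PPMI: every negative value (including -Inf) is replaced with 0
ppmi <- pmax(pmi, 0)

round(ppmi, 2)
```

Each column of expected is the same vector that the one-column formula above produces, so the result can be checked against the step-by-step version for any single word form.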

3.
> babyinfant.w <- rbind(PPMI.baby, PPMI.babies, PPMI.infant, PPMI.infants)
> babyinfant.dist <- as.dist(1 - cossim(babyinfant.w))
> babyinfant.dist
             PPMI.baby PPMI.babies PPMI.infant
PPMI.babies  0.9977191
PPMI.infant  0.9976828   0.9681316
PPMI.infants 0.9973640   0.9630300   0.9279164

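The values are cosine distances (1 minus the cosine similarity), so smaller values mean greater distributional similarity. The closest word forms are infant and infants (0.9279164). The farthest are baby and babies (0.9977191). Note that baby is almost equally distant from all three other forms (about 0.997).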
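To make the distance values easier to interpret, cosine similarity can also be computed by hand. The sketch below uses base R only and assumes that cossim() returns the pairwise cosine similarities between the rows of its argument, as its use above suggests. The helper cosine_rows() and the toy vectors are hypothetical and serve only as an illustration.

```r
# Cosine similarity between the rows of a numeric matrix (base R only).
# cosine_rows() is a hypothetical helper, not a function from Rling.
cosine_rows <- function(m) {
  norms <- sqrt(rowSums(m^2))            # Euclidean length of each row vector
  (m %*% t(m)) / outer(norms, norms)     # dot products divided by length products
}

# Toy PPMI-like vectors (invented numbers), one row per word form
toy <- rbind(x = c(1, 0, 2),
             y = c(2, 0, 4),
             z = c(0, 3, 0))

sim <- cosine_rows(toy)
toy.dist <- as.dist(1 - sim)   # turn similarities into distances
toy.dist
```

In this toy example x and y point in the same direction, so their distance is 0, while x and z have no non-zero dimensions in common, so their distance is 1. Because PPMI vectors contain no negative values, the distances always fall between 0 and 1, and values close to 1, as in the table above, indicate very little contextual overlap.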

4.
> babyinfant.hc <- hclust(babyinfant.dist, method = "ward.D2")
> plot(babyinfant.hc)

In the dendrogram, infant and infants are merged first, babies joins them next, and baby is merged last, so it stands apart from the other forms.
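The clustering can be checked without recomputing the PPMI vectors, because the six distances printed in step 3 are enough to rebuild the distance object. The following is a minimal sketch; the shortened labels and the object names d, hc and groups are introduced here for illustration only.

```r
# Rebuild the distance object from the six values printed in step 3
forms <- c("baby", "babies", "infant", "infants")
d <- matrix(0, nrow = 4, ncol = 4, dimnames = list(forms, forms))

# lower.tri() fills column by column:
# baby-babies, baby-infant, baby-infants, babies-infant, babies-infants, infant-infants
d[lower.tri(d)] <- c(0.9977191, 0.9976828, 0.9973640,
                     0.9681316, 0.9630300, 0.9279164)
d <- as.dist(d)   # as.dist() reads the lower triangle

# Same clustering call as above
hc <- hclust(d, method = "ward.D2")
hc$merge          # negative numbers are single word forms, positive ones earlier merges

# Cut the tree into two clusters and inspect the membership
groups <- cutree(hc, k = 2)
groups
```

Cutting the tree into two clusters should leave baby in a cluster of its own, with babies, infant and infants grouped together.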

5.

The results are not surprising. Unlike the other target word forms, the singular form baby is also used as a term of address expressing affection, to express small size (e.g. baby carrot) and in other figurative meanings (e.g. baby face). These uses give it a distributional profile of its own.