Chapter 15 | Exercise 1
‘English plus Dutch’
Compare the distributional properties of English and Dutch analytic causatives. The Dutch causatives are constructions with auxiliaries doen “do” and laten “let” and a bare infinitive. The data can be found in the data set caus_Dutch in the package Rling. The data set contains fifty observations with the causative auxiliary doen and fifty observations with laten. The variables are the same as in the case study of the English causatives.
Create the Behavioural Profiles of these two Dutch constructions and add them to the matrix with the profiles of the English causatives.
Create a matrix of distances between all causative constructions using the Canberra distance. Which English construction is the most similar to doen? To laten? With the help of the Ward clustering method, create a joint dendrogram of English and Dutch causatives.
Try different combinations of distance metrics (Euclidean and Manhattan) and clustering methods (average and complete). Do the results remain similar?
Find the optimal number of clusters on the basis of the average silhouette width and add the corresponding rectangles to the dendrogram.
Check if the clusters are supported by the data with the help of multiscale bootstrap resampling.
Find which features are the most distinctive of the Dutch causatives in comparison with the English ones. Make a snake plot.
Do the results of hierarchical clustering converge with partitioning around medoids? Choose the best clustering solution according to the average silhouette width.
You will need the object caus.bp, which is created in Chapter 15.
> library(Rling)
> data(caus_Dutch)
> doen_V <- caus_Dutch[caus_Dutch$Cx == "doen_V", -1]
> doen_V.bp <- bp(doen_V)
> laten_V <- caus_Dutch[caus_Dutch$Cx == "laten_V", -1]
> laten_V.bp <- bp(laten_V)
> caus1.bp <- rbind(caus.bp, doen_V.bp, laten_V.bp)
> caus1.dist <- dist(caus1.bp, method = "canberra")
The smallest distances, and therefore the greatest similarity, are between doen_V and make_V (4.05) and between laten_V and have_V (4.06).
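To show how the nearest neighbour of each construction can be read off a distance object, here is a minimal sketch on a made-up matrix of profiles. The object toy.bp and its row names cx_a to cx_d are hypothetical stand-ins and do not come from caus1.bp.

```r
# Hypothetical toy profiles (NOT the caus1.bp data): 4 constructions x 3 features
toy.bp <- rbind(
  cx_a = c(0.8, 0.1, 0.1),
  cx_b = c(0.7, 0.2, 0.1),
  cx_c = c(0.1, 0.1, 0.8),
  cx_d = c(0.2, 0.1, 0.7)
)
toy.dist <- dist(toy.bp, method = "canberra")

# Turn the dist object into a square matrix and mask the zero self-distances
toy.m <- as.matrix(toy.dist)
diag(toy.m) <- NA

# For every row, pick the column with the smallest distance
nearest <- colnames(toy.m)[apply(toy.m, 1, which.min)]
names(nearest) <- rownames(toy.m)
nearest
```

Applied to caus1.dist instead of toy.dist, the same steps should return the nearest neighbours of all constructions, including the rows for doen_V and laten_V.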
> caus1.hc <- hclust(caus1.dist, method = "ward.D2")
> plot(caus1.hc)
Try computing different distance metrics:
> test.dist <- dist(caus1.bp, method = "canberra")
> test.dist <- dist(caus1.bp) #Euclidean distance, the default
> test.dist <- dist(caus1.bp, method = "manhattan")
Try different clustering methods:
> test.hc <- hclust(test.dist, method = "ward.D2") #the old method name "ward" is now "ward.D"
> test.hc <- hclust(test.dist, method = "average")
> test.hc <- hclust(test.dist, method = "complete")
> test.hc <- hclust(test.dist, method = "single")
Plot a solution:
> plot(test.hc)
Overall, doen and laten are almost always clustered together, except for the single method in combination with the Euclidean and Canberra distances. The Dutch auxiliaries also tend to be clustered together with cause_toV, make_V and have_Ving.
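The combinations can also be checked in one pass rather than one at a time. The sketch below uses a hypothetical toy matrix (toy.bp, with invented row names) and one crude operationalization of "clustered together": whether two items fall into the same cluster when the tree is cut into two groups. This criterion is an assumption made for illustration, not the procedure behind the observations above.

```r
# Hypothetical toy profiles (NOT the caus1.bp data)
toy.bp <- rbind(
  cx_a = c(0.8, 0.1, 0.1),
  cx_b = c(0.7, 0.2, 0.1),
  cx_c = c(0.1, 0.1, 0.8),
  cx_d = c(0.2, 0.1, 0.7)
)

metrics  <- c("canberra", "euclidean", "manhattan")
linkages <- c("ward.D2", "average", "complete", "single")

# One row per combination of distance metric and clustering method
results <- expand.grid(metric = metrics, linkage = linkages,
                       stringsAsFactors = FALSE)

# For each combination: are cx_a and cx_b in the same cluster at k = 2?
results$together <- mapply(function(m, l) {
  hc <- hclust(dist(toy.bp, method = m), method = l)
  cl <- cutree(hc, k = 2)
  cl[["cx_a"]] == cl[["cx_b"]]
}, results$metric, results$linkage, USE.NAMES = FALSE)

results
```

A table of this kind is only a summary; inspecting the individual dendrograms with plot() remains necessary to see which other constructions join a cluster.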
> library(cluster)
> test.clust <- cutree(caus1.hc, k = 2)
> summary(silhouette(test.clust, caus1.dist))$avg.width
[1] 0.2599943
Repeat the steps for k from 3 to 10. The maximum average silhouette width is 0.33 for k = 5.
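The repetition over k can be automated with sapply(). The following sketch assumes a hypothetical toy matrix (toy.bp) with three well-separated pairs of rows, so it illustrates the procedure only and does not reproduce the values reported for the causatives.

```r
library(cluster)

# Hypothetical toy profiles (NOT the caus1.bp data): three tight pairs
toy.bp <- rbind(
  cx_a = c(0.8, 0.1, 0.1),
  cx_b = c(0.7, 0.2, 0.1),
  cx_c = c(0.1, 0.1, 0.8),
  cx_d = c(0.2, 0.1, 0.7),
  cx_e = c(0.1, 0.8, 0.1),
  cx_f = c(0.1, 0.7, 0.2)
)
toy.dist <- dist(toy.bp, method = "canberra")
toy.hc <- hclust(toy.dist, method = "ward.D2")

# Average silhouette width for every k from 2 to n - 1
ks <- 2:(nrow(toy.bp) - 1)
asw <- sapply(ks, function(k) {
  summary(silhouette(cutree(toy.hc, k = k), toy.dist))$avg.width
})
names(asw) <- ks

# The k with the largest average silhouette width
best.k <- ks[which.max(asw)]
asw
best.k
```

With the real data one would replace toy.hc and toy.dist by caus1.hc and caus1.dist and set ks to 2:10.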
> rect.hclust(caus1.hc, k = 5)
> library(pvclust)
> caus1.pvc <- pvclust(t(caus1.bp), method.hclust = "ward.D2", method.dist = "canberra")
> plot(caus1.pvc)
The results may vary slightly every time you perform the procedure, but the only cluster that comes close to 95% stability is the one formed by have_Ved and get_Ved.
> c1 <- caus1.bp[10:11,] #the two Dutch causatives
> c2 <- caus1.bp[1:9,] #the nine English causatives
> c1.bp <- colMeans(c1)
> c2.bp <- colMeans(c2)
> diff <- c1.bp - c2.bp
> plot(sort(diff)*1.2, 1:length(diff), type = "n", xlab = "English <---> Dutch", yaxt = "n", ylab = "")
> text(sort(diff), 1:length(diff), names(sort(diff)), cex = 1.5)
One can see from the snake plot that the most remarkable difference is the higher proportion of mental caused events in the Dutch data. Another interesting difference is that there are more inanimate Causers. This type of causation with inanimate Causers and mental caused events is in fact called affective causation and can be exemplified by the following construction with doen:
(1) Je kapsel doet me denken aan een vogelnest.
“Your hairstyle makes me think of a bird's nest”.
However, this is only typical of doen. For laten, the high predominance of mental caused events is explained by the fact that a conventional way of saying “to show” in Dutch is laten zien, lit. “let see”.
> test.clust <- pam(caus1.dist, 2)
> test.clust$silinfo$avg.width
[1] 0.1909347
> test.clust <- pam(caus1.dist, 3)
> test.clust$silinfo$avg.width
[1] 0.2292607
… etc.
The maximum average silhouette width is observed when the number of clusters is five.
> test.clust <- pam(caus1.dist, 5)
> test.clust$clustering
     make_V.bp be_made_toV.bp   cause_toV.bp     get_toV.bp
             1              2              1              2
    get_Ved.bp      have_V.bp    have_Ved.bp   have_Ving.bp    get_Ving.bp
             3              2              3              1              4
     doen_V.bp     laten_V.bp
             5              5
The clustering is identical to the solution produced by the hierarchical clustering method.
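Whether two solutions are identical can be checked by cross-tabulating the cluster memberships, because the cluster numbers themselves are arbitrary and may differ between methods. The sketch below is a minimal illustration on a hypothetical toy matrix (toy.bp); the object names hc.cl, pam.cl and same.partition are introduced here for the example only.

```r
library(cluster)

# Hypothetical toy profiles (NOT the caus1.bp data): three tight pairs
toy.bp <- rbind(
  cx_a = c(0.8, 0.1, 0.1),
  cx_b = c(0.7, 0.2, 0.1),
  cx_c = c(0.1, 0.1, 0.8),
  cx_d = c(0.2, 0.1, 0.7),
  cx_e = c(0.1, 0.8, 0.1),
  cx_f = c(0.1, 0.7, 0.2)
)
toy.dist <- dist(toy.bp, method = "canberra")

# Memberships from hierarchical clustering and from PAM, same k
hc.cl  <- cutree(hclust(toy.dist, method = "ward.D2"), k = 3)
pam.cl <- pam(toy.dist, k = 3)$clustering

# Cross-tabulate the two partitions
xtab <- table(hierarchical = hc.cl, pam = pam.cl)
xtab

# The partitions agree (up to relabelling) if every row and every
# column of the table has exactly one non-empty cell
same.partition <- all(rowSums(xtab > 0) == 1) && all(colSums(xtab > 0) == 1)
same.partition
```

For the causatives, the corresponding comparison would presumably be between cutree(caus1.hc, k = 5) and test.clust$clustering.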