Chapter 11 | Exercise 1
‘Some like(d) it hot: A distinctive collexeme analysis of adjective hot in the 19th and 20th centuries’
Distinctive collexeme analysis can help one compare synchronic or diachronic variants. The package Rling contains two data sets, hot_old and hot_new. The former contains nouns that were used in the construction hot + N during the period from 1810 to 1909, whereas the latter contains the data from 1910 to 2009. The source of data is COHA bigrams (Davies 2011).
Explore the data sets. What is the number of tokens of the construction in the old and new data? What is the most frequent noun in each data set?
Merge the data, so that you have a list of all nouns that occur in at least one file, and replace the NA values with zeros. How many collexemes does the data frame contain?
Perform a distinctive collexeme analysis of hot + N in the earlier and later periods by using log-transformed Fisher’s exact test p-values. Examine the top twenty collexemes that are most distinctive of the earlier period, and the top collexemes of the second period. How can you interpret the differences?
Look at the entire list of collexemes with the distinctiveness scores that correspond to untransformed p-values smaller than 0.05. Can you make any additional observations?
Load the data sets and compute the sums:
> library(Rling)
> data(hot_old)
> data(hot_new)
> sum(hot_old$Old)
[1] 5678
> sum(hot_new$New)
[1] 14850
The earlier period is represented by 5678 tokens of the construction, whereas the later period is represented by 14850 tokens.
> hot_old[hot_old$Old == max(hot_old$Old),]
Noun Old
512 water 628
> hot_new[hot_new$New == max(hot_new$New),]
Noun New
767 water 1544
The noun water is the most frequent collexeme in both periods.
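The same lookup can also be written with which.max() or order(). The sketch below is only an illustration on a miniature data frame: old_toy is a hypothetical object, filled with three counts taken from the output shown in this solution.

```r
# Hypothetical toy data in the same layout as hot_old
# (the three counts are taken from the output in this solution)
old_toy <- data.frame(Noun = c("blood", "water", "haste"),
                      Old = c(200, 628, 114))

# Row with the highest frequency (the first one, if there are ties)
old_toy[which.max(old_toy$Old), ]

# Full ranking, most frequent noun first
old_toy[order(-old_toy$Old), ]
```

Unlike the comparison with max(), which.max() returns only one row when several nouns share the top frequency.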
> hot <- merge(hot_old, hot_new, by = "Noun", all = TRUE)
> hot[is.na(hot)] <- 0
> nrow(hot)
[1] 828
The resulting data frame contains 828 collexemes.
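The effect of all = TRUE and of the NA replacement can be checked on a toy example. The sketch below uses two invented miniature data frames (old_toy and new_toy are hypothetical, and their nouns and counts are made up) with the same column layout as hot_old and hot_new.

```r
# Hypothetical toy data; nouns and counts are invented for illustration
old_toy <- data.frame(Noun = c("ember", "water"), Old = c(12, 60))
new_toy <- data.frame(Noun = c("gadget", "water"), New = c(30, 150))

# all = TRUE keeps nouns that occur in only one of the two data frames
toy <- merge(old_toy, new_toy, by = "Noun", all = TRUE)

# A noun missing from one period gets NA there; replace NA with a zero count
toy[is.na(toy)] <- 0
toy
```

Without all = TRUE, merge() would keep only the nouns attested in both periods, and the nouns that are exclusive to one period would be lost.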
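First, create vectors with frequencies a, b, c and d from Table 11.1 in Chapter 11, together with the expected frequency of a (aExp), which is needed later to determine the direction of association.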
> a <- hot$Old
> b <- hot$New
> c <- sum(hot$Old) - a
> d <- sum(hot$New) - b
> aExp <- (a + b)*(a + c)/(a + b + c + d)
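For a single noun, the four frequencies form a 2 × 2 table, and aExp is the frequency expected in cell a if the noun were distributed proportionally across the two periods. A minimal sketch for haste, using its counts (114 and 17) and the period totals (5678 and 14850) given in this solution; the object names tab and a_exp are illustrative.

```r
# Counts for "haste" and the period totals, as reported in this solution
a <- 114            # haste in the earlier period
b <- 17             # haste in the later period
c <- 5678 - a       # all other nouns, earlier period
d <- 14850 - b      # all other nouns, later period

# 2 x 2 table: rows = haste vs. other nouns, columns = periods
tab <- matrix(c(a, c, b, d), nrow = 2,
              dimnames = list(c("haste", "other"), c("Old", "New")))

# Expected frequency of cell a = row total * column total / grand total
a_exp <- sum(tab["haste", ]) * sum(tab[, "Old"]) / sum(tab)
a_exp

# TRUE means the noun is more frequent in the earlier period than expected
a > a_exp
```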
Next, compute the p-values and transform them logarithmically, taking into account the direction of association:
> pvF <- pv.Fisher.collostr(a, b, c, d)
> logpvF <- ifelse(a<aExp, log10(pvF), -log10(pvF))
> hot$logp <- logpvF
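For one noun, the same kind of score can be sketched with fisher.test() from base R. This assumes that pv.Fisher.collostr() returns the two-sided Fisher exact p-value for each 2 × 2 table, which is an assumption and is not checked here; the counts are those of haste.

```r
# Counts for "haste" and the period totals reported in this solution
a <- 114; b <- 17
c <- 5678 - a; d <- 14850 - b

# Two-sided Fisher exact test on the 2 x 2 table (stats package, base R)
p <- fisher.test(matrix(c(a, c, b, d), nrow = 2))$p.value

# Expected frequency of a under independence
a_exp <- (a + b) * (a + c) / (a + b + c + d)

# Positive score: attracted to the earlier period; negative: to the later one
score <- ifelse(a < a_exp, log10(p), -log10(p))
score
```

If the assumption holds, the score should be close to the logp value reported for haste in this solution.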
Finally, investigate the top distinctive collexemes of both periods. First, inspect the top twenty nouns of the earlier period:
> hot <- hot[order(-hot$logp),]
> hot[1:20,]
Noun Old New logp
223 haste 114 17 45.343078
42 blood 200 126 36.729132
472 tears 150 83 31.044482
139 dish 66 23 19.032161
518 weather 246 294 18.615976
178 fire 90 52 18.217294
457 supper 57 24 14.815404
82 cheek 27 8 8.789374
531 work 35 18 8.187216
313 oven 34 17 8.131309
430 steam 24 7 7.924053
95 coals 76 84 6.999201
69 brow 24 10 6.624686
90 climates 28 15 6.477057
80 chase 17 4 6.242693
243 iron 110 152 6.197454
89 climate 34 25 5.855295
21 atmosphere 13 2 5.496178
356 pursuit 103 147 5.353176
194 forehead 36 31 5.154353
Below are the top collexemes of the more recent period:
> hot <- hot[order(hot$logp),]
> hot[1:20,]
Noun Old New logp
145 dogs 1 428 -58.581932
144 dog 3 346 -43.832331
783 spots 0 157 -21.996686
85 chocolate 13 232 -18.613889
497 tub 2 154 -18.445749
678 line 0 110 -15.345013
388 sauce 1 113 -14.206733
425 spot 1 109 -13.539281
769 shower 0 91 -12.525908
186 flashes 2 101 -11.447787
759 seat 0 67 -9.335859
709 pants 0 62 -8.470776
27 bath 33 230 -8.423899
311 oil 6 90 -6.651835
339 plate 10 107 -6.396065
328 pepper 1 51 -5.876450
347 potato 5 75 -5.693748
190 food 8 91 -5.633469
444 stuff 6 77 -5.240822
628 gas 0 37 -5.056945
Some differences reflect apparent changes in lifestyle, food and communications (cf. hot oven, hot coals, hot iron in the earlier period, and hot dog, hot spot, hot chocolate, hot tub, hot shower, hot pants in the more modern texts). The earlier data also contains a few nouns that refer to body fluids and body parts: hot blood, tears, cheek, brow, forehead. These expressions describe symptoms of an agitated emotional state, a usage that seems to be less typical of contemporary written texts. In the more recent data, there is a whole set of new meanings, e.g. “direct”, as in hot line, “trendy”, as in hot stuff, and “spicy” (hot food, hot pepper, hot sauce).
The distinctiveness scores that correspond to p-values smaller than 0.05 are either lower than –1.3 or higher than 1.3:
> log10(0.05)
[1] -1.30103
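Because the scores are signed log10 p-values, a score can be mapped back to its p-value by dropping the sign. A small sketch with invented scores (the vector logp_toy is hypothetical and does not come from the hot data):

```r
# Hypothetical signed scores, for illustration only
logp_toy <- c(45.3, 1.5, 0.8, -1.1, -2.4, -21.9)

# Recover the untransformed p-values: the sign only encodes the period
p_toy <- 10^(-abs(logp_toy))

# p < 0.05 is equivalent to |score| > -log10(0.05)
threshold <- -log10(0.05)
significant <- abs(logp_toy) > threshold

data.frame(logp_toy, p_toy, significant)
```

Scores between the two cut-off values, such as 0.8 and -1.1 here, correspond to p-values above 0.05 and are therefore not treated as distinctive of either period.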
Below is code that can be used to retrieve the full lists (the collexeme lists are not given due to space limitations). To obtain all distinctive collexemes of hot in the earlier period, one can type in the following command:
> hot[hot$logp > 1.3,]
[output omitted]
To obtain the distinctive nouns for the later period, one can use the following code:
> hot[hot$logp<(-1.3),]
[output omitted]
Some observations can be made: