Ch. 11 | Exercise 1

Case study
Some like(d) it hot: A distinctive collexeme analysis of the adjective hot in the 19th and 20th centuries

Distinctive collexeme analysis can help one compare synchronic or diachronic variants. The package Rling contains two data sets, hot_old and hot_new. The former contains nouns that were used in the construction hot + N during the period from 1810 to 1909, whereas the latter contains the data from 1910 to 2009. The source of data is COHA bigrams (Davies 2011).

1.

Explore the data sets. What is the number of tokens of the construction in the old and new data? What is the most frequent noun in each data set?

2.

Merge the data, so that you have a list of all nouns that occur in at least one file, and replace the NA values with zeros. How many collexemes does the data frame contain?

3.

Perform a distinctive collexeme analysis of hot + N in the earlier and later periods by using log-transformed Fisher’s exact test p-values. Examine the top twenty collexemes that are most distinctive of the earlier period, and the top collexemes of the second period. How can you interpret the differences?

4.

Look at the entire list of collexemes with the distinctiveness scores that correspond to untransformed p-values smaller than 0.05. Can you make any additional observations?

1.

Load the data sets and compute the sums:

> library(Rling)
> data(hot_old)
> data(hot_new)
> sum(hot_old$Old)
[1] 5678
> sum(hot_new$New)
[1] 14850

The earlier period is represented by 5678 tokens of the construction, whereas the later period is represented by 14850 tokens.

> hot_old[hot_old$Old == max(hot_old$Old),]
     Noun Old
512 water 628
> hot_new[hot_new$New == max(hot_new$New),]
     Noun  New
767 water 1544

The noun water is the most frequent collexeme in both periods.
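The same exploration can be sketched on a small stand-in data frame. The object toy_old and its counts are invented for illustration; only the column layout (Noun, Old) follows hot_old. Note that which.max() returns only the first maximum, so the comparison with max() used above is the safer choice if ties are possible.

```r
# Toy stand-in for hot_old (counts invented for illustration)
toy_old <- data.frame(Noun = c("blood", "haste", "water"),
                      Old  = c(20, 11, 63),
                      stringsAsFactors = FALSE)

# Total number of tokens of the construction
n_tokens <- sum(toy_old$Old)

# Row with the most frequent noun (first maximum only)
top_noun <- toy_old[which.max(toy_old$Old), ]

# Nouns ranked by frequency, most frequent first
ranked <- toy_old[order(-toy_old$Old), ]
head(ranked, 3)
```

Sorting with order() also makes it easy to inspect more than one frequent noun at a time.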

2.

> hot <- merge(hot_old, hot_new, by = "Noun", all = TRUE)
> hot[is.na(hot)] <- 0
> nrow(hot)
[1] 828

The resulting data frame contains 828 collexemes.
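To see what all = TRUE does, here is a minimal sketch with two invented data frames (old and new are hypothetical names with made-up counts). Nouns that are missing from one period receive NA after the merge, and these NA values are then replaced with zeros.

```r
# Two toy frequency lists (counts invented for illustration)
old <- data.frame(Noun = c("blood", "haste", "water"), Old = c(20, 11, 63))
new <- data.frame(Noun = c("dog", "water"), New = c(35, 154))

# Full outer join: keep every noun that occurs in at least one list
merged <- merge(old, new, by = "Noun", all = TRUE)

# A noun absent from one period has NA there; treat it as zero occurrences
merged[is.na(merged)] <- 0
merged
```

Without all = TRUE, merge() would keep only the nouns attested in both periods (here, only water), and the period-specific collexemes would be lost.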

3.

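First, create vectors with frequencies a, b, c and d from Table 11.1 in Chapter 11, as well as the expected frequency of a (aExp), which is needed later to determine the direction of association.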

> a <- hot$Old
> b <- hot$New
> c <- sum(hot$Old) - a
> d <- sum(hot$New) - b
> aExp <- (a + b)*(a + c)/(a + b + c + d)

Next, compute the p-values and transform them logarithmically, taking into account the direction of association:

> pvF <- pv.Fisher.collostr(a, b, c, d)
> logpvF <- ifelse(a<aExp, log10(pvF), -log10(pvF))
> hot$logp <- logpvF
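The logic of the steps above can be sketched with base R only. This is not the Rling implementation: the helper fisher_one_noun is hypothetical, the frequencies are invented, and the use of a one-sided test in the direction of the observed deviation is an assumption, so the resulting scores need not match those of pv.Fisher.collostr exactly.

```r
# Toy frequencies, invented for illustration (not the COHA counts)
toy <- data.frame(Noun = c("haste", "dog", "water"),
                  Old  = c(30, 1, 60),
                  New  = c(5, 40, 120))

# Cells of the 2 x 2 table for each noun, named as in the text
a <- toy$Old               # the noun in the earlier period
b <- toy$New               # the noun in the later period
c <- sum(toy$Old) - a      # all other nouns, earlier period
d <- sum(toy$New) - b      # all other nouns, later period
aExp <- (a + b) * (a + c) / (a + b + c + d)

# Hypothetical helper: Fisher's exact test for one noun,
# one-sided in the direction of the observed deviation (an assumption)
fisher_one_noun <- function(a, b, c, d, aExp) {
  # rows: noun / other nouns; columns: Old / New
  tab <- matrix(c(a, c, b, d), nrow = 2)
  alt <- if (a < aExp) "less" else "greater"
  fisher.test(tab, alternative = alt)$p.value
}

pv <- mapply(fisher_one_noun, a, b, c, d, aExp)

# Signed score: positive = distinctive of Old, negative = distinctive of New
toy$logp <- ifelse(a < aExp, log10(pv), -log10(pv))
toy[order(-toy$logp), ]
```

The sign carries the direction of association and the absolute value carries its strength, which is why a single column can be sorted in either direction to obtain the distinctive collexemes of each period.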

Finally, investigate the top distinctive collexemes of both periods. First, inspect the top twenty nouns of the earlier period:

> hot <- hot[order(-hot$logp),]
> hot[1:20,]
          Noun Old New      logp
223      haste 114  17 45.343078
42       blood 200 126 36.729132
472      tears 150  83 31.044482
139       dish  66  23 19.032161
518    weather 246 294 18.615976
178       fire  90  52 18.217294
457     supper  57  24 14.815404
82       cheek  27   8  8.789374
531       work  35  18  8.187216
313       oven  34  17  8.131309
430      steam  24   7  7.924053
95       coals  76  84  6.999201
69        brow  24  10  6.624686
90    climates  28  15  6.477057
80       chase  17   4  6.242693
243       iron 110 152  6.197454
89     climate  34  25  5.855295
21  atmosphere  13   2  5.496178
356    pursuit 103 147  5.353176
194   forehead  36  31  5.154353

Below are the top collexemes of the more recent period:

> hot <- hot[order(hot$logp),]
> hot[1:20,]
         Noun Old New       logp
145      dogs   1 428 -58.581932
144       dog   3 346 -43.832331
783     spots   0 157 -21.996686
85  chocolate  13 232 -18.613889
497       tub   2 154 -18.445749
678      line   0 110 -15.345013
388     sauce   1 113 -14.206733
425      spot   1 109 -13.539281
769    shower   0  91 -12.525908
186   flashes   2 101 -11.447787
759      seat   0  67  -9.335859
709     pants   0  62  -8.470776
27       bath  33 230  -8.423899
311       oil   6  90  -6.651835
339     plate  10 107  -6.396065
328    pepper   1  51  -5.876450
347    potato   5  75  -5.693748
190      food   8  91  -5.633469
444     stuff   6  77  -5.240822
628       gas   0  37  -5.056945

Some differences reflect apparent changes in lifestyle, food and communications (cf. hot oven, hot coals, hot iron in the earlier period, and hot dog, hot spot, hot chocolate, hot tub, hot shower, hot pants in the more modern texts). The earlier data also contains several nouns that refer to body fluids and body parts: hot blood, tears, cheek, brow, forehead. These expressions describe symptoms of an agitated emotional state, a usage that seems to be less typical of contemporary written texts. The more recent data shows a whole set of new meanings, e.g. “direct”, as in hot line, “trendy”, as in hot stuff, and “spicy” (hot food, hot pepper, hot sauce).

4.

Because the scores are signed log10-transformed p-values, the distinctiveness scores that correspond to p-values smaller than 0.05 are either lower than -1.3 or higher than 1.3:

> log10(0.05)
[1] -1.30103

Below is code that can be used to retrieve the full lists (the collexeme lists are not given due to space limitations). To obtain all distinctive collexemes of hot in the earlier period, one can type in the following command:

> hot[hot$logp > 1.3,]
[output omitted]

To obtain the distinctive nouns for the later period, one can use the following code:

> hot[hot$logp < (-1.3),]
[output omitted]
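The relation between the cut-off and the original p-values can be sketched as follows. The data frame toy and its scores are illustrative (the entry for day is invented), and the names alpha, threshold and p_back are introduced here only for the example.

```r
# Significance level and the corresponding cut-off on the log10 scale
alpha <- 0.05
threshold <- -log10(alpha)   # about 1.30103

# Toy scores for illustration only
toy <- data.frame(Noun = c("haste", "climate", "day", "gas", "dog"),
                  logp = c(45.3, 5.9, 0.4, -5.1, -43.8))

# Distinctive of the earlier period: large positive scores
old_distinct <- toy[toy$logp > threshold, ]

# Distinctive of the later period: large negative scores
new_distinct <- toy[toy$logp < -threshold, ]

# Back-transformation: the sign is dropped, the magnitude gives the p-value
p_back <- 10^(-abs(toy$logp))
```

Deriving the threshold from alpha, rather than typing 1.3 by hand, makes it straightforward to rerun the selection with a stricter level such as 0.01.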

Some observations can be made: