Most existing measures of lexical diversity are either direct or indirect measures of the proportion of repeated words in a language sample, and they tend to be validated in accordance with how well they avoid sample-size effects and/or how strongly they correlate with measures of knowledge and proficiency. The present paper argues that such measures suffer from a lack of construct validity in two ways: (a) they are not grounded in an adequate or clearly articulated theoretical account of the nature of the construct of lexical diversity, and (b) they are not validated in relation to how well they measure lexical diversity itself, but rather in relation to how well they do or do not correlate with other constructs. The present paper proposes solutions to both of these problems by defining lexical diversity as a perception-based phenomenon with six measurable properties, and by calibrating the six objective properties against human judgments of lexical diversity. The purpose of the empirical portion of the paper is to determine how well a statistical model constructed on the basis of the proposed six objective properties is able to account for nine human raters’ judgments of the lexical diversity found in 50 narratives written by adolescent learners and native speakers of English. The results support the proposed six-dimensional construct of lexical diversity, but also suggest the need for further refinements to how the six properties are measured.
Issues of lexical diversity assessment have so far been addressed only with consideration of the approach, rather than the corpus. Of necessity, intrinsic issues of lexical diversity related to the approach needed to be addressed first; however, given that they have now received due attention in recent research, it is time to turn our attention to extrinsic issues of lexical diversity, that is, the assessment of how variations in texts and corpora affect the results of the approach. The focus of intrinsic issues has been on the algorithms and approaches used to produce values of lexical diversity on laboratory-like data sets. With extrinsic issues, the focus moves to more naturalistic data sets containing texts that vary widely in size, quality, and length. For such data, indices of lexical diversity are required to demonstrate ecological validity. The degree to which an index of lexical diversity exhibits ecological validity is of considerable importance to the field of second language learning because naturalistic corpora vary considerably in size, and texts within those corpora vary considerably in word count. In other words, ecological validity is a necessary element of the construct validity of lexical diversity. In this study, we assess the three primary indices of lexical diversity (MTLD, HD-D, and Maas) using a corpus of naturalistic data in order to evaluate extrinsic issues of lexical diversity assessment by way of ecological validation. Our results show that MTLD appears to be the strongest index and Maas the weakest. Our conclusion, while encouraging broader research, is that the Maas index should be abandoned as a lexical diversity index because of its over-sensitivity to word count. By contrast, MTLD appears resilient to a wide range of extrinsic factors and is consequently recommended for future lexical diversity studies.
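The MTLD index evaluated above can be sketched as follows. This is a minimal illustration rather than the authors' reference implementation; it assumes the commonly cited default TTR threshold of 0.72 and pre-tokenized input.

```python
def mtld_forward(tokens, ttr_threshold=0.72):
    """One directional pass of MTLD: count 'factors', i.e. stretches of
    text over which the running type-token ratio stays above the threshold."""
    factors = 0.0
    types = set()
    token_count = 0
    for tok in tokens:
        types.add(tok.lower())
        token_count += 1
        if len(types) / token_count <= ttr_threshold:
            factors += 1          # TTR fell to the threshold: close a factor
            types.clear()
            token_count = 0
    if token_count > 0:           # credit the partial factor at the end
        ttr = len(types) / token_count
        factors += (1 - ttr) / (1 - ttr_threshold)
    if factors == 0:
        # Degenerate case (TTR never dropped); one common convention is to
        # fall back to the text length.
        return float(len(tokens))
    return len(tokens) / factors

def mtld(tokens, ttr_threshold=0.72):
    """Bidirectional MTLD: mean of the forward and reverse passes."""
    return (mtld_forward(tokens, ttr_threshold)
            + mtld_forward(list(reversed(tokens)), ttr_threshold)) / 2
```

Because each factor is a stretch of text long enough to pull the running type-token ratio down to the threshold, repetitive texts accumulate factors quickly and receive low MTLD scores, while diverse texts accumulate few factors and receive high scores, largely independent of overall text length.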
In this study, two new measures of lexical diversity are tested for the first time on French. The usefulness of these measures, MTLD (McCarthy and Jarvis 2010, and this volume) and HD-D (McCarthy and Jarvis 2007), in predicting different aspects of language proficiency is assessed and compared with D (Malvern and Richards 1997; Malvern, Richards, Chipere and Durán 2004) and Maas (1972) in analyses of stories told by two groups of learners (n = 41) at two different proficiency levels and one group of native speakers of French (n = 23). The importance of careful lemmatization in studies of lexical diversity involving highly inflected languages is also demonstrated. The paper shows that the measures of lexical diversity under study are valid proxies for language ability in that they explain up to 62 percent of the variance in French C-test scores and up to 33 percent of the variance in a measure of complexity. The paper also provides evidence that dependence on segment size continues to be a problem for the measures of lexical diversity discussed here. The paper concludes that limiting the range of text lengths, or even keeping text length constant, is the safest option in analysing lexical diversity.
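The effect of lemmatization on diversity scores in an inflected language can be illustrated with a toy example. The lemma map below is a hypothetical, hand-built stand-in for a real lemmatizer; it simply shows how counting surface forms inflates the type-token ratio relative to counting lemmas.

```python
def ttr(tokens):
    """Plain type-token ratio: distinct forms over total tokens."""
    return len(set(tokens)) / len(tokens)

# Hypothetical lemma map for a few French inflected forms (illustrative only;
# real studies would use a full lemmatizer).
LEMMAS = {
    "mange": "manger", "manges": "manger", "mangeons": "manger",
    "mangé": "manger", "chat": "chat", "chats": "chat",
}

def lemmatize(tokens, lemma_map):
    """Map each surface form to its lemma, leaving unknown forms unchanged."""
    return [lemma_map.get(t, t) for t in tokens]
```

On a sample such as ["mange", "manges", "mangé", "chat", "chats"], the surface-form TTR is 1.0 (five distinct forms), while the lemmatized TTR is 0.4 (two lemmas), so skipping lemmatization would substantially overstate diversity.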
This study examines the convergent validity of a wide range of computational indices reported by Coh-Metrix that have been associated in past studies with lexical features such as basic category words, semantic co-referentiality, word frequency, and lexical diversity. The study uses human judgments of these lexical features, as found in free-writing samples, as operationalizations of the lexical constructs the indices are meant to measure. Statistical analyses were then conducted to examine the convergent validity of each index and to assess the ability of the indices that correlate most strongly with the human judgments to predict holistic scores of lexical proficiency in L1 and L2 speakers. Correlations between the automated lexical indices and the operationalized constructs demonstrated small to large effect sizes, providing a degree of convergent validity for most of the automated indices examined in this study. A multiple regression predicting holistic judgments of lexical proficiency from these automated lexical indices explained 40% of the variance in a training set and 37% of the variance in a test set. The findings provide a degree of confidence that the indices are measuring the constructs they were predicted to measure.
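Convergent-validity checks of this kind rest on correlating each automated index with the corresponding human judgments. A minimal sketch of the Pearson correlation used in such checks follows; the variable names are illustrative, not taken from the study.

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists,
    e.g. an automated lexical index vs. human ratings of the same texts."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Effect sizes are then read off the coefficient directly (e.g. r around .10 small, .30 medium, .50 large under Cohen's conventional benchmarks), and the strongest-correlating indices are carried forward into the multiple regression.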
This study investigates the potential for computational models informed by automated lexical indices to simulate human ratings of word concreteness, word familiarity, and word imageability. The goal of the study is to provide word information estimates for words without human ratings, thereby affording greater textual coverage and permitting a better understanding of features that underlie word properties. This study uses traditional automated word features such as word length, word frequency, hypernymy, and polysemy, along with novel automated word features such as word type attributes taken from WordNet, LSA dimensions, and inverse entropy weights, as predictor variables. The model reported in this study for word concreteness predicted 61% of the variance in human ratings of word concreteness and demonstrated that more concrete words contain attributes related to people, animals, and food, have higher hypernymy levels, are related to two LSA dimensions, are more frequent, and are shorter. The model for word familiarity predicted 62% of the variance in the human ratings reported in the MRC database and demonstrated that more familiar words are found in a greater number of text samples and are more frequent. The model for word imageability ratings explained 42% of the variance in the human ratings and demonstrated that more imageable words contain attributes related to artifacts, animals, and plants, are related to two LSA dimensions, are more frequent, and are shorter.
In this paper we propose a frequency-based model of vocabulary acquisition and test it on texts written by second language (L2) writers of English. One goal of the paper is to address an issue that has arisen in previous work attempting to verify Laufer and Nation’s (1995) proposal for using lexical frequency profiling tools with L2 texts to estimate the underlying vocabulary size of the L2 writers. That issue is the application of Zipf’s law (Zipf, 1935, 1949) directly to student texts (see Meara, 2005; Edwards & Collins, 2011), which assumes that words are learned in the order of their frequency in the language at large. As this is clearly not the case, a more valid model of vocabulary learning needs to account for the presence of less common words at different points of the acquisition process. Our model supposes that learning consists of a sequence of exposures to words, seen in proportion to their frequency in the language as a whole, and that a certain number of exposures is required for a word to be learned (a model parameter). This allows calculation of the probability that a given word (whether common or uncommon) is learned after a given number of exposures in this sequence. Furthermore, it allows calculation of the likelihood that a word is used once it has been learned, based on the word’s rank in the learner’s interlanguage (we also considered the possibility of basing this step on the word’s rank in the L2 as a whole), from which we can predict frequency distributions for learner texts. For a given 1K word count in texts, the model predicts a smaller underlying productive vocabulary than predicted by the naïve application of Zipf’s law. We then fit the parameters of the model to texts written by 90 francophone ESL learners at different points of a five-month intensive program. The best fit was obtained with a ‘number of exposures’ parameter value of 3. The model reproduces the steeper-than-Zipf tail of the frequency distribution of words observed in texts.
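An exposure model of this general shape can be sketched as follows: words occur with Zipfian probabilities, a word counts as learned once it has been seen at least k times, and the probability that a given word is learned after N exposures is therefore a binomial tail. This is an illustrative sketch, not the paper's fitted model; the vocabulary size and exposure counts below are arbitrary, while the default k = 3 matches the best-fitting parameter value reported above.

```python
import math

def zipf_probs(vocab_size):
    """Per-exposure occurrence probabilities proportional to 1/rank."""
    h = sum(1.0 / r for r in range(1, vocab_size + 1))  # harmonic normalizer
    return [(1.0 / r) / h for r in range(1, vocab_size + 1)]

def prob_learned(p, n_exposures, k_needed):
    """P(a word with per-exposure probability p occurs >= k_needed times
    in n_exposures independent exposures): the binomial upper tail."""
    return 1.0 - sum(
        math.comb(n_exposures, j) * p ** j * (1 - p) ** (n_exposures - j)
        for j in range(k_needed)
    )

def expected_vocab(vocab_size, n_exposures, k_needed=3):
    """Expected number of learned words after n_exposures total tokens."""
    return sum(prob_learned(p, n_exposures, k_needed)
               for p in zipf_probs(vocab_size))
```

Because rare words need several chance occurrences before crossing the k-exposure threshold, the expected learned vocabulary grows more slowly in the tail than a direct reading of Zipf's law would suggest, which is the intuition behind the smaller predicted productive vocabulary.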
Many studies in a variety of educational contexts show that learning curves are non-linear (e.g. Freedman, 1987, for the development of storytelling skills in the first language; DeKeyser, 1997, for the acquisition of morphosyntactic rules of an artificial second language; or Brooks and Meltzoff, 2007, for the development of vocabulary in two-year-old infants), but there is no agreement on the best non-linear model, which may vary between contexts. Although there are strong arguments, on both empirical and theoretical grounds, that a power curve is appropriate in most educational settings (Newell & Rosenbloom, 1981; Ninio, 2007), other models have also been proposed (Van de gaer et al., 2009; Verhoeven & Van Leeuwe, 2009). However, little is known about the long-term patterns of vocabulary learning in a foreign language. In the present study we analyse the vocabulary used in 294 essays written by 42 students at regular intervals over a period of two years. We use several measures that focus on vocabulary richness as well as ratings from trained IELTS teachers. Our analysis is supported with structural equation modelling, in which a latent learning curve based on the power law can be identified. The present study is relevant to the discussion of methodological approaches in the measurement of vocabulary knowledge, but it also has pedagogical implications, as it allows teachers to identify when a certain plateau has been reached and when further vocabulary learning is effective only with additional pedagogical intervention.
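A power-law learning curve of the form score = a · time^b, as discussed above, can be fitted by ordinary least squares in log-log space, since log(score) = log(a) + b · log(time) is linear. This is a minimal sketch, not the structural-equation approach used in the study; it assumes strictly positive time points and scores.

```python
import math

def fit_power_curve(times, scores):
    """Least-squares fit of scores = a * times**b via log-log linear
    regression. Returns the estimated (a, b)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(s) for s in scores]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope of the log-log regression line is the power-law exponent b.
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b
```

An exponent b between 0 and 1 yields the decelerating curve typical of power-law learning: rapid early gains that flatten toward a plateau, which is exactly the point at which the abstract suggests additional pedagogical intervention becomes necessary.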