Edited by Ana Díaz-Negrillo, Nicolas Ballier and Paul Thompson
[Studies in Corpus Linguistics 59] 2013
pp. 169–204
This study reports on a new approach to semi-automatic error annotation and criterial feature extraction from learner corpora. Parallel learner corpora, sets of original learner writings paired with their proofread counterparts, were processed using edit distance to automatically identify surface taxonomy errors, which were then statistically analysed to produce language features that serve as criterial for a particular proficiency level. Two case studies report on different statistical and machine learning techniques: a clustering technique called variability-based neighbour clustering and an ensemble learning method called random forest. The results of the two case studies show that applying edit distance to parallel learner corpora is a promising way to annotate large quantities of learner data with minimal manual annotation work, and both techniques proved effective in identifying criterial features from learner corpora. Some theoretical and methodological issues are discussed with a view to further research.
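The central mechanism, aligning each learner sentence with its proofread counterpart via edit distance and reading off the differences as candidate errors, can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the function name, the use of Python's `difflib.SequenceMatcher`, and the mapping of alignment operations onto surface-taxonomy-style categories (substitution, addition, omission) are all assumptions for illustration.

```python
import difflib

def extract_edits(original_tokens, corrected_tokens):
    """Align a learner sentence with its proofread version and label each
    difference with a surface-taxonomy-style edit type (illustrative only)."""
    sm = difflib.SequenceMatcher(a=original_tokens, b=corrected_tokens)
    edits = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            # learner form rewritten by the proofreader (misformation)
            edits.append(("substitution", original_tokens[i1:i2], corrected_tokens[j1:j2]))
        elif op == "delete":
            # material the proofreader removed (addition error by the learner)
            edits.append(("addition", original_tokens[i1:i2], []))
        elif op == "insert":
            # material the proofreader supplied (omission error by the learner)
            edits.append(("omission", [], corrected_tokens[j1:j2]))
    return edits

# Hypothetical sentence pair from a parallel learner corpus:
learner = "He go to school yesterday".split()
proofread = "He went to school yesterday".split()
print(extract_edits(learner, proofread))
# → [('substitution', ['go'], ['went'])]
```

Edit labels extracted this way over a whole corpus could then be counted per proficiency level and fed to a statistical model, which is the role the clustering and random forest analyses play in the two case studies.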