Chapter published in: Applications of Pattern-driven Methods in Corpus Linguistics
Edited by Joanna Kopaczyk and Jukka Tyrkkö
[Studies in Corpus Linguistics 82] 2018
pp. 15–56
From lexical bundles to surprisal and language models
Measuring the idiom principle in native and learner language
We exploit the information-theoretic measure of surprisal to analyze the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles. We then argue that surprisal, as an information-theoretic abstraction over lexical bundleness, formulaicity and non-creativity, is an appropriate measure of the idiom principle, because it expresses reader expectations and text entropy. Because strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyze differences between genres of native language use and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language as compared with native language (L1). We thus test Pawley and Syder’s (1983) hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s open-choice principle). This balance can be measured with Levy and Jaeger’s (2007) uniform information density (UID), a principle of minimizing comprehension difficulty. Our goal of abstracting away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
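As a rough illustration of the measure named in the abstract, surprisal is the negative log probability of a word given its context, so a highly expected (formulaic) continuation scores low. The sketch below is not the chapter's implementation: the bigram model, the add-one smoothing, the toy corpus and all function names are assumptions made only for the example.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + tokens          # "<s>" marks the sentence start
        vocab.update(tokens)
        for prev, word in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, word)] += 1
    return contexts, bigrams, len(vocab)

def surprisal(prev, word, model):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing (an assumption)."""
    contexts, bigrams, vocab_size = model
    p = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
    return -math.log2(p)

def mean_surprisal(tokens, model):
    """Average per-word surprisal of a token sequence."""
    padded = ["<s>"] + tokens
    values = [surprisal(p, w, model) for p, w in zip(padded, padded[1:])]
    return sum(values) / len(values)

# Toy training data, invented for illustration only.
corpus = [
    ["on", "the", "other", "hand"],
    ["on", "the", "other", "side"],
    ["on", "the", "table"],
]
model = train_bigram_model(corpus)

formulaic = ["on", "the", "other", "hand"]   # a frequent bundle in the toy corpus
scrambled = ["hand", "the", "other", "on"]   # same words, unexpected order
print(mean_surprisal(formulaic, model), mean_surprisal(scrambled, model))
```

In this toy setting the attested bundle receives a lower mean surprisal than the scrambled sequence. A real study would need a large reference corpus and a better-smoothed model.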
Keywords: formulaicity, learner’s language, language processing, collocation, part-of-speech tagging, syntactic parsing
Published online: 13 March 2018
Altenberg, Bengt & Tapper, Marie
Aston, Guy & Burnard, Lou
Bartsch, Sabine & Evert, Stefan
Biber, Douglas & Barbieri, Federica
Biber, Douglas, Conrad, Susan & Cortes, Viviana
Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan & Finegan, Edward
Bonk, William J.
2000 Testing ESL learners’ knowledge of collocations. Urbana IL: Clearinghouse. http://files.eric.ed.gov/fulltext/ED442309.pdf
Conrad, Susan & Biber, Douglas
Cheng, Winnie, Greaves, Chris, Sinclair, John McH. & Warren, Martin
Ellis, Nick C.
Ellis, Nick C., Frey, Eric & Jalkanen, Isaac
Ellis, Nick C. & Frey, Eric
Ellis, Nick C., Simpson Vlach, Rita & Maynard, Carson
Erman, Britt & Warren, Beatrice
2009 Formulaic language from a learner perspective: What the learner needs to know. In Formulaic Language, Vol. II: Acquisition, Loss, Psychological Reality, and Functional Explanations [Typological Studies in Language 83], Roberta Corrigan, Edith A. Moravcsik, Hamid Ouali & Kathleen M. Wheatley (eds), 323–346. Amsterdam: John Benjamins.
Frank, Stefan L. & Bod, Rens
Frank, Stefan L., Fernandez Monsalve, Irene, Thompson, Robin L. & Vigliocco, Gabriella
Fossum, Victoria & Levy, Roger 2012 Sequential vs. hierarchical models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012), Montreal, Canada, Roger Levy & David Reitter (eds), 61–69. Montreal: Association for Computational Linguistics.
Gries, Stefan T.
Granger, Sylviane & Tyson, Stephanie
Izumi, Emi, Uchimoto, Kiyotaka & Isahara, Hitoshi
2005 Error annotation for corpus of Japanese learner English. Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora (LINC 2005). http://clair.eecs.umich.edu/aan/paper.php?paper_id=I05-6009#pdf
2009 Vocabulary in interlanguage: A study on corpus of English essays written by Asian university students (CEEAUS). In Phraseology, Corpus Linguistics and Lexicography: Papers from Phraseology 2009 in Japan, Katsumasa Yagi & Takaaki Kanzaki (eds), 87–100. Nishinomiya: Kwansei Gakuin University Press.
Kennedy, Chris & Thorp, Dilys
Lee, David Y. W.
Lehmann, Hans Martin & Schneider, Gerold 2011 A large-scale investigation of verb-attached prepositional phrases. In Studies in Variation, Contacts and Change in English, Vol. 6: Methodological and Historical Dimensions of Corpus Linguistics, Sebastian Hoffmann, Paul Rayson & Geoffrey Leech (eds). Helsinki: Varieng.
Levy, Roger & Jaeger, T. Florian
Jaeger, T. Florian
Lorenz, Gunter R.
Malvern, David D., Richards, Brian J., Chipere, Ngoni & Durán, Pilar
Marcus, Mitch, Santorini, Beatrice & Marcinkiewicz, Mary Ann
McEnery, Tony, Xiao, Richard & Tono, Yukio
Ng, Hwee Tou, Wu, Siew Mei, Briscoe, Ted, Hadiwinoto, Christian, Hendy Susanto, Raymond & Bryant, Christopher
2012 Japanese Learner English Corpus (JLE, Version 4.1, 2012). http://alaginrc.nict.go.jp/nict_jle/index_E.html
2009 Formulaic expressions in intermediate EFL writing assessment. In Formulaic Language, Vol. II: Acquisition, Loss, Psychological Reality, and Functional Explanations [Typological Studies in Language 83], Roberta Corrigan, Edith A. Moravcsik, Hamid Ouali & Kathleen M. Wheatley (eds), 375–386. Amsterdam: John Benjamins.
Pawley, Andrew & Hodgetts Syder, Frances
Read, John & Nation, Paul
2006 An investigation of the lexical dimension of the IELTS speaking test. In IELTS Research Reports, Vol. 6, Petronella McGovern & Steve Walsh (eds). IELTS Australia and British Council. https://www.ielts.org/pdf/Volume%206,%20Report%207.pdf
Ronan, Patricia & Schneider, Gerold
Shannon, Claude E.
Sinclair, John McH. & Mauranen, Anna
Siyanova-Chanturia, Anna & Martinez, Ron
Zipf, George Kingsley