Chapter published in: Applications of Pattern-driven Methods in Corpus Linguistics
Edited by Joanna Kopaczyk and Jukka Tyrkkö
[Studies in Corpus Linguistics 82] 2018
pp. 15–56
Chapter 2. From lexical bundles to surprisal and language models
Measuring the idiom principle in native and learner language
We exploit the information-theoretic measure of surprisal to analyse the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles. We then argue that surprisal, as an information-theoretic measure of lexical bundleness, formulaicity and non-creativity, is an appropriate measure of the idiom principle, because it expresses reader expectations and text entropy. Since strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyse differences between genres of native language use and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language as compared with native language (L1). We thus test Pawley and Syder’s (1983) hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s open-choice principle). This balance can be measured with Levy and Jaeger’s (2007) uniform information density (UID), a principle of minimizing comprehension difficulty. Our goal of abstracting away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
Keywords: formulaicity, learner’s language, language processing, collocation, part-of-speech tagging, syntactic parsing
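To make the notion of surprisal in the abstract concrete, the following is a minimal sketch, not the chapter's own implementation. It assumes a bigram language model with add-one smoothing, and the toy corpus and the function names (`train_bigram_counts`, `surprisal`, `mean_surprisal`) are hypothetical, introduced only for illustration. Surprisal is computed as the negative base-2 logarithm of a word's probability given the preceding word, so a formulaic sequence should receive a lower mean surprisal than a scrambled one.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count contexts and bigrams over tokenised sentences (toy model)."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens  # sentence-start marker as first context
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab


def surprisal(prev, word, context_counts, bigram_counts, vocab_size):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing."""
    p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)
    return -math.log2(p)


def mean_surprisal(tokens, context_counts, bigram_counts, vocab_size):
    """Average per-word surprisal of a token sequence."""
    padded = ["<s>"] + tokens
    values = [
        surprisal(prev, word, context_counts, bigram_counts, vocab_size)
        for prev, word in zip(padded, padded[1:])
    ]
    return sum(values) / len(values)


# Hypothetical toy corpus; a real study would train on a large reference corpus.
corpus = [
    "at the end of the day".split(),
    "at the end of the week".split(),
    "at the end of the road".split(),
]
contexts, bigrams, vocab = train_bigram_counts(corpus)

formulaic = mean_surprisal("at the end of the day".split(), contexts, bigrams, len(vocab))
scrambled = mean_surprisal("day the of end the at".split(), contexts, bigrams, len(vocab))
print(f"formulaic: {formulaic:.2f} bits, scrambled: {scrambled:.2f} bits")
```

Averaging over a sequence gives one number per text or bundle, which is the kind of abstraction from individual bundles to a gradient measure of bundleness that the abstract argues for. The smoothing scheme and model order here are assumptions chosen for brevity.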
- 2. Related research
- 4. From frequencies to collocations
- 4.1 Frequency as measure of lexical bundleness
- 4.2 Collocation measures: O/E and T-score
- 5. Surprisal as a measure of bundleness
- 5.3 Bundleness of spoken L2 compared to corrected L2
- 5.4 Bundleness of written L2 compared to L1
- 6. Collocations as non-adjacent relations in a syntactic frame
- 7. Part-of-Speech tagging model
- 8. Parser as a language processing model
- 8.2 Parser performance
- 8.3 Parser model fit
- 9. Conclusions and outlook
Published online: 13 March 2018
Altenberg, Bengt & Tapper, Marie
Aston, Guy & Burnard, Lou
Bartsch, Sabine & Evert, Stefan
Biber, Douglas & Barbieri, Federica
Biber, Douglas, Conrad, Susan & Cortes, Viviana
Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan & Finegan, Edward
Bonk, William J.
2000 Testing ESL learners’ knowledge of collocations. Urbana IL: Clearinghouse. http://files.eric.ed.gov/fulltext/ED442309.pdf
Conrad, Susan & Biber, Douglas
Cheng, Winnie, Greaves, Chris, Sinclair, John McH. & Warren, Martin
Ellis, Nick C.
Ellis, Nick C., Frey, Eric & Jalkanen, Isaac
Ellis, Nick C. & Frey, Eric
Ellis, Nick C., Simpson Vlach, Rita & Maynard, Carson
Erman, Britt & Warren, Beatrice
2009 Formulaic language from a learner perspective: What the learner needs to know. In Formulaic Language, Vol. II: Acquisition, Loss, Psychological Reality, and Functional Explanations [Typological Studies in Language 83], Roberta Corrigan, Edith A. Moravcsik, Hamid Ouali & Kathleen M. Wheatley (eds), 323–346. Amsterdam: John Benjamins.
Frank, Stefan L. & Bod, Rens
Frank, Stefan L., Fernandez Monsalve, Irene, Thompson, Robin L. & Vigliocco, Gabriella
Fossum, Victoria & Levy, Roger
Gries, Stefan T.
Granger, Sylviane & Tyson, Stephanie
Izumi, Emi, Uchimoto, Kiyotaka & Isahara, Hitoshi
2005 Error annotation for corpus of Japanese learner English. Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora (LINC 2005). http://clair.eecs.umich.edu/aan/paper.php?paper_id=I05-6009#pdf
2009 Vocabulary in interlanguage: A study on corpus of English essays written by Asian university students (CEEAUS). In Phraseology, Corpus Linguistics and Lexicography: Papers from Phraseology 2009 in Japan, Katsumasa Yagi & Takaaki Kanzaki (eds), 87–100. Nishinomiya: Kwansei Gakuin University Press.
Kennedy, Chris & Thorp, Dilys
Lee, David Y. W.
Lehmann, Hans Martin & Schneider, Gerold
Levy, Roger & Jaeger, T. Florian
Jaeger, T. Florian
Lorenz, Gunter R.
Malvern, David D., Richards, Brian J., Chipere, Ngoni & Durán, Pilar
Marcus, Mitch, Santorini, Beatrice & Marcinkiewicz, Mary Ann
McEnery, Tony, Xiao, Richard & Tono, Yukio
Ng, Hwee Tou, Wu, Siew Mei, Briscoe, Ted, Hadiwinoto, Christian, Hendy Susanto, Raymond & Bryant, Christopher
2012 Japanese Learner English Corpus (JLE, Version 4.1, 2012). http://alaginrc.nict.go.jp/nict_jle/index_E.html
2009 Formulaic expressions in intermediate EFL writing assessment. In Formulaic Language, Vol. II: Acquisition, Loss, Psychological Reality, and Functional Explanations [Typological Studies in Language 83], Roberta Corrigan, Edith A. Moravcsik, Hamid Ouali & Kathleen M. Wheatley (eds), 375–386. Amsterdam: John Benjamins.
Pawley, Andrew & Hodgetts Syder, Frances
Read, John & Nation, Paul
2006 An investigation of the lexical dimension of the IELTS speaking test. In IELTS Research Reports, Vol. 6, Petronella McGovern & Steve Walsh (eds). IELTS Australia and British Council. https://www.ielts.org/pdf/Volume%206,%20Report%207.pdf
Ronan, Patricia & Schneider, Gerold
Shannon, Claude E.
Sinclair, John McH. & Mauranen, Anna
Siyanova-Chanturia, Anna & Martinez, Ron
Zipf, George Kingsley