Chapter 2
From lexical bundles to surprisal and language models
Measuring the idiom principle in native and learner language
We use the information-theoretic measure of surprisal to analyse the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles. We then argue that surprisal, as an information-theoretic measure of lexical bundleness, formulaicity and non-creativity, is an appropriate measure of the idiom principle, because it expresses reader expectations and text entropy. Since strong and gradient formulaic, idiomatic and selectional preferences prevail at all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyse differences between genres of native language use and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language as compared with native language (L1). We thus test Pawley and Syder’s (1983) hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s open-choice principle). This balance can be measured with Levy and Jaeger’s (2007) uniform information density (UID), a principle of minimizing comprehension difficulty. Our goal of abstracting away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
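As an illustration of the measure, the surprisal of a word is commonly defined as the negative log probability of that word given its context, so predictable (formulaic) continuations score low and unexpected ones score high. The sketch below is a minimal example under stated assumptions, not the chapter's actual method: the bigram context, the add-one smoothing, the toy training text and the function names `train_bigram_model`, `surprisal` and `mean_surprisal` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_bigram_model(tokens):
    """Collect the counts needed for a smoothed bigram model (illustrative only)."""
    contexts = Counter(tokens[:-1])             # how often each word occurs as a left context
    bigrams = Counter(zip(tokens, tokens[1:]))  # how often each adjacent word pair occurs
    vocab = set(tokens)
    return contexts, bigrams, vocab


def surprisal(prev, word, contexts, bigrams, vocab):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing (an assumption)."""
    prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
    return -math.log2(prob)


def mean_surprisal(tokens, contexts, bigrams, vocab):
    """Average surprisal over all adjacent word pairs in a token sequence."""
    pairs = list(zip(tokens, tokens[1:]))
    total = sum(surprisal(p, w, contexts, bigrams, vocab) for p, w in pairs)
    return total / len(pairs)


# Toy training text, invented for this example; a real study would use a large corpus.
training = "the end of the day at the end of the day on the other hand".split()
contexts, bigrams, vocab = train_bigram_model(training)

formulaic = mean_surprisal("at the end of the day".split(), contexts, bigrams, vocab)
scrambled = mean_surprisal("day the of end at".split(), contexts, bigrams, vocab)
print(f"formulaic: {formulaic:.2f} bits, scrambled: {scrambled:.2f} bits")
```

On this toy data the attested sequence receives a lower mean surprisal than its scrambled counterpart, which is the sense in which low surprisal can be read as high bundleness.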
Article outline
- 1. Introduction
- 2. Related research
- 3. Materials
- 4. From frequencies to collocations
- 4.1 Frequency as measure of lexical bundleness
- 4.2 Collocation measures: O/E and T-score
- 5. Surprisal as a measure of bundleness
- 5.1 Method
- 5.2 Results
- 5.3 Bundleness of spoken L2 compared to corrected L2
- 5.4 Bundleness of written L2 compared to L1
- 6. Collocations as non-adjacent relations in a syntactic frame
- 7. Part-of-Speech tagging model
- 8. Parser as a language processing model
- 8.1 Method
- 8.2 Parser performance
- 8.3 Parser model fit
- 9. Conclusions and outlook
- Notes
- References
References
Aggarwal, Charu C.
2013 Outlier Analysis. Dordrecht: Kluwer.
Altenberg, Bengt & Tapper, Marie
1998 The use of adverbial connectors in advanced Swedish learners’ written English. In Learner English on Computer, Sylviane Granger (ed.), 80–93. London: Addison Wesley Longman.
Aston, Guy & Burnard, Lou
1998 The BNC Handbook. Exploring the British National Corpus with SARA. Edinburgh: EUP.
Bartsch, Sabine & Evert, Stefan
2014 Towards a Firthian notion of collocation. In Vernetzungsstrategien, Zugriffsstrukturen und automatisch ermittelte Angaben in Internetwörterbüchern [OPAL – Online publizierte Arbeiten zur Linguistik 2/2014], Andrea Abel & Lothar Lemnitzer (eds), 48–61. Mannheim: Institut für Deutsche Sprache.
Biber, Douglas
2003 Compressed noun-phrase structures in newspaper discourse: The competing demands of popularization vs. economy. In New Media Language, Jean Aitchison & Diana Lewis (eds), 169–181. London: Routledge.
Biber, Douglas & Barbieri, Federica
2007 Lexical bundles in university spoken and written registers. English for Specific Purposes 26: 263–286.
Biber, Douglas, Conrad, Susan & Cortes, Viviana
2004 If you look at…: Lexical bundles in university teaching and textbooks. Applied Linguistics 25: 371–405.
Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan & Finegan, Edward
1999 Longman Grammar of Spoken and Written English. London: Longman.
Bonk, William J.
2000 Testing ESL learners’ knowledge of collocations. Urbana IL: Clearinghouse.
[URL]
Conrad, Susan & Biber, Douglas
2004 The frequency and use of lexical bundles in conversation and academic prose. Lexicographica 20: 56–71.
Cheng, Winnie, Greaves, Chris, Sinclair, John McH. & Warren, Martin
2009 Uncovering the extent of the phraseological tendency: Towards a systematic analysis of concgrams. Applied Linguistics 30(2): 236–252.
Ellis, Nick C.
2002 Frequency effects in language processing. Studies in Second Language Acquisition 24(2): 143–188.
Ellis, Nick C., Frey, Eric & Jalkanen, Isaac
Ellis, Nick C. & Frey, Eric
Ellis, Nick C., Simpson-Vlach, Rita & Maynard, Carson
2008 Formulaic language in native and second language speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL Quarterly 42(3): 375–396.
Erman, Britt & Warren, Beatrice
2000 The idiom principle and the open choice principle. TEXT 20(1): 29–62.
Evert, Stefan
2009 Corpora and collocations. In Corpus Linguistics. An International Handbook, Anke Lüdeling & Merja Kytö (eds), 1212–1248. Berlin: Mouton de Gruyter.
Frank, Stefan L. & Bod, Rens
2011 Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science 22(6): 829–834.
Frank, Stefan L., Fernandez Monsalve, Irene, Thompson, Robin L. & Vigliocco, Gabriella
2013 Reading-time data for evaluating broad-coverage models of English sentence processing. Behavior Research Methods 45: 1182–1190.
Fossum, Victoria & Levy, Roger
2012 Sequential vs. hierarchical models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012), Montreal, Canada, Roger Levy & David Reitter (eds), 61–69. Montreal: Association for Computational Linguistics.
Gildea, Daniel
2001 Corpus variation and parser performance. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP), 167–202, Pittsburgh, PA.
Gries, Stefan T.
2010 Useful statistics for corpus linguistics. In A Mosaic of Corpus Linguistics: Selected Approaches, Aquilino Sánchez & Moisés Almela (eds), 269–291. Frankfurt: Peter Lang.
Granger, Sylviane
2009 Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Phraseology: Theory, Analysis, and Applications, Anthony P. Cowie (ed.), 185–204. Tokyo: Kurosio.
Granger, Sylviane & Tyson, Stephanie
1996 Connector usage in the English essay writing of native and non-native EFL speakers of English. World Englishes 15(1): 17–27.
Hoey, Michael
2005 Lexical Priming: A New Theory of Words and Language. London: Routledge.
Izumi, Emi, Uchimoto, Kiyotaka & Isahara, Hitoshi
2005 Error annotation for corpus of Japanese learner English. Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora (LINC 2005).
[URL]
Ishikawa, Shin
2009 Vocabulary in interlanguage: A study on corpus of English essays written by Asian university students (CEEAUS). In Phraseology, Corpus Linguistics and Lexicography: Papers from Phraseology 2009 in Japan, Katsumasa Yagi & Takaaki Kanzaki (eds), 87–100. Nishinomiya: Kwansei Gakuin University Press.
Kennedy, Chris & Thorp, Dilys
2007 A corpus investigation of linguistic responses to an IELTS Academic Writing task. In IELTS Collected Papers: Research in Speaking and Writing Assessment, Linda Taylor & Peter Falvey (eds), 316–378. Cambridge: CUP.
Kopaczyk, Joanna
2012 Applications of the lexical bundles method in historical corpus research. In Corpus Data across Languages and Disciplines, Piotr Pezik (ed.), 83–95. Frankfurt: Peter Lang.
Keller, Frank
2003 A probabilistic parser as a model of global processing difficulty. In Proceedings of the 25th Annual Conference of the Cognitive Science Society, Richard Alterman & David Kirsh (eds), 646–651. Boston MA: Cognitive Science Society.
Keller, Frank
2010 Cognitively plausible models of human language processing. In
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: Short Papers 11–16 July, 60–67. Uppsala: Uppsala University.
Lee, David Y. W.
2001 Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology 5(3): 37–72.
Leech, Geoffrey
2000 Grammars of spoken English: New outcomes of corpus-oriented research. Language Learning 50(4): 675–724.
Lehmann, Hans Martin & Schneider, Gerold
2011 A large-scale investigation of verb-attached prepositional phrases. In Studies in Variation, Contacts and Change in English, Vol. 6: Methodological and Historical Dimensions of Corpus Linguistics, Sebastian Hoffmann, Paul Rayson & Geoffrey Leech (eds). Helsinki: Varieng.
Levy, Roger & Jaeger, T. Florian
2007 Speakers optimize information density through syntactic reduction. In Advances in Neural Information Processing Systems (NIPS) 19, Bernhard Schölkopf, John Platt & Thomas Hofmann (eds), 849–856. Cambridge MA: The MIT Press.
Jaeger, T. Florian
2010 Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology 61(1): 23–62.
Lorenz, Gunter R.
1999 Adjective Intensification – Learners Versus Native Speakers. A Corpus Study of Argumentative Writing. Amsterdam: Rodopi.
Malvern, David D., Richards, Brian J., Chipere, Ngoni & Durán, Pilar
2004 Lexical Diversity and Language Development. Houndmills: Palgrave Macmillan.
Marcus, Mitch, Santorini, Beatrice & Marcinkiewicz, Mary Ann
1993 Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19: 313–330.
McEnery, Tony, Xiao, Richard & Tono, Yukio
2006 Corpus-based Language Studies: An Advanced Resource Book [Routledge Applied Linguistics Series]. London: Routledge.
Millar, Neil
2011 The processing of malformed learner collocations. Applied Linguistics 32(2): 129–148.
Nattinger, James R.
1980 A lexical phrase-grammar for ESL. TESOL Quarterly 14(3): 337–344.
Nesselhauf, Nadja
2003 The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics 24(2): 223–242.
Ng, Hwee Tou, Wu, Siew Mei, Briscoe, Ted, Hadiwinoto, Christian, Hendy Susanto, Raymond & Bryant, Christopher
(eds) 2014 Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task. Baltimore MD: Association for Computational Linguistics.
NICT
2012 Japanese Learner English Corpus (JLE, Version 4.1, 2012).
[URL]
Ohlrogge, Aaron
2009 Formulaic expressions in intermediate EFL writing assessment. In Formulaic Language, Vol. II: Acquisition, Loss, Psychological Reality, and Functional Explanations [Typological Studies in Language 83], Roberta Corrigan, Edith A. Moravcsik, Hamid Ouali & Kathleen M. Wheatley (eds), 375–386. Amsterdam: John Benjamins.
Pawley, Andrew & Hodgetts Syder, Frances
1983 Two puzzles for linguistic theory: Native-like selection and native-like fluency. In Language and Communication, Jack C. Richards & Richard W. Schmidt (eds), 191–226. London: Longman.
Pecina, Pavel
2009 Lexical Association Measures: Collocation Extraction [Studies in Computational and Theoretical Linguistics 4]. Prague: Institute of Formal and Applied Linguistics, Charles University in Prague.
Read, John & Nation, Paul
2006 An investigation of the lexical dimension of the IELTS speaking test. In IELTS Research Reports, Vol. 6, Petronella McGovern & Steve Walsh (eds). IELTS Australia and British Council.
[URL]
Ronan, Patricia & Schneider, Gerold
Schmid, Helmut
1994 Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing. Manchester.
Schneider, Gerold
2008 Hybrid Long-Distance Functional Dependency Parsing. PhD dissertation, University of Zurich.
Seretan, Violeta
2011 Syntax-Based Collocation Extraction. Dordrecht: Springer.
Shannon, Claude E.
1951 Prediction and entropy of printed English. The Bell System Technical Journal 30: 50–64.
Sinclair, John
1991 Corpus, Concordance, Collocation. Oxford: OUP.
Sinclair, John McH. & Mauranen, Anna
Siyanova-Chanturia, Anna & Martinez, Ron
2014 The Idiom Principle revisited. Applied Linguistics 36(5): 549–569.
Zipf, George Kingsley
1965 The Psycho-Biology of Language: An Introduction to Dynamic Philology. Cambridge MA: The MIT Press.
Zipf, George Kingsley
1949 Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. London: Addison-Wesley.