Chapter 2
From lexical bundles to surprisal and language models
Measuring the idiom principle in native and learner language
We exploit the information-theoretic measure of surprisal to analyse the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles, then argue that surprisal, as an information-theoretic measure of lexical bundleness, formulaicity and non-creativity, is an appropriate measure of the idiom principle, because it expresses reader expectations and text entropy. Since strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyse differences between genres of native language use and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language as compared with native language (L1). We thus test Pawley and Syder’s (1983) hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s open-choice principle). This balance can be measured with Levy and Jaeger’s (2007) uniform information density (UID), a principle of minimizing comprehension difficulty. Our goal of abstracting away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
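To make the notion of surprisal concrete, the following minimal sketch computes per-word surprisal, -log2 P(word | previous word), from a toy bigram model and averages it over a sequence. It is an illustration only, not the method used in the chapter: the toy corpus, the add-alpha smoothing and all function names are assumptions introduced here.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count bigrams and their left contexts; "<s>" marks the sentence start."""
    contexts = Counter()
    bigrams = Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens
        vocab.update(tokens)
        for prev, word in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, word)] += 1
    return contexts, bigrams, len(vocab)

def surprisal(prev, word, contexts, bigrams, vocab_size, alpha=1.0):
    """Surprisal in bits under an add-alpha smoothed bigram model (an assumption)."""
    p = (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * vocab_size)
    return -math.log2(p)

def mean_surprisal(tokens, contexts, bigrams, vocab_size):
    """Average surprisal over a sequence: lower values indicate more 'bundleness'."""
    padded = ["<s>"] + tokens
    values = [surprisal(p, w, contexts, bigrams, vocab_size)
              for p, w in zip(padded, padded[1:])]
    return sum(values) / len(values)

# Hypothetical toy corpus, for illustration only.
corpus = [
    ["at", "the", "end", "of", "the", "day"],
    ["at", "the", "end", "of", "the", "week"],
    ["the", "day", "of", "the", "week"],
]
contexts, bigrams, vocab_size = train_bigram_counts(corpus)

formulaic = ["at", "the", "end", "of", "the", "day"]
scrambled = ["day", "the", "at", "of", "end", "the"]
print(mean_surprisal(formulaic, contexts, bigrams, vocab_size))
print(mean_surprisal(scrambled, contexts, bigrams, vocab_size))
```

In this toy setting the formulaic sequence receives a lower mean surprisal than the scrambled one, which is the sense in which surprisal can stand in for bundleness. A realistic study would train on a large reference corpus and use a stronger smoothing scheme.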
Article outline
- 1. Introduction
- 2. Related research
- 3. Materials
- 4. From frequencies to collocations
- 4.1 Frequency as measure of lexical bundleness
- 4.2 Collocation measures: O/E and T-score
- 5. Surprisal as a measure of bundleness
- 5.1 Method
- 5.2 Results
- 5.3 Bundleness of spoken L2 compared to corrected L2
- 5.4 Bundleness of written L2 compared to L1
- 6. Collocations as non-adjacent relations in a syntactic frame
- 7. Part-of-Speech tagging model
- 8. Parser as a language processing model
- 8.1 Method
- 8.2 Parser performance
- 8.3 Parser model fit
- 9. Conclusions and outlook
- Notes
- References
References
Aggarwal, Charu C.
2013 Outlier Analysis. Dordrecht: Kluwer.


Altenberg, Bengt & Tapper, Marie
1998 The use of adverbial connectors in advanced Swedish learners’ written English. In
Learner English on Computer,
Sylviane Granger (ed.), 80–93. London: Addison Wesley Longman.

Aston, Guy & Burnard, Lou
1998 The BNC Handbook. Exploring the British National Corpus with SARA. Edinburgh: EUP.

Bartsch, Sabine & Evert, Stefan
2014 Towards a Firthian notion of collocation. In
Vernetzungsstrategien, Zugriffsstrukturen und automatisch ermittelte Angaben in Internetwörterbüchern [
OPAL – Online publizierte Arbeiten zur Linguistik 2/2014],
Andrea Abel &
Lothar Lemnitzer (eds), 48–61. Mannheim: Institut für Deutsche Sprache.

Biber, Douglas
2003 Compressed noun-phrase structures in newspaper discourse: The competing demands of popularization vs. economy. In
New Media Language,
Jean Aitchison &
Diana Lewis (eds), 169–181. London: Routledge.

Biber, Douglas & Barbieri, Federica
2007 Lexical bundles in university spoken and written registers.
English for Specific Purposes 26: 263–286.


Biber, Douglas, Conrad, Susan & Cortes, Viviana
2004 If you look at…: Lexical bundles in university teaching and textbooks.
Applied Linguistics 25: 371–405.


Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan & Finegan, Edward
1999 Longman Grammar of Spoken and Written English. London: Longman.

Bonk, William J.
2000 Testing ESL learners’ knowledge of collocations. Urbana IL: Clearinghouse.
[URL]
Conrad, Susan & Biber, Douglas
2004 The frequency and use of lexical bundles in conversation and academic prose.
Lexicographica 20: 56–71.

Cheng, Winnie, Greaves, Chris, Sinclair, John McH. & Warren, Martin
2009 Uncovering the extent of the phraseological tendency: Towards a systematic analysis of concgrams.
Applied Linguistics 30(2): 236–252.


Ellis, Nick C.
2002 Frequency effects in language processing.
Studies in Second Language Acquisition 24(2): 143–188.

Ellis, Nick C., Frey, Eric & Jalkanen, Isaac
Ellis, Nick C. & Frey, Eric
Ellis, Nick C., Simpson Vlach, Rita & Maynard, Carson
2008 Formulaic language in native and second language speakers: Psycholinguistics, corpus linguistics, and TESOL.
TESOL Quarterly 42(3): 375–396.


Erman, Britt & Warren, Beatrice
2000 The idiom principle and the open choice principle.
TEXT 20(1): 29–62.


Evert, Stefan
2009 Corpora and collocations. In
Corpus Linguistics. An International Handbook,
Anke Lüdeling &
Merja Kytö (eds), 1212–1248. Berlin: Mouton de Gruyter.


Frank, Stefan L. & Bod, Rens
2011 Insensitivity of the human sentence-processing system to the hierarchical structure.
Psychological Science 22(6): 829–834.


Frank, Stefan L., Fernandez Monsalve, Irene, Thompson, Robin L. & Vigliocco, Gabriella
2013 Reading-time data for evaluating broad-coverage models of English sentence processing.
Behavior Research Methods 45: 1182–1190.


Fossum, Victoria & Levy, Roger
2012 Sequential vs. hierarchical models of human incremental sentence processing. In
Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012),
Montreal, Canada,
Roger Levy &
David Reitter (eds), 61–69. Montreal: Association for Computational Linguistics.

Gildea, Daniel
2001 Corpus variation and parser performance. In
Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP), 167–202, Pittsburgh, PA.

Gries, Stefan T.
2010 Useful statistics for corpus linguistics. In
A Mosaic of Corpus Linguistics: Selected Approaches,
Aquilino Sánchez &
Moisés Almela (eds), 269–291. Frankfurt: Peter Lang.

Granger, Sylviane
2009 Prefabricated patterns in advanced EFL writing: Collocations and formulae. In
Phraseology: Theory, Analysis, and Applications,
Anthony P. Cowie (ed.), 185–204. Tokyo: Kurosio.

Granger, Sylviane & Tyson, Stephanie
1996 Connector usage in the English essay writing of native and non-native EFL speakers of English.
World Englishes 15(1): 17–27.


Hoey, Michael
2005 Lexical Priming: A New Theory of Words and Language. Routledge.


Izumi, Emi, Uchimoto, Kiyotaka & Isahara, Hitoshi
2005 Error annotation for corpus of Japanese learner English.
Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora (LINC 2005).
[URL]
Ishikawa, Shin
2009 Vocabulary in interlanguage: A study on corpus of English essays written by Asian university students (CEEAUS). In
Phraseology, Corpus Linguistics and Lexicography: Papers from Phraseology 2009 in Japan,
Katsumasa Yagi &
Takaaki Kanzaki (eds), 87–100. Nishinomiya: Kwansei Gakuin University Press.

Kennedy, Chris & Thorp, Dilys
2007 A corpus investigation of linguistic responses to an IELTS Academic Writing task. In
IELTS Collected Papers: Research in Speaking and Writing Assessment,
Linda Taylor &
Peter Falvey (eds), 316–378. Cambridge: CUP.

Kopaczyk, Joanna
2012 Applications of the lexical bundles method in historical corpus research. In
Corpus Data across Languages and Disciplines,
Piotr Pezik (ed.), 83–95. Frankfurt: Peter Lang.

Keller, Frank
2003 A probabilistic parser as a model of global processing difficulty. In
Proceedings of the 25th Annual Conference of the Cognitive Science Society,
Richard Alterman &
David Kirsh (eds), 646–651. Boston MA: Cognitive Science Society.

Keller, Frank
2010 Cognitively plausible models of human language processing. In
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: Short Papers 11–16 July, 60–67. Uppsala: Uppsala University.

Lee, David Y. W.
2001 Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle.
Language Learning and Technology 5(3): 37–72.

Leech, Geoffrey
2000 Grammars of spoken English: New outcomes of corpus-oriented research.
Language Learning 50(4): 675–724.


Lehmann, Hans Martin & Schneider, Gerold
2011 A large-scale investigation of verb-attached prepositional phrases. In
Studies in Variation, Contacts and Change in English, Vol. 6: Methodological and Historical Dimensions of Corpus Linguistics,
Sebastian Hoffmann,
Paul Rayson &
Geoffrey Leech (eds). Helsinki: Varieng.

Levy, Roger & Jaeger, T. Florian
2007 Speakers optimize information density through syntactic reduction. In
Advances in Neural Information Processing Systems (NIPS) 19,
Bernhard Schölkopf,
John Platt &
Thomas Hofmann (eds), 849–856. Cambridge MA: The MIT Press.

Jaeger, T. Florian
2010 Redundancy and reduction: Speakers manage syntactic information density.
Cognitive Psychology 61(1): 23–62.


Lorenz, Gunter R.
1999 Adjective Intensification – Learners Versus Native Speakers. A Corpus Study of Argumentative Writing. Amsterdam: Rodopi.

Malvern, David D., Richards, Brian J., Chipere, Ngoni & Durán, Pilar
2004 Lexical Diversity and Language Development. Houndmills: Palgrave MacMillan.


Marcus, Mitch, Santorini, Beatrice & Marcinkiewicz, Mary Ann
1993 Building a large annotated corpus of English: The Penn Treebank.
Computational Linguistics 19: 313–330.

McEnery, Tony, Xiao, Richard & Tono, Yukio
2006 Corpus-based Language Studies: An Advanced Resource Book [
Routledge Applied Linguistics Series]. London: Routledge.

Millar, Neil
2011 The processing of malformed learner collocations.
Applied Linguistics 32(2): 129–148.


Nattinger, James R.
1980 A lexical phrase-grammar for ESL.
TESOL Quarterly 14(3): 337–344.


Nesselhauf, Nadja
2003 The use of collocations by advanced learners of English and some implications for teaching.
Applied Linguistics 24(2): 223–242.


Ng, Hwee Tou, Wu, Siew Mei, Briscoe, Ted, Hadiwinoto, Christian, Hendy Susanto, Raymond & Bryant, Christopher
(eds) 2014 Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task. Baltimore MD: Association for Computational Linguistics.


NICT
2012 Japanese Learner English Corpus (JLE, Version 4.1, 2012).
[URL]
Ohlrogge, Aaron
2009 Formulaic expressions in intermediate EFL writing assessment. In
Formulaic Language, Vol. II: Acquisition, Loss, Psychological Reality, and Functional Explanations [
Typological Studies in Language 83],
Roberta Corrigan,
Edith A. Moravcsik,
Hamid Ouali &
Kathleen M. Wheatley (eds), 375–386. Amsterdam: John Benjamins.


Pawley, Andrew & Hodgetts Syder, Frances
1983 Two puzzles for linguistic theory: Native-like selection and native-like fluency. In
Language and Communication,
Jack C. Richards &
Richard W. Schmidt (eds), 191–226. London: Longman.

Pecina, Pavel
2009 Lexical Association Measures: Collocation Extraction [
Studies in Computational and Theoretical Linguistics 4]. Prague: Institute of Formal and Applied Linguistics, Charles University in Prague.

Read, John & Nation, Paul
2006 An investigation of the lexical dimension of the IELTS speaking test. In
IELTS Research Reports, Vol. 6,
Petronella McGovern &
Steve Walsh (eds). IELTS Australia and British Council.
[URL]
Ronan, Patricia & Schneider, Gerold
Schmid, Helmut
1994 Probabilistic part-of-speech tagging using decision trees. In
Proceedings of International Conference on New Methods in Language Processing. Manchester.

Schneider, Gerold
2008 Hybrid Long-Distance Functional Dependency Parsing. PhD dissertation, University of Zurich.

Seretan, Violeta
2011 Syntax-Based Collocation Extraction. Dordrecht: Springer.


Shannon, Claude E.
1951 Prediction and entropy of printed English.
The Bell System Technical Journal 30: 50–64.


Sinclair, John
1991 Corpus, Concordance, Collocation. Oxford: OUP.

Sinclair, John McH. & Mauranen, Anna
Siyanova-Chanturia, Anna & Martinez, Ron
2014 The Idiom Principle revisited.
Applied Linguistics 36(5): 549–569.

Zipf, George Kingsley
1965 The Psycho-Biology of Language: An Introduction to Dynamic Philology. Cambridge MA: The MIT Press.

Zipf, George Kingsley
1949 Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. London: Addison-Wesley.
