Statistical sequence and parsing models for descriptive linguistics and psycholinguistics

Schneider, Gerold; Grigonyté, Gintaré

doi:10.1075/slcs.177.11sch

Part of

New Approaches to English Linguistics: Building bridges
Edited by Olga Timofeeva, Anne-Christine Gardner, Alpo Honkapohja and Sarah Chevalier
[Studies in Language Companion Series 177] 2016
pp. 281–320

Gerold Schneider | University of Zurich | University of Konstanz

Gintaré Grigonyté | University of Stockholm

This study shows that computational linguistic models are beneficial for descriptive linguistics and psycholinguistics. It applies two models, (1) surprisal and (2) a syntactic parser, to various English genres and to learner language, which allows us to investigate the role of ambiguity and the interplay between the idiom principle and the syntax principle. We find that surprisal and ambiguity are higher for learner language, while parser scores and model fit are lower. In addition, the random application of alternations leads to more ambiguous sentences. Failures to generate optimal orderings in the sense of relevance theory, such as those found in the nonnative-like utterances of language learners, increase processing load for both human and automatic processors. As human and automatic parsing difficulties correlate, we suggest syntactic parsers as psycholinguistic processing models.

Keywords: language processing, statistical models, idiom and syntax principle, ambiguity, syntactic parsing
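
As an illustration of the surprisal measure named in the abstract, the following minimal sketch computes word-level surprisal, -log2 P(word | previous word), from a toy bigram model. The bigram order, the add-one smoothing, the toy sentences and all function names are assumptions made for this example only; they are not taken from the chapter and do not reproduce the authors' setup.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenised sentences (toy model)."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens  # sentence-start marker
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab


def surprisal(prev, word, model):
    """Surprisal in bits, -log2 P(word | prev), with add-one smoothing."""
    context_counts, bigram_counts, vocab = model
    prob = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
    return -math.log2(prob)


def mean_surprisal(tokens, model):
    """Average per-word surprisal of one tokenised sentence."""
    padded = ["<s>"] + tokens
    values = [surprisal(p, w, model) for p, w in zip(padded, padded[1:])]
    return sum(values) / len(values)


# Hypothetical toy training data, invented for this example.
train = [
    "she made a decision".split(),
    "he made a mistake".split(),
    "she made a mistake".split(),
]
model = train_bigram(train)

native_like = "she made a decision".split()
learner_like = "she decision a made".split()  # scrambled order, for contrast

print(mean_surprisal(native_like, model))
print(mean_surprisal(learner_like, model))
```

On this toy data the scrambled sentence receives a higher mean surprisal than the attested word order, which is the direction of the effect the abstract reports for learner language. A realistic study would use a large training corpus and a better smoothing method.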

Article outline

1. Introduction
2. Background and motivation: Language models
- 2.1 A case for statistical language models in linguistics
  - 2.1.1 Significance tests are not enough
    - 2.1.1.1 Assumption of random distribution
    - 2.1.1.2 Assumption of independence from other factors
    - 2.1.1.3 Assumption of free choice
  - 2.1.2 The envelope of variation
  - 2.1.3 Binary local decisions
- 2.2 Models for natural language processing
  - 2.2.1 N-gram models and the idiom principle
  - 2.2.2 Syntactic models: Distributed interdependent decisions
    - 2.2.2.1 Ambiguity
    - 2.2.2.2 The idiom and syntax principle in a tug-of-war
    - 2.2.2.3 Cognitive plausibility
    - 2.2.2.4 Model parameters
    - 2.2.2.5 Local and Global Models in Interaction
- 2.3 L1 and L2 data
3. Data and methodology
- 3.1 Data
- 3.2 Surprisal and UID
- 3.3 High levels of residuals and low model fit of parsers as indicator
4. Results: Two language processing models
- 4.1 Surprisal at the level of word sequences
- 4.2 Syntactic parser as a processing model
  - 4.2.1 Parser accuracy
  - 4.2.2 Parser model fit
5. Ambiguity
- 5.1 Garden-path sentences
- 5.2 Avoidance of ambiguity
- 5.3 Forcing rare constituent order and alternative lexis
6. Conclusions
Notes
References

Published online: 1 November 2016

https://doi.org/10.1075/slcs.177.11sch

References

Aggarwal, Charu C. 2013. Outlier Analysis. Dordrecht: Kluwer.

Altenberg, Bengt & Tapper, Marie. 1998. The use of adverbial connectors in advanced Swedish learners’ written English. In Learner English on Computer [Studies in Language and Linguistics], Sylviane Granger (ed.), 80–93. Harlow: Addison Wesley Longman.

Arppe, Antti, Gilquin, Gaëtanelle, Glynn, Dylan, Hilpert, Martin & Zeschel, Arne. 2010. Cognitive corpus linguistics: Five points of debate on current theory and methodology. Corpora 5(1): 1–27.

Behaghel, Otto. 1930. Von deutscher Wortstellung (On German word order). Zeitschrift für Deutschkunde, Zeitschrift für deutschen Unterricht (44): 81–89.

Borensztajn, Gideon, Zuidema, Willem & Bod, Rens. 2009. Children’s grammars grow more abstract with age: Evidence from an automatic procedure for identifying the productive units of language. Topics in Cognitive Science 1(1): 175–188.

Bod, Rens, Scha, Remko & Sima’an, Khalil (eds). 2003. Data-Oriented Parsing [Center for the Study of Language and Information, Studies in Computational Linguistics (CSLI-SCL)]. Chicago IL: Chicago University Press.

Bresnan, Joan, Cueni, Anna, Nikitina, Tatiana & Baayen, Harald. 2007. Predicting the dative alternation. In Cognitive Foundations of Interpretation, Gosse Bouma, Irene Kraemer & Joost Zwarts (eds), 69–94. Amsterdam: Royal Netherlands Academy of Science.

Bresnan, Joan & Nikitina, Tatiana. 2009. The gradience of the dative alternation. In Reality Exploration and Discovery: Pattern Interaction in Language and Life, Linda Uyechi & Lian Hee Wee (eds), 161–184. Stanford CA: CSLI.

Buchholz, Sabine. 2002. Memory-Based Grammatical Relation Finding. PhD dissertation, University of Tilburg.

Bybee, Joan. 2006. From usage to grammar: The mind’s response to repetition. Language 82(4): 711–733.

Bybee, Joan. 2007. Frequency of Use and the Organization of Language. Oxford: OUP.

Carroll, John, Minnen, Guido & Briscoe, Edward. 2003. Parser evaluation: Using a grammatical relation annotation scheme. In Treebanks: Building and Using Parsed Corpora, Anne Abeillé (ed.), 299–316. Dordrecht: Kluwer.

Collins, Michael. 1999. Head-Driven Statistical Models for Natural Language Parsing. PhD dissertation, University of Pennsylvania.

Conklin, Kathy & Schmitt, Norbert. 2012. The processing of formulaic language. Annual Review of Applied Linguistics 32: 45–61.

Church, Kenneth. 2000. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than to p². Proceedings of the 17th Conference on Computational Linguistics COLING, Vol. 1, 180–186.

Demberg, Vera, Keller, Frank & Koller, Alexander. 2013. Parsing with psycholinguistically motivated tree-adjoining grammar. Computational Linguistics 39(4): 1025–1066.

Ellis, Nick C. 2012. Formulaic language and second language acquisition: Zipf and the phrasal Teddy Bear. Annual Review of Applied Linguistics 32: 17–44.

Evert, Stefan. 2006. How random is a corpus? The library metaphor. Zeitschrift für Anglistik und Amerikanistik 54(2): 177–190.

Federico, Marcello & Cettolo, Mauro. 2007. Efficient handling of N-gram language models for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, Chris Callison-Burch, Philipp Koehn, Christof Monz, & Cameron Shaw Fordyce (eds), 88–95. Prague: Association for Computational Linguistics.

Francis, Gill. 1993. A corpus-driven approach to grammar – principles, methods and examples. In Text and Technology, Mona Baker, Gill Francis & Elena Tognini-Bonelli (eds), 137–156. Amsterdam: John Benjamins.

Granger, Sylviane. 2009. Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Phraseology: Theory, Analysis, and Applications, Anthony Paul Cowie (ed.), 185–204. Tokyo: Kurosio.

Green, Matthew J. 2014. An eye-tracking evaluation of some parser complexity metrics. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), Sandra Williams, Advaith Siddharthan & Ani Nenkova (eds), 38–46. Stroudsburg PA: Association for Computational Linguistics.

Grice, Paul. 1975. Logic and conversation. In Syntax and Semantics, 3: Speech Acts, Peter Cole & Jerry Morgan (eds), 41–58. New York NY: Academic Press.

Gries, Stefan T. 2006. Exploring variability within and between corpora: Some methodological considerations. Corpora 1(2): 109–151.

Gries, Stefan T. 2010. Methodological skills in corpus linguistics: A polemic and some pointers towards quantitative methods. In Corpus Linguistics in Language Teaching, Tony Harris & María Moreno Jaén (eds), 121–146. Frankfurt: Peter Lang.

Gries, Stefan T. 2012. Corpus linguistics, theoretical linguistics, and cognitive/psycholinguistics: Towards more and more fruitful exchanges. In Corpus Linguistics and Variation in English: Theory and Description, Joybrato Mukherjee & Magnus Huber (eds), 41–63. Amsterdam: Rodopi.

Gries, Stefan T. In press. Quantitative designs and statistical techniques. In The Cambridge Handbook of Corpus Linguistics, Douglas Biber & Randi Reppen (eds). Cambridge: CUP.

Hawkins, John A. 1994. A Performance Theory of Order and Constituency. Cambridge: CUP.

Hoey, Michael. 2005. Lexical Priming: A New Theory of Words and Language. New York NY: Routledge.

Hundt, Marianne, Schneider, Gerold & Seoane, Elena. 2016. The use of the be-passive in academic Englishes: Local vs. global usage in an international language. Corpora 11(1): 31–63.

Hunston, Susan & Francis, Gill. 2000. A Corpus-Driven Approach to the Lexical Grammar of English [Studies in Corpus Linguistics 4]. Amsterdam: John Benjamins.

Izumi, Emi, Uchimoto, Kiyotaka & Isahara, Hitoshi. 2005. Error annotation for corpus of Japanese learner English. In Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora, Kyonghee Paik, Francis Bond & Stephan Oepen (eds), 71–80. Jeju: Asian Federation of Natural Language Processing.

Ishikawa, Shin. 2009. Vocabulary in interlanguage: A study on corpus of English essays written by Asian university students (CEEAUS). In Phraseology, Corpus Linguistics and Lexicography: Papers from Phraseology 2009 in Japan, Katsumasa Yagi & Takaaki Kanzaki (eds), 87–100. Nishinomiya: Kwansei Gakuin University Press.

Jaeger, Tim Florian. 2010. Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology 61(1): 23–62.

Jucker, Andreas H. 1993. The genitive versus the of-construction in newspaper language. In The Noun Phrase in English: Its Structure and Variability, Andreas H. Jucker (ed.), 121–136. Heidelberg: Universitätsverlag Winter.

Keller, Frank. 2003. A probabilistic parser as a model of global processing difficulty. In Proceedings of the 25th Annual Conference of the Cognitive Science Society, Richard Alterman & David Kirsh (eds), 646–651. Boston MA: Cognitive Science Society.

Keller, Frank. 2010. Cognitively plausible models of human language processing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: Short Papers, Min-Yen Kang (ed.), 60–67. Stroudsburg PA: Association for Computational Linguistics.

Kreyer, Rolf. 2003. Genitive and of-construction in modern written English: Processability and human involvement. International Journal of Corpus Linguistics 8(2): 169–207.

Kreyer, Rolf. 2010. Introduction to English Syntax [Textbooks in English Language and Linguistics]. Frankfurt: Peter Lang.

Koehn, Philipp & Hoang, Hieu. 2007. Factored translation models. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Min-Yen Kang (ed.), 868–876. Stroudsburg PA: Association for Computational Linguistics.

Labov, William. 1969. Contraction, deletion, and inherent variability of the English copula. Language 45(4): 715–762.

Leech, Geoffrey, Hundt, Marianne, Mair, Christian & Smith, Nicholas. 2009. Change in Contemporary English: A Grammatical Study. Cambridge: CUP.

Lehmann, Hans Martin & Schneider, Gerold. 2012. Syntactic variation and lexical preference in the dative-shift alternation. In Studies in Variation, Contacts and Change in English, Papers from the 31st International conference on English language research on computerized corpora (ICAME 31) Giessen, Germany, Joybrato Mukherjee & Magnus Huber (eds), 65–75. Amsterdam: Rodopi.

Levin, Beth C. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago IL: University of Chicago Press.

Levy, Roger & Jaeger, T. Florian. 2007. Speakers optimize information density through syntactic reduction. In Advances in Neural Information Processing Systems (NIPS) 19, Bernhard Schölkopf, John Platt & Thomas Hofmann (eds), 849–856. Cambridge MA: The MIT Press.

Marcus, Mitch, Santorini, Beatrice & Marcinkiewicz, Mary Ann. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19: 313–330.

Mariño, José, Banchs, Rafael E., Crego, Josep M., de Gispert, Adrià, Lambert, Patrik, Fonollosa, José A. R. & Costa-jussà, Marta R. 2006. N-gram-based machine translation. Computational Linguistics 32(4): 527–549.

Mel’čuk, Igor. 1998. Collocations and lexical functions. In Phraseology: Theory, Analysis, and Applications, Anthony Paul Cowie (ed.), 23–53. Oxford: Clarendon.

Meseguer, Enrique, Carreiras, Manuel & Clifton, Charles. 2002. Overt reanalysis strategies and eye movements during the reading of mild garden path sentences. Memory & Cognition 30(4): 551–561.

Millar, Neil. 2011. The processing of malformed learner collocations. Applied Linguistics 32(2): 129–148.

Mukherjee, Joybrato. 2005. English Ditransitive Verbs: Aspects of Theory, Description and a Usage-based Model. Amsterdam: Rodopi.

Newell, Allen. 1990. Unified Theories of Cognition. Cambridge MA: Harvard University Press.

Ng, Hwee Tou, Wu, Siew Mei, Briscoe, Ted, Hadiwinoto, Christian, Susanto, Raymond Hendy & Bryant, Christopher (eds). 2014. Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task. <[URL]> (12 February 2016).

Pawley, Andrew & Syder, Frances Hodgetts. 1983. Two puzzles for linguistic theory: Native-like selection and native-like fluency. In Language and Communication, Jack C. Richards & Richard W. Schmidt (eds), 191–226. London: Longman.

Rohdenburg, Günter & Mondorf, Britta (eds). 2003. Determinants of Grammatical Variation in English [Topics in English Linguistics 43]. Berlin: Mouton de Gruyter.

Rosenbach, Anette. 2002. Genitive Variation in English. Conceptual Factors in Synchronic and Diachronic Studies. Berlin: Mouton de Gruyter.

Röthlisberger, Melanie & Schneider, Gerold. 2013. Of-genitive versus s-genitive: A corpus-based analysis of possessive constructions in 20th-century English. In New Methods in Historical Corpora [Corpus Linguistics and Interdisciplinary Perspectives on Language 3], Paul Bennett, Martin Durrell, Silke Scheible & Richard J. Whitt (eds), 163–180. Tübingen: Narr.

Sankoff, David. 1988. Sociolinguistics and syntactic variation. In Linguistics: The Cambridge Survey, Vol. 4: Language: The Socio-Cultural Context, Frederick J. Newmeyer (ed.), 140–161. Cambridge: CUP.

Schneider, Gerold, Rinaldi, Fabio, Kaljurand, Kaarel & Hess, Michael. 2005. Closing the gap: Cognitively adequate, fast broad-coverage grammatical role parsing. In ICEIS Workshop on Natural Language Understanding and Cognitive Science (NLUCS 2005). Miami FL.

Schneider, Gerold. 2008. Hybrid Long-distance Functional Dependency Parsing. PhD dissertation, University of Zurich.

Schneider, Gerold. 2012. Using semantic resources to improve a syntactic dependency parser. In SEM-II workshop at LREC 2012, Viktor Pekar, Verginica Barbu Mititelu & Octavian Popescu (eds), 67–76. Istanbul.

Schneider, Gerold & Hundt, Marianne. 2012. “Off with their heads”: Profiling TAM in ICE corpora. In Mapping Unity and Diversity World-wide [Varieties of English Around the World 43], Marianne Hundt & Ulrike Gut (eds), 1–34. Amsterdam: John Benjamins.

Seidenberg, Mark & MacDonald, Maryellen. 1999. A probabilistic constraints approach to language acquisition and processing. Cognitive Science 23(4): 569–588.

Sennrich, Rico. 2013. Domain Adaptation for Translation Models in Statistical Machine Translation. PhD dissertation, University of Zurich.

Seoane, Elena. 2009. Syntactic complexity, discourse status and animacy as determinants of grammatical variation in modern English. English Language and Linguistics 13(3): 365–384.

Sinclair, John. 1991. Corpus, Concordance, Collocation. Oxford: OUP.

Sinclair, John. 2008. The phrase, the whole phrase and nothing but the phrase. In Phraseology: An Interdisciplinary Perspective, Sylviane Granger & Fanny Meunier (eds), 407–410. Amsterdam: John Benjamins.

Siyanova-Chanturia, Anna & Martinez, Ron. 2014. The Idiom Principle revisited. Applied Linguistics 36(5): 549–569.

Sperber, Dan & Wilson, Deirdre. 1995. Relevance: Communication and Cognition, 2nd edn. Oxford: Blackwell.

Sperber, Dan & Wilson, Deirdre. 2002. Pragmatics, modularity and mind-reading. Mind and Language 17(1): 3–33.

Szmrecsanyi, Benedikt. 2006. Morphosyntactic Persistence in Spoken English: A Corpus Study at the Intersection of Variationist Sociolinguistics, Psycholinguistics, and Discourse Analysis. Berlin: Mouton de Gruyter.

Tomasello, Michael. 2000. The item-based nature of children’s early syntactic development. Trends in Cognitive Sciences 4: 156–163.

Wasow, Thomas. 1997. Remarks on grammatical weight. Language Variation and Change 9: 81–105.

Wasow, Thomas & Arnold, Jennifer. 2003. Post-verbal constituent ordering in English. In Determinants of Grammatical Variation in English [Topics in English Linguistics 43], Günter Rohdenburg & Britta Mondorf (eds), 119–154. Berlin: Mouton de Gruyter.

Wray, Alison. 2002. Formulaic Language and the Lexicon. Cambridge: CUP.