Computational extraction of formulaic sequences from corpora
Two case studies of a new extraction algorithm
Alexander Wahl (Donders Institute for Brain, Cognition and Behaviour, Radboud University)
Stefan Th. Gries (University of California Santa Barbara; Justus Liebig University)
We describe a new algorithm for the extraction of formulaic
language from corpora. Entitled MERGE (Multi-word Expressions from the
Recursive Grouping of Elements), it iteratively combines adjacent bigrams
into progressively longer sequences based on lexical association strengths.
We then provide empirical evidence for this approach via two case studies.
First, we compare the performance of MERGE to that of another algorithm by
evaluating the output of each approach against manually annotated
formulaic sequences from the spoken component of the British National
Corpus. Second, we employ two child language corpora to examine whether
MERGE can predict the formulas that the children learn based on caregiver
input. Ultimately, we show that MERGE indeed performs well, offering a
powerful approach for the extraction of formulas.
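
To make the iterative idea concrete, the following is a minimal sketch of greedy, association-based merging of adjacent elements. It is an illustration under stated assumptions, not the MERGE implementation itself: the function names (`pmi_scores`, `merge_once`, `extract_sequences`) are hypothetical, pointwise mutual information stands in for whichever association measure the algorithm actually uses, and details such as stopping criteria or the handling of discontinuous sequences are not modeled.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent pairs by pointwise mutual information.

    PMI is an assumed stand-in association measure for this sketch.
    """
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (a, b), count in pairs.items():
        if count >= min_count:  # ignore pairs seen too rarely to score
            scores[(a, b)] = math.log2(count * n / (unigrams[a] * unigrams[b]))
    return scores


def merge_once(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one joined token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def extract_sequences(tokens, iterations=10, min_count=2):
    """Repeatedly merge the strongest adjacent pair; return merges in order."""
    merged = []
    for _ in range(iterations):
        scores = pmi_scores(tokens, min_count)
        if not scores:
            break  # no pair is frequent enough to merge
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(scores, key=lambda p: (scores[p], p))
        tokens = merge_once(tokens, best)
        merged.append(best[0] + " " + best[1])
    return merged


toy_corpus = (
    "you know what i mean and you know what i mean but i know you".split()
)
result = extract_sequences(toy_corpus)
print(result)
```

Because a merged pair re-enters the token stream as a single element, later iterations can combine it with its neighbors, which is how progressively longer sequences emerge from repeated pairwise merging.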