Computational extraction of formulaic sequences from corpora
Two case studies of a new extraction algorithm
Stefan Th. Gries | University of California, Santa Barbara | Justus Liebig University
We describe a new algorithm for extracting formulaic language from
corpora. Named MERGE (Multi-word Expressions from the
Recursive Grouping of Elements), it iteratively combines adjacent bigrams
into progressively longer sequences based on lexical association strengths.
We then provide empirical evidence for this approach via two case studies.
First, we compare the performance of MERGE to that of another algorithm by
evaluating the output of each approach against manually annotated
formulaic sequences from the spoken component of the British National
Corpus. Second, we employ two child language corpora to examine whether
MERGE can predict the formulas that the children learn based on caregiver
input. Ultimately, we show that MERGE indeed performs well, offering a
powerful approach for the extraction of formulas.
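
The iterative grouping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of a log-likelihood (G2) score as the association measure, the restriction to attracted pairs, the stopping criteria, and all names (`log_likelihood`, `best_pair`, `merge_sketch`, `max_iterations`, `min_score`) are assumptions made for illustration. On each pass, the most strongly associated pair of adjacent elements is fused into a single element, so that later passes can extend it into a longer sequence.

```python
import math
from collections import Counter


def log_likelihood(k11, k12, k21, k22):
    """G2 score for a 2x2 contingency table of co-occurrence counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for observed, r, c in cells:
        expected = rows[r] * cols[c] / total
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def best_pair(tokens):
    """Return (score, pair) for the most strongly attracted adjacent pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    left = Counter(tokens[:-1])   # counts as first member of a pair
    right = Counter(tokens[1:])   # counts as second member of a pair
    n = len(tokens) - 1           # number of adjacent pairs
    best = (0.0, None)
    for (a, b), k11 in pairs.items():
        # Assumption: keep only pairs observed more often than expected.
        if k11 * n <= left[a] * right[b]:
            continue
        k12 = left[a] - k11
        k21 = right[b] - k11
        k22 = n - k11 - k12 - k21
        score = log_likelihood(k11, k12, k21, k22)
        if score > best[0]:
            best = (score, (a, b))
    return best


def merge_sketch(words, max_iterations=10, min_score=0.0):
    """Greedily fuse the best adjacent pair, one pair per iteration."""
    tokens = [(w,) for w in words]  # every element is a tuple of words
    merged = []
    for _ in range(max_iterations):
        score, pair = best_pair(tokens)
        if pair is None or score <= min_score:
            break
        a, b = pair
        fused = a + b
        merged.append(fused)
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(fused)   # replace the pair with one element
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merged
```

Calling `merge_sketch(text.split())` on a whitespace-tokenized text returns the fused sequences in the order in which they were created. Because fused elements re-enter the token stream, a two-word unit created in one iteration can become part of a three- or four-word unit in a later one.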
Article outline
- 1. Introduction
- 1.1 Counting co-occurrences
- 1.2 N-Gram sizes/configurations and the problem of redundancy
- 1.3 Recent approaches
- 2. The MERGE algorithm
- 3. Case study 1: MERGE vs. AFL
- 3.1 Materials
- 3.2 Results
- 3.3 Interim conclusions
- 4. Case study 2: Exploring MERGE in the context of L1 acquisition
- 4.1 Materials and methods
- 4.2 Results
- 4.3 Discussion
- 5. Conclusion
- Notes
- References