Computational extraction of formulaic sequences from corpora: Two case studies of a new extraction algorithm

Wahl, Alexander; Gries, Stefan Th.

doi:10.1075/ivitra.24.05wah

Part of

Computational Phraseology
Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 83–110

Alexander Wahl | Donders Institute for Brain, Cognition and Behaviour, Radboud University

Stefan Th. Gries | University of California Santa Barbara | Justus Liebig University

We describe a new algorithm for extracting formulaic language from corpora. Called MERGE (Multi-word Expressions from the Recursive Grouping of Elements), it iteratively combines adjacent bigrams into progressively longer sequences on the basis of lexical association strength. We then provide empirical support for this approach in two case studies. First, we compare MERGE with another algorithm, the adjusted frequency list (AFL), by evaluating the output of each against manually annotated formulaic sequences from the spoken component of the British National Corpus. Second, we use two child language corpora to examine whether MERGE can predict, from caregiver input, the formulas that the children learn. Overall, we show that MERGE performs well, offering a powerful approach to the extraction of formulas.

Keywords: formulaic sequences, collocation extraction, lexical association, child language, MERGE, adjusted frequency list
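
The core idea stated in the abstract (repeatedly fusing the most strongly associated pair of adjacent elements into a single longer element) can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of a signed log-likelihood score as the association measure, the function names (`log_likelihood`, `merge_step`, `merge`), the fixed iteration count, and the toy utterances are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(pair, left, right, total):
    """Signed G2 score for one adjacent pair (assumed association measure).

    pair  = frequency of the pair itself
    left  = frequency of all pairs sharing its first element
    right = frequency of all pairs sharing its second element
    total = frequency of all adjacent pairs
    """
    observed = [pair, left - pair, right - pair, total - left - right + pair]
    rows = [left, total - left]
    cols = [right, total - right]
    expected = [r * c / total for r in rows for c in cols]
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    # Negative sign marks pairs that occur less often than expected.
    return g2 if observed[0] >= expected[0] else -g2


def merge_step(sequences):
    """Fuse every occurrence of the top-scoring adjacent pair into one element."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    if not pairs:
        return sequences, None
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), n in pairs.items():
        left[a] += n
        right[b] += n
    best = max(
        pairs,
        key=lambda p: log_likelihood(pairs[p], left[p[0]], right[p[1]], total),
    )
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])  # tuple concatenation
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, best


def merge(utterances, iterations):
    """Return the merged sequences in the order in which they were created."""
    # Each element is a tuple of words; single words start as 1-tuples.
    sequences = [[(w,) for w in utt.split()] for utt in utterances]
    lexicon = []
    for _ in range(iterations):
        sequences, best = merge_step(sequences)
        if best is None:
            break
        lexicon.append(best[0] + best[1])
    return lexicon


# Hypothetical toy data, not drawn from the corpora used in the chapter.
toy = ["you know that", "you know this", "you know him", "see that", "see this"]
first_merge = merge(toy, 1)
```

Because merged elements re-enter the pool of candidates, later iterations can fuse an already merged element with a neighbour, which is how sequences longer than two words arise in this sketch. In the sketch the iteration count is simply a user-supplied cap.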

Article outline

1. Introduction
- 1.1 Counting co-occurrences
- 1.2 N-gram sizes/configurations and the problem of redundancy
- 1.3 Recent approaches
2. The MERGE algorithm
3. Case study 1: MERGE vs. AFL
- 3.1 Materials
- 3.2 Results
- 3.3 Interim conclusions
4. Case study 2: Exploring MERGE in the context of L1 acquisition
- 4.1 Materials and methods
- 4.2 Results
- 4.3 Discussion
5. Conclusion
Notes
References

Published online: 8 May 2020
