Computational extraction of formulaic sequences from corpora
Two case studies of a new extraction algorithm
Alexander Wahl
Donders Institute for Brain, Cognition and Behaviour, Radboud
Stefan Th. Gries
University of California Santa Barbara
Justus Liebig University
We describe a new algorithm for the extraction of formulaic
language from corpora. Called MERGE (Multi-word Expressions from the
Recursive Grouping of Elements), it iteratively combines adjacent bigrams
into progressively longer sequences based on lexical association strengths.
We then evaluate this approach in two case studies.
First, we compare the performance of MERGE to that of another algorithm by
checking the output of each approach against manually annotated
formulaic sequences from the spoken component of the British National
Corpus. Second, we use two child language corpora to examine whether
MERGE can predict, from caregiver input, the formulas that the children
learn. Overall, we show that MERGE performs well, offering a
powerful approach for the extraction of formulas.
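
The iterative combination described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Dunning-style log-likelihood score as the association measure (the abstract does not name one), merges only strictly adjacent elements, and fuses one winning pair per iteration. The names `log_likelihood` and `extract_sequences` and the parameters `max_iterations` and `min_score` are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def log_likelihood(o11, f1, f2, n):
    """Signed G2 score for a 2x2 table (assumed association measure).

    o11: pair frequency; f1: frequency of the left element in first position;
    f2: frequency of the right element in second position; n: number of pairs.
    The score is negated when the pair occurs less often than expected.
    """
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    expected = [f1 * f2 / n, f1 * (n - f2) / n,
                (n - f1) * f2 / n, (n - f1) * (n - f2) / n]
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return g2 if o11 >= expected[0] else -g2


def extract_sequences(tokens, max_iterations=10, min_score=0.0):
    """Greedily fuse the most strongly associated adjacent pair, repeatedly."""
    elements = [(t,) for t in tokens]  # every element is a tuple of words
    merged = []
    for _ in range(max_iterations):
        pairs = list(zip(elements, elements[1:]))
        if not pairs:
            break
        pair_freq = Counter(pairs)
        left_freq = Counter(left for left, _ in pairs)
        right_freq = Counter(right for _, right in pairs)
        best, best_score = None, min_score
        for (left, right), o11 in pair_freq.items():
            score = log_likelihood(o11, left_freq[left], right_freq[right], len(pairs))
            if score > best_score:
                best, best_score = (left, right), score
        if best is None:  # nothing scores above the threshold
            break
        merged.append((best[0] + best[1], best_score))
        # Rewrite the corpus so the winning pair becomes a single element.
        fused, i = [], 0
        while i < len(elements):
            if i + 1 < len(elements) and (elements[i], elements[i + 1]) == best:
                fused.append(elements[i] + elements[i + 1])
                i += 2
            else:
                fused.append(elements[i])
                i += 1
        elements = fused
    return merged


if __name__ == "__main__":
    toy = "you know what i mean you know what i think you know".split()
    for sequence, score in extract_sequences(toy, max_iterations=3):
        print(" ".join(sequence), round(score, 2))
```

Because each fused pair re-enters the corpus as a single element, later iterations can combine it with its neighbours, which is how longer sequences emerge from bigrams in this sketch.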
- 1.1 Counting co-occurrences
- 1.2 N-gram sizes/configurations and the problem of redundancy
- 1.3 Recent approaches
- 2. The MERGE algorithm
- 3. Case study 1: MERGE vs. AFL
- 3.3 Interim conclusions
- 4. Case study 2: Exploring MERGE in the context of L1 acquisition
- 4.1 Materials and methods