A new computing method for extracting contiguous phraseological sequences from academic text corpora
This study develops a computing method for extracting contiguous phraseological sequences (PSs) of various lengths from academic text corpora by measuring the internal associations of n-grams. We construct a new normalizing algorithm based on a probability-weighted average to refine the mutual information (MI) measure and improve precision in extracting PSs from corpora. The method is applied to a medium-sized corpus of academic English. Results indicate that the refined MI measure yields statistics that better reveal the internal associations within an n-gram, regardless of its size. Lexico-grammatical sequences extracted with this method are more complete and less arbitrary in terms of grammar and semantics. The method can be applied to a variety of linguistic phenomena, ranging from well-established phrases to likely phrasal entities, and thus has potential practical applications in corpus-based studies of phraseology and in natural language processing.
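The abstract does not spell out the algorithm, so the following Python sketch is only one plausible reading of the general idea, not the authors' procedure. It assumes that an n-gram is split at each internal position into a left and a right part (a "pseudo-bigram"), that pointwise MI is computed for each split, and that the per-split values are averaged with weights proportional to p(left) * p(right). The function names, the weighting scheme, and the use of the token count as the denominator for all probabilities are illustrative assumptions.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every contiguous n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def weighted_mi(gram, counts, total_tokens):
    """Probability-weighted average of split-wise pointwise MI (hypothetical).

    Assumption: probabilities are relative frequencies over total_tokens,
    a simplification that ignores the slightly smaller number of positions
    available to longer n-grams.
    """
    if len(gram) < 2:
        raise ValueError("need an n-gram of length 2 or more")
    p_whole = counts[gram] / total_tokens
    if p_whole == 0:
        raise ValueError("n-gram does not occur in the corpus")

    weights, mi_values = [], []
    for k in range(1, len(gram)):
        # Treat (left part, right part) as a pseudo-bigram.
        p_left = counts[gram[:k]] / total_tokens
        p_right = counts[gram[k:]] / total_tokens
        expected = p_left * p_right
        weights.append(expected)
        mi_values.append(math.log2(p_whole / expected))

    # Weighted average; for a true bigram this reduces to ordinary PMI.
    return sum(w * m for w, m in zip(weights, mi_values)) / sum(weights)
```

Under these assumptions a bigram has a single split, so the score equals standard pointwise MI, while longer n-grams receive one score that reflects all of their internal split points rather than an arbitrary one.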
Keywords: internal association, n-grams, phraseology, probability-weighted average, pseudo-bigram transformation
Published online: 05 December 2013