A new computing method for extracting contiguous phraseological sequences from academic text corpora

Wei, Naixing; Li, Jingjie

doi:10.1075/ijcl.18.4.03wei

Article published in:

International Journal of Corpus Linguistics
Vol. 18:4 (2013), pp. 506–535

Naixing Wei | Beihang University

Jingjie Li | Donghua University

This study aims to develop a new computing method for extracting contiguous phraseological sequences (PSs) of various lengths from academic text corpora by measuring the internal associations of n-grams. We construct a new normalizing algorithm, based on a probability-weighted average, to refine the mutual information (MI) measure and enhance precision in extracting PSs from corpora. The method is applied to data from a medium-sized corpus of academic English. Results indicate that the new MI measure provides statistics that better reveal the internal associations within an n-gram, regardless of its size. Lexico-grammatical sequences extracted with this method are more complete and less arbitrary in terms of grammar and semantics. The method can be applied to a variety of linguistic phenomena, ranging from well-established phrases to likely phrasal entities, and thus has potential practical applications in corpus-based studies of phraseology and in natural language processing.

Keywords: internal association, n-grams, phraseology, probability-weighted average, pseudo-bigram transformation
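
The abstract names a probability-weighted average and a pseudo-bigram transformation but does not give the formula. The sketch below is therefore only one possible reading, not the authors' algorithm. It assumes that an n-gram is cut at each of its n-1 internal points into two parts, that each cut contributes the product of the probabilities of its two parts, and that these products are averaged with weights proportional to the products themselves. The function names, the weighting scheme, the toy sentence, and the use of the token count as the denominator for every n-gram length are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every contiguous n-gram of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def weighted_mi(ngram, counts, total):
    """MI of an attested n-gram against a weighted average of its two-part splits."""
    if len(ngram) < 2:
        raise ValueError("an n-gram needs at least two words")
    p_whole = counts[ngram] / total
    # Pseudo-bigram view: cut the n-gram at each of its n-1 internal points
    # and multiply the probabilities of the left and right parts.
    products = [
        (counts[ngram[:i]] / total) * (counts[ngram[i:]] / total)
        for i in range(1, len(ngram))
    ]
    # ASSUMPTION (not from the source): each split is weighted by its own
    # share of the summed products.
    average = sum(p * p for p in products) / sum(products)
    return math.log2(p_whole / average)


# Toy data, invented for illustration only.
tokens = "on the other hand the results on the other hand suggest".split()
counts = ngram_counts(tokens, 4)
total = len(tokens)  # simplification: one denominator for all n-gram lengths

bigram_score = weighted_mi(("on", "the"), counts, total)
fourgram_score = weighted_mi(("on", "the", "other", "hand"), counts, total)
```

Under these assumptions a bigram has a single split, so the score reduces to ordinary pointwise MI, and longer n-grams are scored on the same scale. The function expects an n-gram that occurs in the counted text; an unattested sequence has zero probability and no defined logarithm.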

Published online: 5 December 2013

https://doi.org/10.1075/ijcl.18.4.03wei
