Automatic medical term extraction from Vietnamese clinical texts

Vo, Chau; Cao, Tru; Truong, Ngoc; Ngo, Trung; Bui, Dai

doi:10.1075/term.20037.vo

Article published in:

Terminology
Vol. 28:2 (2022), pp. 299–327

Chau Vo | Ho Chi Minh City University of Technology, Vietnam National University

Tru Cao | The University of Texas Health Science Center at Houston

Ngoc Truong | FPT University

Trung Ngo | Tokyo University of Agriculture and Technology

Dai Bui | Unit Corporation

In this paper, we propose the first method for the automatic discovery and extraction of Vietnamese medical terms from clinical texts. The method combines linguistic filtering, based on open patterns that we define, with nested term extraction and statistical ranking using C-value. It requires no annotated corpora, external data resources, or parameter settings, and it places no restriction on term length. Besides being tailored to Vietnamese medical terms, the method is novel in using Pointwise Mutual Information (PMI) to split nested terms and a disjunctive acceptance condition to extract them. Evaluated on real Vietnamese electronic medical records, it achieves a precision of about 74% and a recall of about 92%, and it proves to be stably effective on small datasets. It outperforms previous methods in the same category, that is, methods that use neither annotated corpora nor external data resources. Our method and empirical analysis can lay a foundation for further research and development in Vietnamese medical term discovery and extraction.

Keywords: automatic term extraction, electronic medical record, open linguistic pattern, pointwise mutual information, statistical ranking
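
As a rough illustration of the two measures named in the abstract, the sketch below computes PMI for a pair of adjacent units and C-value for a set of candidate terms. It assumes the textbook definitions (Church and Hanks for PMI; Frantzi, Ananiadou, and Mima for C-value). This page does not give the exact formulation or the splitting and acceptance rules used in the paper, so the function names, the toy tokens, and the counts are all hypothetical.

```python
import math


def pmi(pair_count, left_count, right_count, total):
    """Base-2 PMI of two adjacent units, from raw counts (textbook definition)."""
    p_xy = pair_count / total
    p_x = left_count / total
    p_y = right_count / total
    return math.log2(p_xy / (p_x * p_y))


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(candidates):
    """Score candidates with the standard C-value formula.

    `candidates` maps a tuple of words to its corpus frequency.
    Assumption: this is the original formulation, in which the weight
    log2(length) makes every single-word candidate score zero; the
    paper may treat term length differently.
    """
    scores = {}
    for term, freq in candidates.items():
        # Frequencies of longer candidates that contain this term.
        nesting = [
            f for other, f in candidates.items()
            if len(other) > len(term) and contains(other, term)
        ]
        weight = math.log2(len(term))
        if nesting:
            # Discount occurrences explained by longer candidate terms.
            scores[term] = weight * (freq - sum(nesting) / len(nesting))
        else:
            scores[term] = weight * freq
    return scores


# Toy counts, invented for illustration only.
association = pmi(pair_count=50, left_count=50, right_count=50, total=100)  # 1.0

toy_candidates = {("a", "b", "c"): 5, ("b", "c"): 8}
toy_scores = c_value(toy_candidates)  # ("b", "c") scores 3.0
```

A positive PMI means the two units co-occur more often than independence would predict, and zero means they look independent. In the toy C-value example, the nested candidate ("b", "c") is discounted by the frequency of the longer candidate that contains it.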

Article outline

1. Introduction
2. Related works
- 2.1 Linguistics-based
- 2.2 Statistics-based
- 2.3 Machine learning-based
- 2.4 Hybrid
3. The proposed method
- 3.1 Method overview
- 3.2 Preprocessing
- 3.3 Linguistics-based candidate term extraction
  - Part-of-Speech tagging
  - Open pattern-based term extraction
  - PMI-based nested term extraction
  - Stop word-based filtering
- 3.4 Statistics-based term ranking
4. Empirical evaluation
- 4.1 Data descriptions
- 4.2 Experiment settings and results
  - Self-evaluation
  - Comparative evaluation
5. Conclusions
References

Published online: 9 June 2022

https://doi.org/10.1075/term.20037.vo
