In this paper, we propose the first method for automatic Vietnamese medical term discovery and extraction from
clinical texts. The method combines linguistic filtering, based on open patterns that we define, with nested term extraction and
statistical ranking using C-value. It requires no annotated corpora, external data resources, parameter
settings, or term-length restrictions. Besides being tailored to Vietnamese medical terms, the method is novel in using
Pointwise Mutual Information to split nested terms and a disjunctive acceptance condition to extract them. Evaluated on real
Vietnamese electronic medical records, it achieves a precision of about 74% and a recall of about 92%, and it remains stably effective
on small datasets. It outperforms previous methods in the same category, that is, those that use neither annotated corpora nor external data
resources. Our method and empirical evaluation can lay a foundation for further research and development in Vietnamese
medical term discovery and extraction.
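
As a rough illustration of the two statistics named above, the following sketch computes the textbook C-value score for candidate terms and the Pointwise Mutual Information of two adjacent units. It is not the implementation evaluated in this paper: the function names (`pmi`, `is_nested`, `c_value`) and the toy tokens and counts are hypothetical, and the way PMI is used to split nested terms and the disjunctive acceptance condition are not reproduced here.

```python
import math


def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information (base 2) of two adjacent units.

    Counts are assumed to come from the same corpus of `total` observations.
    """
    p_xy = pair_count / total
    p_x = left_count / total
    p_y = right_count / total
    return math.log2(p_xy / (p_x * p_y))


def is_nested(short, long):
    """True if word tuple `short` occurs contiguously inside the longer tuple `long`."""
    n = len(short)
    if n >= len(long):
        return False
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(freq):
    """Standard C-value for each candidate term.

    `freq` maps candidate terms (tuples of words) to corpus frequencies.
    With the classic log2(length) factor, single-word candidates score 0;
    variants that avoid this exist but are not shown here.
    """
    scores = {}
    for term, f_term in freq.items():
        # Longer candidates that contain this term as a nested term.
        containers = [other for other in freq if is_nested(term, other)]
        if containers:
            # Subtract the mean frequency of the containing candidates.
            adjusted = f_term - sum(freq[c] for c in containers) / len(containers)
        else:
            adjusted = f_term
        scores[term] = math.log2(len(term)) * adjusted
    return scores


# Toy data (hypothetical placeholder tokens, not from the paper's corpus).
toy_freq = {("w1", "w2"): 5, ("w1", "w2", "w3"): 2}
toy_scores = c_value(toy_freq)
toy_pmi = pmi(pair_count=4, left_count=8, right_count=8, total=64)
```

In this toy example the two-word candidate is penalized for the occurrences in which it appears only inside the three-word candidate, which is the effect C-value ranking relies on when nested terms are present.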