HYPHEN
A flexible, hybrid method to map phenotype concept mentions to terminological resources
Narrative clinical records and biomedical articles are rich sources of information about phenotypes, i.e., markers that distinguish individuals with specific medical conditions from the general population. Phenotypes help clinicians to provide personalised treatments. However, locating information about them within huge document repositories is difficult, since each phenotypic concept can be mentioned in many ways. Normalisation methods automatically map divergent phrases to unique concepts in domain-specific terminologies, so that all mentions of a concept of interest can be located and linked. We have developed a hybrid normalisation method (HYPHEN) to handle concept mentions with wide-ranging characteristics across different text types. HYPHEN integrates various normalisation techniques that handle surface-level variations (e.g., differences in word order, word forms or acronyms/abbreviations) and lexical-level variations (where terms have similar meanings, but potentially unrelated forms). HYPHEN achieves robust performance on both biomedical academic text and narrative clinical records, and can significantly outperform related methods.
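To make the distinction between surface-level and lexical-level variation concrete, the following is a minimal illustrative sketch and not HYPHEN's actual implementation. The concept identifiers, the toy terminology, the abbreviation and synonym tables, and all function names are hypothetical stand-ins for real terminological resources; the singularisation rule is deliberately naive.

```python
import re

# Toy terminology: hypothetical concept IDs mapped to preferred terms.
TERMINOLOGY = {
    "C001": "chronic obstructive pulmonary disease",
    "C002": "shortness of breath",
    "C003": "lung infection",
}

# Hypothetical lookup tables standing in for real lexical resources.
ABBREVIATIONS = {"copd": "chronic obstructive pulmonary disease"}
SYNONYMS = {"dyspnea": "shortness of breath", "pulmonary": "lung"}
STOPWORDS = {"of", "the", "a", "an", "in"}


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def singular(token):
    """Naive plural-to-singular rule (an assumption, not a full stemmer)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token


def key(text):
    """Order-insensitive key: drop stopwords, singularise, sort tokens."""
    return tuple(sorted(singular(t) for t in tokens(text) if t not in STOPWORDS))


# Index the terminology by the same key that is applied to mentions.
INDEX = {key(term): cid for cid, term in TERMINOLOGY.items()}


def normalise(mention):
    """Return a concept ID for the mention, or None if nothing matches."""
    lowered = mention.lower().strip()
    candidates = [mention]
    # Surface-level variation: expand a known abbreviation.
    if lowered in ABBREVIATIONS:
        candidates.append(ABBREVIATIONS[lowered])
    # Lexical-level variation: substitute known synonyms token by token.
    candidates.append(" ".join(SYNONYMS.get(t, t) for t in tokens(lowered)))
    for candidate in candidates:
        concept_id = INDEX.get(key(candidate))
        if concept_id is not None:
            return concept_id
    return None
```

In this sketch, word-order and plural variants such as "infection of the lungs" are resolved by the shared key alone, whereas "COPD" and "dyspnea" only match after the abbreviation and synonym tables are consulted. A hybrid method combines such steps so that each covers mentions the others miss.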
Article outline
- 1. Introduction
- 2. Related work
- 3. Methods
- 3.1 Lexical variant generation
- 3.1.1 Transformation between English and Neoclassical terminology
- 3.1.2 Synonym searching
- 3.2 Syntactic normalisation
- 3.3 Acronym and abbreviation disambiguation
- 3.4 Plural to singular
- 3.5 Hybrid methods
- 4. Results
- 4.1 Evaluation metrics
- 4.2 Baseline and individual methods
- 4.3 Hybrid methods
- 4.4 Discussion
- 4.5 Comparison with other methods
- 5. Conclusions and future work
- Acknowledgements
- Notes
- References