HYPHEN
A flexible, hybrid method to map phenotype concept mentions to terminological resources
Narrative clinical records and biomedical articles constitute rich sources of information about phenotypes, i.e., markers distinguishing individuals with specific medical conditions from the general population. Phenotypes help clinicians to provide personalised treatments. However, locating information about them within huge document repositories is difficult, since each phenotypic concept can be mentioned in many ways. Normalisation methods automatically map divergent phrases to unique concepts in domain-specific terminologies, so that all mentions of a concept of interest can be located and linked. We have developed a hybrid normalisation method (HYPHEN) to handle concept mentions with wide-ranging characteristics across different text types. HYPHEN integrates various normalisation techniques that handle surface-level variations (e.g., differences in word order, word forms or acronyms/abbreviations) and lexical-level variations (where terms have similar meanings, but potentially unrelated forms). HYPHEN achieves robust performance on both biomedical academic text and narrative clinical records, and can significantly outperform related methods.
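To make the idea of surface-level normalisation concrete, the following is a minimal sketch, not HYPHEN itself. It assumes a toy terminology with invented concept IDs (`C001` etc.), an invented abbreviation table, and a deliberately naive plural-to-singular rule; all names here are hypothetical. It shows how mentions that differ only in word order, number, or abbreviation could be mapped to one concept. Lexical-level variation (synonyms with unrelated forms) is not covered by this sketch.

```python
import re

# Hypothetical toy terminology: concept ID -> preferred term (illustrative only).
TERMINOLOGY = {
    "C001": "chronic cough",
    "C002": "shortness of breath",
    "C003": "infection of lung",
}

# Hypothetical abbreviation table (illustrative only).
ABBREVIATIONS = {"sob": "shortness of breath"}

# Function words ignored when comparing phrases.
STOPWORDS = {"of", "the", "in"}


def singularise(token):
    """Very naive plural-to-singular rule; a real system needs a proper lemmatiser."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def surface_key(phrase):
    """Build an order-insensitive key for a phrase.

    Steps: lowercase, expand a known abbreviation, tokenise,
    drop stopwords, and singularise each remaining token.
    """
    phrase = phrase.lower().strip()
    phrase = ABBREVIATIONS.get(phrase, phrase)
    tokens = re.findall(r"[a-z0-9]+", phrase)
    return frozenset(singularise(t) for t in tokens if t not in STOPWORDS)


# Index the terminology by surface key so lookup is a single dictionary access.
INDEX = {surface_key(term): cid for cid, term in TERMINOLOGY.items()}


def normalise(mention):
    """Return the concept ID for a mention, or None if no surface match exists."""
    return INDEX.get(surface_key(mention))


if __name__ == "__main__":
    for mention in ["lung infections", "SOB", "Chronic coughs", "fever"]:
        print(mention, "->", normalise(mention))
```

Using a set of normalised tokens as the key makes matching insensitive to word order, at the cost of conflating phrases that contain the same words with different meanings; a hybrid method would combine such a step with others rather than rely on it alone.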
Article outline
- 1. Introduction
- 2. Related work
- 3. Methods
- 3.1 Lexical variant generation
- 3.1.1 Transformation between English and Neoclassical terminology
- 3.1.2 Synonym searching
- 3.2 Syntactic normalisation
- 3.3 Acronym and abbreviation disambiguation
- 3.4 Plural to singular
- 3.5 Hybrid methods
- 4. Results
- 4.1 Evaluation metrics
- 4.2 Baseline and individual methods
- 4.3 Hybrid methods
- 4.4 Discussion
- 4.5 Comparison with other methods
- 5. Conclusions and future work
- Acknowledgements
- Notes
- References