Extracting Cancer Phenotypes from Electronic Health Records - Tapping the power of unstructured data

According to the World Health Organization (WHO), cancer is a large group of diseases that can start in almost any organ or tissue of the body. The disease begins when abnormal cells grow uncontrollably, go beyond their usual boundaries, and invade other organs. The WHO reports that cancer was the second leading cause of death globally in 2018, accounting for an estimated 9.6 million deaths. Cancer still lacks a definitive solution, and it places a heavy physical, emotional, and financial strain on individuals, communities, and the healthcare system.
Advances in cancer biology and deep learning have enabled us to step into this rapidly evolving domain. Genomic profiling is now prevalent; however, understanding cancer behavior fully requires correlating genomic data with phenotypic data. Cancer phenotype information includes tumor morphology (e.g., histopathologic diagnosis), laboratory results (e.g., gene amplification status), specific tumor behaviors (e.g., metastasis), and response to treatment (e.g., the effect of a chemotherapeutic agent on tumor volume).
Phenotypic information is contained in clinicians' notes, which are usually free text. This unstructured information is not amenable to computational analysis and therefore cannot be used directly for research on cancer phenotyping. Transforming these free-text fields into practical, quantified data remains a challenging problem because the notes lack standardization. To address this, we propose to develop an intelligent tool based on Natural Language Processing (NLP) that extracts cancer phenotype information from clinical notes, which are a ubiquitous but unstructured data source.
The system will be developed using a dataset provided by an external source, CureMD, which will first be cleaned to reduce noise. An ontology will then be created with the help of domain experts and used to annotate the data. Next, a named entity recognition (NER) model will be trained to extract phenotypes from the clinical notes. The resulting information will be compiled in a medically compliant format so that it can be used in the medical domain. The phenotypic profile of cancer generated by this method would form the basis for further analysis, such as the creation of a knowledge graph, with potential benefits to patients and the community. Thus, this work not only proposes a novel approach for extracting phenotypes but can also serve as a backbone for a multitude of future projects.
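To make the pipeline above concrete, here is a minimal sketch of the clean, then extract flow. It is not the project's method: a simple dictionary matcher stands in for the trained named entity recognition model, and the `ONTOLOGY` categories and terms, the `Entity` record, and the function names `clean_note` and `extract_phenotypes` are all hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-ontology (category -> surface terms). The real ontology
# would be built with domain experts; these entries are illustrative only.
ONTOLOGY = {
    "MORPHOLOGY": ["invasive ductal carcinoma", "adenocarcinoma"],
    "LAB_RESULT": ["her2 amplified", "er positive"],
    "BEHAVIOR": ["metastasis", "metastatic"],
    "TREATMENT_RESPONSE": ["partial response", "stable disease"],
}


@dataclass(frozen=True)
class Entity:
    label: str   # ontology category
    text: str    # matched span as written in the note
    start: int   # character offset in the cleaned note
    end: int


def clean_note(note: str) -> str:
    """Reduce noise: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", note).strip()


def extract_phenotypes(note: str) -> list[Entity]:
    """Dictionary-based baseline standing in for a trained NER model."""
    cleaned = clean_note(note)
    entities = []
    for label, terms in ONTOLOGY.items():
        for term in terms:
            # Whole-term, case-insensitive match.
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, cleaned, flags=re.IGNORECASE):
                entities.append(Entity(label, m.group(0), m.start(), m.end()))
    return sorted(entities, key=lambda e: e.start)


if __name__ == "__main__":
    sample = "Biopsy shows  invasive ductal carcinoma,\nHER2 amplified. No evidence of metastasis."
    for entity in extract_phenotypes(sample):
        print(entity.label, "->", entity.text)
```

A matcher like this ignores context: in the sample note it still tags "metastasis" even though the sentence negates it. Handling negation, abbreviations, and spelling variants is one reason a trained model on annotated notes is proposed rather than a fixed term list.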


  • Dr Muhammad Moazam Fraz


  • Saad Ahmad Khan
  • Farina Tariq