NEW YORK (GenomeWeb) – Informatics researchers from the University of Pittsburgh and Boston Children's Hospital have received a $696,000, five-year grant from the National Cancer Institute, which they will use to develop new natural language processing (NLP)-based tools for extracting cancer phenotype information from the clinical texts that are part of electronic medical records.
The researchers assert in their grant abstract that although the narratives that clinical texts tell remain "one of the most important sources of phenotype information" with "the potential to enable new insights about cancer initiation, progression, metastasis, and response to treatment … current models for correlating EMR data with omics data largely ignore [it]."
The proposed research, they wrote, will use "emerging standards in phenotype knowledge representation and NLP" to "automate cancer deep phenotype extraction from clinical text," thus enhancing cancer researchers' ability to use unstructured data sources in their translational research programs.
The teams are co-led by Rebecca Crowley, an associate professor of biomedical informatics, intelligent systems, and pathology at UPitt, and Guergana Savova, an associate professor at Boston Children's Hospital and Harvard Medical School. In a conversation with BioInform this week, Crowley said that in the first phase the teams will work on building cancer-specific modules for a software package called the Clinical Text Analysis and Knowledge Extraction System (cTAKES).
cTAKES is an open source NLP-based system developed by Savova's team for extracting information from biomedical texts. It has been used more broadly to mine information in projects such as Informatics for Integrating Biology and the Bedside (i2b2), where it was used to obtain data on patients' status as it related to conditions such as multiple sclerosis, inflammatory bowel disease, and type 2 diabetes. It has also been applied in the Electronic Medical Records and Genomics (eMERGE) project to identify patients with peripheral arterial disease.
cTAKES works by parsing medical texts into different parts and identifying the terms within the text that encode concepts relevant to diseases, disorders, medications, signs and symptoms, procedures, and anatomical terms, Savova explained to BioInform. It specifically identifies SNOMED CT terms, which are the adopted standard for the medical domain, but it does not cover cancer-specific concepts, which include things like stages, grading, metastases, and primary and secondary tumor sites and the distinctions between the two. That’s the "kind of encoding we have to implement in cTAKES," she said. The teams will use standardized terminologies such as those provided by the College of American Pathologists for describing pathology reports.
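To make the idea of term identification concrete, the following is a minimal sketch of dictionary-based concept lookup in Python. It is not the cTAKES implementation or its API; the dictionary, the `find_concepts` function, and the `TOY-` codes are all hypothetical placeholders, and the codes are not real SNOMED CT identifiers.

```python
import re

# Hypothetical mini-dictionary: surface term -> (semantic type, placeholder code).
# The codes are invented for illustration and are NOT real SNOMED CT identifiers.
TOY_DICTIONARY = {
    "breast cancer": ("disorder", "TOY-0001"),
    "tamoxifen": ("medication", "TOY-0002"),
    "nausea": ("sign_or_symptom", "TOY-0003"),
    "biopsy": ("procedure", "TOY-0004"),
    "left breast": ("anatomical_site", "TOY-0005"),
}


def find_concepts(text):
    """Return (start, end, matched text, semantic type, code) tuples sorted by position."""
    hits = []
    lowered = text.lower()
    for term, (semantic_type, code) in TOY_DICTIONARY.items():
        # Whole-word, case-insensitive match of each dictionary term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            span_text = text[match.start():match.end()]
            hits.append((match.start(), match.end(), span_text, semantic_type, code))
    return sorted(hits)


note = "Biopsy of the left breast confirmed breast cancer. Started tamoxifen; reports nausea."
concepts = find_concepts(note)
for concept in concepts:
    print(concept)
```

A production system would add sentence splitting, tokenization, negation detection, and a full terminology; the sketch only shows the lookup step, and it would need additional entries for the cancer-specific concepts Savova describes.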
The modules developed under this grant will be included in the open source cTAKES software, which is released under an Apache license. They will enable tumor researchers to annotate information in texts including things like cancer diagnosis, histopathology information, and so on, Crowley said. In addition, these modules will extract more in-depth information about the patient, beyond things like the type and stage of cancer, such as patients' co-morbidities, procedures, family histories, previous diseases, mental states, and so on, Savova noted.
A second project planned for the grant will focus on developing methods of extracting meaningful variant information from clinical texts. "Currently, most of the information we have about somatic mutations on specific cancer patients is buried in the text of a molecular path report for example," Crowley explained. "Right now, if we wanted to get that information out of an EMR, it would be almost impossible to do it without NLP [and so] one of the things we are going to be doing is to create methods that would allow people to begin to do that."
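As a rough illustration of why this is hard to do without NLP, the sketch below pulls simple gene-plus-protein-change mentions out of free text with a regular expression. The pattern, the `find_variant_mentions` function, and the sample sentence are assumptions made for illustration, not the methods the grant will produce.

```python
import re

# Simplified pattern: a gene-like symbol followed by a protein change such as V600E.
# Real reports use many more notations; this handles only one common short form.
VARIANT_PATTERN = re.compile(
    r"\b(?P<gene>[A-Z][A-Z0-9]{1,9})\s+(?:p\.)?(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z])\b"
)


def find_variant_mentions(text):
    """Return (gene, reference residue, position, altered residue) for each mention."""
    return [
        (m.group("gene"), m.group("ref"), int(m.group("pos")), m.group("alt"))
        for m in VARIANT_PATTERN.finditer(text)
    ]


# Hypothetical report sentence, written for this example.
report = "Molecular testing detected BRAF V600E; KRAS G12D was not detected."
mentions = find_variant_mentions(report)
print(mentions)
```

Note that the sketch returns both mentions even though the second is negated in the sentence ("was not detected"). Pattern matching alone cannot tell a finding from a ruled-out finding, which is one reason extracting meaningful variant information calls for fuller language processing.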
Another component of the grant will develop methods for "temporal extraction," that is, methods of gathering information on the progression of a patient's tumor and of consolidating information from multiple time points that may be stored in separate documents.
For example, Crowley said, "if you were to extract information out of a pathology report … and we know for example from a biopsy or a resection of a breast tumor that the patient's pathologic stage is T2N1M0, we know something about the size of the tumor, we know something about the node status and we know something about the metastasis." The letters stand for tumor, node, and metastasis, some of the parameters used to describe solid tumors.
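The stage string in Crowley's example can be split into its three components with a short parser. This is a simplified sketch: the `parse_tnm` function and its pattern are illustrative assumptions, and the pattern deliberately ignores staging subcategories and most prefixes used in real reports.

```python
import re

# Simplified TNM pattern: optional clinical/pathologic prefix (c or p),
# then T, N, and M categories. Subcategories (for example, letter suffixes) are not handled.
TNM_PATTERN = re.compile(
    r"\b[cp]?T(?P<t>[0-4]|is|X)N(?P<n>[0-3]|X)M(?P<m>[01]|X)\b"
)


def parse_tnm(text):
    """Return the first TNM stage found as a dict, or None if no stage is present."""
    match = TNM_PATTERN.search(text)
    if match is None:
        return None
    return {"t": match.group("t"), "n": match.group("n"), "m": match.group("m")}


stage = parse_tnm("Pathologic stage: T2N1M0.")
print(stage)
```

Each field then carries one piece of what Crowley describes: `t` for the primary tumor, `n` for node status, and `m` for distant metastasis.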
At a later date, a second document such as a radiology report may be created that reveals, for example, that "there is a metastasis to the brain — so M no longer equals 0 — that also would be documented in text," Crowley said. "We could extract that information from the text itself but we would really have no way to kind of relate [the information in] those documents." So, this part of the project will focus on providing tools that investigators can use "to extract these higher level representations of a patient," she added.
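One way to picture relating such documents is to replay per-document findings in date order and record the patient's state after each date. The sketch below assumes the findings have already been extracted; the `observations` data, its dates, and the `build_timeline` function are hypothetical and stand in for whatever representation the project ultimately adopts.

```python
from datetime import date

# Hypothetical per-document extractions: (document date, document type, finding, value).
# Dates and values are invented; they mirror the T2N1M0-then-brain-metastasis example.
observations = [
    (date(2014, 3, 10), "radiology report", "M", "1"),
    (date(2013, 6, 2), "pathology report", "T", "2"),
    (date(2013, 6, 2), "pathology report", "N", "1"),
    (date(2013, 6, 2), "pathology report", "M", "0"),
]


def build_timeline(obs):
    """Replay findings in date order and keep one state snapshot per document date."""
    state = {}
    timeline = []
    for doc_date, _doc_type, finding, value in sorted(obs, key=lambda o: o[0]):
        state[finding] = value  # later documents overwrite earlier values
        snapshot = (doc_date, dict(state))
        if timeline and timeline[-1][0] == doc_date:
            timeline[-1] = snapshot  # same date: update the existing snapshot
        else:
            timeline.append(snapshot)
    return timeline


timeline = build_timeline(observations)
for doc_date, patient_state in timeline:
    print(doc_date.isoformat(), patient_state)
```

The second snapshot keeps the tumor and node values from the pathology report while updating the metastasis value from the radiology report, which is the kind of cross-document view that neither report provides alone.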
Initial efforts will focus on phenotypes associated with breast cancer, melanoma, and ovarian cancer. Crowley and Savova's teams will be working on breast cancer with Adrian Lee, a professor in UPitt's pharmacology department; on melanoma with John Kirkwood, the vice chairman for clinical research at UPitt's medical school; and on ovarian cancer with Robert Edwards, a professor and vice chair of UPitt's obstetrics department. They'll also work with the informatics cabal to define what phenotypes are going to be most important, what things people really care about in terms of investigations, and what people really want to correlate, for example, with tumor mutations, Crowley said. Although they are focusing on just three tumor types, she expects that the tools that work in these domains will be general enough to work for other types of cancers.
Crowley and Savova have worked together in the past under a different grant on the Ontology Development and Information Extraction (ODIE) toolkit — a set of software components for extracting and using data in clinical documents to create biomedical ontologies. They have also collaborated on another application called the Text Information Extraction System (TIES), which uses NLP algorithms and query visualization methods to provide access to biomedical data that might be contained in pathology reports and associated tissue.
The latter system has already been deployed and is being used in multiple cancer centers, Crowley said, and she hopes to incorporate the modules developed for cTAKES into the TIES infrastructure, making them broadly available to institutions that are part of that network.