In an effort to broaden access to complex oncology data sets, the National Cancer Institute is preparing to unveil a new resource called the Cancer Molecular Analysis portal, which will integrate large, disparate genomics data sets from the Cancer Genome Atlas project and other cancer genomics studies.
The CMA portal is scheduled to launch next week with about a terabyte of brain cancer data from TCGA. TCGA’s ovarian and lung cancer datasets are the next scheduled to arrive, with “a continuous flow of datasets into the CMA Portal over the next few months,” said Subha Madhavan, associate director of life sciences informatics in the NCI’s Center for Biomedical Informatics and Information Technology.
The second data set to be loaded into the CMA portal will be brain glioma data from approximately 500 patients from the so-called Rembrandt study headed by the NCI’s Neuro-Oncology branch. “The reason for lining this up right after TCGA is to enable comparisons and correlations between two brain tumor datasets and neuro-oncologists will benefit from two large-scale comprehensive studies in one portal,” said Madhavan, who is managing the CMA portal project.
Other NCI-supported projects will be imported into the portal as the data become available and as the data sharing policies are worked out for those studies. Two examples are the Target study, a childhood cancer initiative to catalog genomic changes in high-risk acute lymphoblastic leukemia and neuroblastoma; and CGEMS, or the Cancer Genetic Markers of Susceptibility study, an initiative to identify genetic alterations that make people susceptible to prostate and breast cancer.
Madhavan expects Rembrandt data to be available via the CMA portal by the end of this year. The timing for Target and CGEMS in the CMA portal has yet to be determined, as the data must first become available and the sharing policies must be worked out, she said.
The portal, part of the NCI’s Cancer Biomedical Informatics Grid project, is expected to enable researchers to integrate, visualize, and explore clinical and genomic characterization data, said Madhavan.
The initial version of the CMA portal will include genomics data from more than 200 patients suffering from glioblastoma multiforme, along with diagnosis information, treatment history, pathology status, the site of the tumor, and background on the patients’ surgery, said Madhavan.
Genome characterizations available through the portal will encompass sequence data, gene expression studies, copy number and SNP analysis, methylation studies, and miRNA expression data.
Kenneth Aldape, associate professor in MD Anderson Cancer Center’s department of pathology, told BioInform in an e-mail that he expects this data to be of great value, particularly because “for each tumor sample, multiple platforms have been used to profile the cancer genome.”
As a result, he said, “for the first time, we can integrate data from changes in the cancer cell on the DNA, RNA, and epigenetic levels. Insights gained from this integration of data will most likely lead to new ways that we can understand the molecular pathogenesis of glioblastoma.”
Navigating the portal, users can view and access mutation profiles from tumor samples in reference to the human genome, mine clinical characteristics such as survival data and tumor staging, and correlate that with mutation and genome characterization results using a number of analytical tools.
While scientists can currently accomplish these tasks using other resources, Madhavan noted that there is no single integrated source for this information. “Look at the number of databases one would need to access,” she said, citing clinical information, the metastatic status of patient tumors, tissue annotation, and expression data as examples.
“A lot of these tools and databases are geared toward sophisticated statisticians and analysts who know how to handle these tools, but the goal for CMA is to put [them] in the hands of the decision-makers, the physician-scientists,” said Madhavan.
The portal is designed to let these end users work with the data without expert assistance, she said, using caBIG software functionality to help scientists find the type of datasets they need, from TCGA and elsewhere. “They can put in a gene name and it will bring back a Kaplan-Meier survival chart.”
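For readers unfamiliar with the chart Madhavan mentions, the following is a minimal sketch of the standard Kaplan-Meier estimator, applied to patients split at the median expression of one gene. It does not show how the CMA portal or caBIG software computes its charts. The function names, the patient record layout, and the median split are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    curve = []
    survival = 1.0
    for t in sorted(deaths):
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


def survival_by_expression(patients, gene):
    """Split patients at the median expression of `gene`; one curve per group.

    `patients` is a hypothetical list of dicts with the keys
    "months", "died", and "expression" (a gene -> value mapping).
    """
    cutoff = median(p["expression"][gene] for p in patients)
    groups = {"high": [], "low": []}
    for p in patients:
        key = "high" if p["expression"][gene] > cutoff else "low"
        groups[key].append(p)
    return {
        name: kaplan_meier([p["months"] for p in members],
                           [p["died"] for p in members])
        for name, members in groups.items()
    }
```

Plotting each returned curve as a step function gives the familiar survival chart, with one line for the high-expression group and one for the low-expression group.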
The first data set in the CMA portal is from the TCGA Pilot project, which is run jointly by the NCI and the National Human Genome Research Institute and aims to assess the feasibility of characterizing all human cancers by starting with three cancer types: brain, lung, and ovarian cancer. To date, TCGA has been making data available to the research community through a data portal launched last year.
The TCGA data portal, set up by the project’s Data Coordinating Center, supports the program’s immediate data release policy. “This is a simple FTP site wrapped into a web site,” said Madhavan, explaining that this site provides access to raw archives that the TCGA centers submitted.
The Cancer Molecular Analysis portal, however, is slated to become a comprehensive site that will include TCGA data and other data, too. “It presents analysis, summaries and allows users to link clinical data and genomic data, which is not possible in the [TCGA portal’s] FTP wrapper,” said Madhavan.
Bulk download of TCGA data will be possible through the TCGA portal, whereas the CMA portal will provide analysis and data visualization capabilities in one site, said Madhavan.
“Some of the vision is to provide a unified view across multiple studies, so people can not only drill deeper into one study but they can cross-correlate and compare data across studies,” she said.
Building in User Needs
The CMA portal offers researchers several data views: a “gene view” to analyze expression, copy number, SNP, and pathway data; a “genome view” to look at entire chromosomal regions; a “clinical view,” which includes Kaplan-Meier survival plots and other data of clinical interest; and analytical tools such as GenePattern, a software platform developed by the Broad Institute that combines workflow with dozens of computational and visualization tools, or the Cancer Genome Workbench, developed by the NCI as a computational platform to integrate clinical tumor mutation profiles with the reference human genome.
Madhavan said that a key goal for the project was to maintain a user focus. “If your tools are not easy to use, you don’t get adoption, and these clinician-scientists are so busy that you don’t want these tools to have such a steep learning curve,” she said.
“I think it will help my work and others in the field,” said Herbert Newton, director of the division of neuro-oncology at Ohio State University Medical Center & James Cancer Hospital.
Newton told BioInform via e-mail that he believes the portal and this kind of data integration “will become more and more valuable as we make further progress with translational programs to develop molecular-based treatments.”
In particular, the glioblastoma multiforme data set “will be very helpful to neuro-oncology researchers working on molecular aspects of high-grade gliomas,” he said. Although there is much information available on the topic, “this will be a much broader effort for characterization of these genes, with a very large and ambitious set of genes to analyze.”
Newton said he expects the TCGA glioblastoma multiforme data set to “eventually become the ‘gold standard’ for molecular characterization and analysis of GBM.”
Madhavan said that an important source of input was a use case workshop for the TCGA data portal held in January, which brought together bench researchers, clinicians, statisticians, and computer scientists who jointly defined how the portal should be configured to house TCGA data. The participants were eager both to build up the technology and to take down barriers between clinical and research disciplines, said Madhavan. “It’s absolutely amazing to see what these groups can do when you put them in one room. They don’t talk to each other every day.”
The CMA data can be explored online with runtime analysis tools that are part of the portal, but it can also be downloaded for downstream analysis by biostatisticians. “Users can go in and select the data types and patients of interest for easy bulk download of data along with clinical and tissue annotations,” said Madhavan.
To obtain that functionality, the NCI team partnered with a number of external researchers to understand how the community will want to access that data and to create ways to let them do so. Among them was Peter Park, a bioinformaticist at Boston’s Children’s Hospital Informatics Program and at the Harvard-MIT Division of Health Sciences and Technology who is also on the faculty of Harvard Medical School.
Madhavan noted that Park was “very passionate” about how researchers will want to “slice and dice” these datasets, such as according to clinical parameters like tissue quality, in order to prepare the data for further analysis with tools of their choice.
Working the Matrix
The NCI developers worked with colleagues from Lawrence Berkeley National Laboratory, Stanford University, MD Anderson, and the University of North Carolina to create a “data access matrix,” which offers users access to different “levels” of data and is “a key functionality of the CMA portal,” she said. This group also became the portal’s beta testers.
As Madhavan explained, “level 1” data is anything that comes out of a machine, such as probe-level data in the case of an Affymetrix array. “Level 2” data in that example would be CHP files with information normalized within a given sample, while “level 3” would be segmented data and “level 4” would comprise genomic regions of interest.
For the matrix, the team sought to clearly indicate to portal users what level of data they are downloading, she said. Scientists seeking to do their own analysis will want mainly raw data, such as what is found in levels 1 and 2, while others may want only processed information.
“The data matrix simply allows one to select sections of the data more easily and reduces the time and effort necessary to obtain the data in a usable format,” Park told BioInform via e-mail. In the case of copy number data, for example, “level 1 is the raw log-ratios, level 2 is normalized log-ratios, level 3 is segmented profiles, and level 4 is the regions called significant aberrations,” he said.
“For instance, a bioinformatician interested in every step of the analysis may want to download the raw data, but clinicians might want data at the level of genes,” said Park. One researcher might want to study expression levels and matched methylation levels for patients with poor survival rates, while another may want to study copy number and expression in another group of patients, he said.
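The kind of selection Park describes can be pictured as a filter over a catalog of archives keyed by patient, data type, and level. The sketch below is only an illustration under that assumption and is not the portal's actual data model. The `Archive` record, the `select_archives` function, and the sample identifiers and paths are hypothetical. The level descriptions follow Park's copy number example.

```python
from dataclasses import dataclass

# Level meanings as Park describes them for copy number data.
LEVELS = {
    1: "raw log-ratios",
    2: "normalized log-ratios",
    3: "segmented profiles",
    4: "regions called significant aberrations",
}


@dataclass(frozen=True)
class Archive:
    """One downloadable file in a hypothetical catalog."""
    patient_id: str
    data_type: str
    level: int
    path: str


def select_archives(archives, data_types, levels, patient_ids=None):
    """Return the archives matching the requested data types, levels, and patients."""
    wanted_types = set(data_types)
    wanted_levels = set(levels)
    unknown = wanted_levels - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown data level(s): {sorted(unknown)}")
    wanted_patients = None if patient_ids is None else set(patient_ids)
    return [
        a for a in archives
        if a.data_type in wanted_types
        and a.level in wanted_levels
        and (wanted_patients is None or a.patient_id in wanted_patients)
    ]


# Made-up catalog entries, for illustration only.
catalog = [
    Archive("P001", "copy_number", 1, "P001/cn_raw.txt"),
    Archive("P001", "copy_number", 3, "P001/cn_segmented.txt"),
    Archive("P002", "copy_number", 3, "P002/cn_segmented.txt"),
    Archive("P002", "expression", 2, "P002/expr_normalized.txt"),
]

# A user who wants only processed copy number data for one patient.
segmented = select_archives(catalog, ["copy_number"], [3], ["P002"])
```

A bioinformatician redoing the analysis from scratch would request levels 1 and 2 instead, while the same call with level 3 or 4 returns only processed results.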
An important goal for the portal, Park said, was to reduce the time that it currently takes to download public data sets and format them for analysis. “Most available data sets are poorly annotated and much effort is required by users to link different parts of the data,” he said.
Another issue is reproducibility. “In general, it is nearly impossible to replicate a result described in a paper by downloading the data and following the description given by the authors, especially when the data are complex.” The data matrix approach “is attempting to make this a bit more friendly,” he said.
Another aspect that the CMA developers considered was patient privacy. The TCGA project defined its own patient-protection policies, and “our job on the CMA portal was to implement those patient privacy protection policies to help ensure that we are protecting the research participants in a manner that is consistent with HIPAA as well as their consent forms,” said Madhavan.
However, as the portal expands to include data from other projects, it will likely encounter a range of different access models. “One has to think carefully about how this data will be shared,” said R. Mark Adams in a presentation outlining the Cancer Molecular Analysis portal at last month’s caBIG annual meeting in Washington, DC.
Adams, a senior associate at caBIG contractor Booz Allen Hamilton, added that grappling with privacy issues “can be as challenging or more challenging than informatics or technical issues.” The problem, he said, is “coming up with ways that we can safely provide widespread access to the data to the widest range of researchers in keeping with protecting the participants.”
CMA handled this challenge by using a tiered approach. One tier is open-access data, such as gene expression profiles, which are publicly available to users without a log-in, Madhavan said, adding that this information “cannot be aggregated to generate data that is unique to an individual.”
The portal also includes a controlled-access data tier, which contains clinical data and individually unique information and requires user certification for data access.
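In code, a two-tier policy of this kind might be expressed roughly as follows. This is a hedged sketch, not the CMA portal's implementation. The tier assignments come from the examples in the article. The names `TIER_BY_DATA_TYPE` and `can_access`, and the choice to treat unlisted data types as controlled, are assumptions.

```python
OPEN = "open"
CONTROLLED = "controlled"

# Tier assignments taken from the examples given in the article.
TIER_BY_DATA_TYPE = {
    "gene_expression": OPEN,
    "clinical": CONTROLLED,
}


def can_access(data_type, user_certified=False):
    """Return True if a user may access the given data type.

    Open-tier data needs no log-in; controlled-tier data requires
    a certified user. Unlisted data types default to the controlled
    tier (an assumption chosen to err on the side of privacy).
    """
    tier = TIER_BY_DATA_TYPE.get(data_type, CONTROLLED)
    return tier == OPEN or user_certified
```

Defaulting unknown data types to the stricter tier means a newly added data set stays protected until someone explicitly classifies it as open.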
For small research labs and community-based cancer centers with only a small number of samples, researchers might use the CMA Portal to increase the statistical power of an analysis, said Madhavan, adding that everyone benefits from the portal’s “instantaneous data release” policy.
“These projects are putting out these data sets in a publicly accessible way even before the publication has come out,” she said. “This is why we are getting interest from outside the TCGA group.”
As Madhavan explained, the goal of the CMA portal is “to lower the barrier to entry to the portal by making open-tier datasets available to users in an easily usable fashion.” As datasets are prepared for the portal, the access policy will need to be tailored to the dataset. For example, Target is a childhood cancer initiative to catalog genomic changes in certain types of pediatric cancers.
“Such an implementation [the open-access tier] may not readily work for Target, where children are involved and the patient privacy concerns are heightened. Hence, we may have to make some changes to the CMA portal software to implement the data release policies of the Target project,” said Madhavan.
Powered by an Integrator
The CMA Portal is powered by caBIG’s caIntegrator module, which had only been applied to smaller studies prior to the portal project. As a result, Madhavan said, the team’s first task was to see if the 1 terabyte dataset could even be loaded into it.
Madhavan said that at the “heart” of caIntegrator is CGOM, or the clinical genomics object model, which is caBIG’s standard representation for clinical and genomic findings and the annotations that go along with them.
“There is also a real-time analytic engine that provides this on-the-fly computational analysis,” she said. Users can select patient cohorts with certain criteria and pass them to any of dozens of analytic tools, such as GenePattern.
This semantic interoperability is expected to save researchers time, Madhavan said. For example, if a scientist wants to correlate overall survival in patients with a mutation rate in a particular gene, that would require “a lot of semantic connectivity between mutation data and clinical information, [so] that is what we spent most of the time on … figuring out the semantic touch-points between these different data types.”
Adams said in his caBIG talk that an important goal of the portal is to make the data accessible to researchers in a user-friendly, integrated format. “Often the insights in this information are hidden in terms of finding how to correlate the multiple subsets of information,” he said.
Quoting part of a wish list by Daniela Gerhard, the NCI’s director of the office of cancer genomics, Adams said that the CMA portal is envisioned as a way to make this data accessible, and not by saying, “‘Go to the FTP site and knock yourself out.’”