NEW YORK – Large genomic databases for adult cancers, such as The Cancer Genome Atlas, have pioneered the collection of petabytes' worth of genomic, epigenomic, transcriptomic, and proteomic data. Alongside TCGA, the National Cancer Institute developed data hubs and coordination centers that paved the way for data harmonization and standardization, so that anyone accessing these various databases or uploading information to them started from the same baseline.
When it comes to pediatric cancer, there is no comparable genomic data repository. Though several research centers and children's hospitals have begun to collect and disseminate data on their own, there are no agreed-upon standards from one database to another, as there are with adult cancer databases.
In order to begin developing a model of data federation that's more in line with what's been created for adult cancer databases, the NCI last week convened a meeting of stakeholders from academia, government, industry, and advocacy organizations for the Childhood Cancer Data Initiative (CCDI) Symposium, to discuss the challenges unique to pediatric cancer data and how to solve them.
A plethora of data
Many experts insist it is the right time for such an initiative. As researchers and clinicians work with next-generation sequencing technologies to determine which tests and assays work best to pair children with targeted treatments and which technologies will yield the most useful biological information, initiatives such as the Gabriella Miller Kids First Pediatric Research Program and the Treehouse Childhood Cancer Initiative are churning out enormous amounts of genomic information that they're also sharing with the rest of the pediatric cancer community.
St. Jude Children's Research Hospital's bioinformatics group recently launched the St. Jude Cloud, a repository of pediatric clinical genome sequencing data available in real time. The initiative is aiming to provide researchers with high-quality whole-genome, exome, and transcriptome data from consenting St. Jude patients. Data will be uploaded in a private, secure environment on a monthly basis. The idea is to give researchers and clinicians, who may not have access to high-tech sequencing resources, data that can help them better understand the biology of pediatric cancers and make treatment decisions for patients.
The database started in June with prospective data from 685 patients who had undergone clinical genomic sequencing, and data from an additional 273 people was made available in July. St. Jude anticipates adding data from 500 patients to the cloud each year. Retrospective whole-genome sequence data from 10,000 study participants is also available in the repository.
Further, St. Jude partnered with Microsoft, which has provided storage and compute capability for the cloud through Microsoft Azure. The hospital also partnered with DNAnexus to build bioinformatic analysis pipelines and visualization tools directly into the cloud platform. The resulting infrastructure is one that allows users to not only access and download the raw data, but also work within the ecosystem to analyze it, according to Alexander Gout, a bioinformatician at St. Jude who worked on building the cloud.
Tim Triche, co-director for the Center of Personalized Medicine at Children's Hospital Los Angeles (CHLA), has also started a data collection effort related to his center's use of a targeted sequencing panel that he and his team at CHLA have developed in partnership with Thermo Fisher Scientific, called the Oncomine Childhood Cancer Research Assay, or OncoKids.
The OncoKids assay was originally developed to assess the full coding regions of 44 cancer predisposition loci, tumor suppressor genes, and oncogenes; hotspots for mutations in 82 genes; amplification events in 24 genes; and 1,421 gene fusions that have been shown to be clinically relevant in a variety of childhood cancers. Since the first panel was developed, new features of importance have been identified and incorporated, and it now includes 203 unique genes and thousands of fusion drivers, according to Thermo Fisher. Triche anticipated that the panel will continue to be enhanced as new features of importance are identified through techniques like WES, WGS, and total RNA-seq.
Further, CHLA and Thermo Fisher are looking to share the data from OncoKids with other cancer researchers and clinicians. "First, there will be an opportunity to aggregate that data into an ever-larger … database of genomic features encountered in childhood cancer," Triche said in an interview. "And second, anybody running the assay would probably benefit from that knowledge as they attempt to interpret the results locally in their own institution from that assay."
Working with Thermo Fisher, he and his colleagues at CHLA created what they call the International Childhood Oncology Network (ICON) — a community of researchers and clinicians who collaborate with each other by sharing the data they gather from using OncoKids, and by helping each other to determine best practices and experimental protocols specific to childhood and young adult cancers.
ICON has also developed a database called the Childhood Cancer Knowledge Base (CCKB) to give ICON members a place to upload and share their data.
"All they have to do is submit their own material to our database. At the same time, that gives them access to the entire database to interpret their results in the context of all the other data that's been reported with the [OncoKids] assay, and all other assays that we've been able to identify from published series of patients around the world," Triche said.
CCKB also comes with a suite of bioinformatics tools to help users navigate and interpret the data, he added.
Triche presented a poster at the CCDI Symposium noting that ICON's ultimate goal is to create an international collaborative consortium for pediatric cancers. He compared CCKB and ICON to St. Jude's efforts with its cloud and the American Association for Cancer Research's Project Genomics Evidence Neoplasia Information Exchange (GENIE), a multi-phase development of a regulatory-grade registry aggregating and linking clinical-grade cancer genomic data with clinical outcomes.
The idea behind all these efforts is to improve understanding of childhood cancers, which tend to be rare. "There's not really been a mechanism to aggregate all that data," he said. "The thought is that by aggregating data, you'll have the most complete data set available."
The importance of longitudinal data
At the symposium, a group tasked with discussing scientific and clinical research data needs for therapeutic progress emphasized the importance of interconnected and annotated pediatric cancer data sets.
"The challenge that we're addressing is that pediatric cancer genomic data sets contain small numbers of samples and lack linkage to clinical, pathologic, radiographic, and treatment and outcome data," Dana-Farber Cancer Institute researcher Katherine Janeway said in a presentation. "The goal here is to build an interconnected or federated data set that's annotated, with deep longitudinal data and biospecimen availability."
Indeed, Triche had also expressed concern about the limited availability of longitudinal data. Most of the data currently available in pediatric cancer databases is "frozen," he noted. That is, the clinical data that's available was frozen at the time it was submitted, and in most cases, no further information is submitted about a patient's treatments received, responses, outcomes, secondary malignancies, or complications.
"These are critically important pieces of information. Longitudinal clinical data is perhaps the single most important category of information to pair with genomic data," Triche said. "Historically it's been one of the hardest things to do."
One of the virtues of an institutional dataset such as the CCKB, he added, is the ability to follow patients longitudinally and track outcomes over the long term. It's well-documented that childhood cancer survivors endure long-term complications from high-dose chemotherapies, Triche said, and that almost all childhood cancer survivors suffer some kind of treatment-related consequences. Having longitudinal genomic data in hand can help researchers and clinicians select treatments in a way that not only targets tumors more precisely, but also reduces toxicity and morbidity in the long run.
"One of my great hopes is that as we get into this, whether it's CCDI or other initiatives like that, that we can all agree that continually updating the data so that we can have long-term survivorship information and information about patient response and outcome, which may be years later, is actually going to be critically important," Triche said. "Collecting the statistics and the data over a period of time is an extraordinarily challenging process because people move, they go to a different medical center; as children grow up, they get married; they change a name; they move to a different part of the country. For all those reasons, it takes a lot of human labor input and perseverance to track them down."
Despite the challenges, he added, developing mechanisms for long-term follow up and re-contact of pediatric patients will be a critical part of making these databases more valuable.
In her symposium presentation, Janeway said that some of the ways to achieve this goal include supporting the development of data standards for phenomics, or the clinical characteristics of treatment and response; supporting required efforts for contributing data to a federated or interconnected model; facilitating the acquisition of biospecimens and linking those specimens to datasets with an emphasis on serial samples; and identifying opportunities for additional genomic characterization in datasets that remain inadequate.
"We would need to identify the key genomic data sets and biospecimen repositories for integration, and could easily achieve completion of pilots that establish standards to efficiently and reproducibly generate structured phenomic data from electronic medical records [EMRs], clinical trials data, and registry data," she said. "This would accomplish being able to broadly apply best practices for generating deep longitudinal phenomic data in pediatric cancers and annotating those federated datasets with such longitudinal deep phenomics."
She also noted that another challenge for therapeutic progress is that the classification of pediatric cancer subtypes in the clinic "is imperfect and incomplete." However, she added, analytic tools or pipelines applied across datasets could be used to better identify cancer subgroups by harmonizing approaches to calling structural alterations, and integrating the analysis of structural events with other genomic events such as mutations and signatures. This could improve classification for risk stratification and target discovery and enhance the availability of diagnostics that can be used in the clinic to place patients into these pediatric cancer subtypes, Janeway said.
Children's Hospital of Philadelphia pediatric oncologist John Maris also noted that the group had discussed the need for harmonization of pediatric preclinical data. Though there is no easy way to harmonize all the data from preclinical pediatric cancer trials, a short-term one-year goal would be to have a team work to develop these standards and retrospectively go back and apply them to the data, he said. Maris further suggested this team should create visualization tools for the data that would be useful for both investigators and clinicians.
In the long term, he added, "we would like to incentivize individuals that if you're going to do a preclinical trial, it needs to be in this portal…. We do think that this will give us the ability to develop machine learning approaches to try to predict based upon single-agent response data what is the right combinatorial strategies, which is obviously where we all want to get."
Changing the culture
In a discussion on creating datasets for clinical care and associated research, the NCI's Stephen Chanock noted that while cultural thinking around pediatric cancer data does have to change, funding to incentivize these changes — and recognizing the different priorities of the various stakeholders — must also be part of the equation.
"We have these huge mounds of data in many different places that can be very informative. Methodologically, and statistically, and analytically, we have a hard time bringing these [data] together," he said. "We really need to engage our finest minds in figuring out how to do this and recognizing what's available, and what is not available, and what we can do with that kind of information. So, our overarching critical issue is a recognition of the differences in the priorities of the distinct stakeholders."
He also acknowledged that there is a lack of computational and scientific literacy among some stakeholders, noting that it will be up to genetic biologists, computational biologists, and data scientists working with clinicians to figure out ways to better educate their colleagues in the different disciplines that are required to make the best use of large data structures.
Echoing Maris' point about existing preclinical data, Chanock also said that unstructured and legacy data presents its own set of challenges, especially when it comes to misalignment of systems and approaches within, and across, individual institutions. He pointed to the St. Jude Cloud as a good example for creating and maintaining a structured data-capture environment.
Chanock's group also proposed a Prototype Master Protocol as a model for multi-institutional, multi-modal pediatric cancer trials, moving away from single-institution clinical trials. This would require shifting the culture "from many small studies to fewer larger ones that everyone would be able to access and be part of," he said.
The NCI's Lynne Penberthy also addressed the issue of limited representation of pediatric cancer patients in clinical trials, noting that there are still many pediatric cancer patients who are not enrolled in clinical trials, particularly patients who don't have access to large academic centers. The solution proposed by the team was to create a pediatric data ecosystem that could correlate and combine linked data from existing resources, such as clinical trial networks and registries, as well as death and birth records for all pediatric cancer patients.
She also referenced the importance of longitudinal data, noting that the best way to gather such information would be to have real-time data feeds from organizations such as Foundation Medicine or the Children's Oncology Group (COG).
"The impact is that it would provide longitudinal linkage and follow-up, to identify second primary cancers in patients, other survivor-type issues, and it would provide real-time assessment opportunities of options for treatment for these patients as they come into the clinic," she said. "And it creates the ability to have a real-world set of comparison data."
Penberthy also suggested some effort be put into the development of tools that could acquire data from EMRs and exchange that data among various databases. "The impact would be that clinical trials would have the benefit of very rich datasets for treatment options and understanding of outcomes related to those treatment options," she said. "Researchers could combine data from multiple sources for complex research questions and registries would be able to capture both the breadth of data and also long-term follow-up and share that back to some of the research organizations."
Standards and harmony
In a discussion on building an infrastructure to enable federation among disparate pediatric data repositories, the NCI's Tony Kerlavage emphasized the problem of siloed data. He described the group's proposal of an "audacious goal to pilot a demonstration of interoperability across the five top pediatric data resources within one year," which would involve defining and collecting the core data elements from each of those databases and harmonizing them.
NCI's Julie Klemm reiterated the need for tools and resources that capture data in a standardized way, or map to certain standards after data has been captured. She also re-emphasized the need to use data from EMRs to inform research and clinical decision-making.
But she also addressed the need for methods to fill in the gaps where data is missing. "A key point of this meeting is that childhood cancers are rare, and we have small numbers. That's why we need to bring these data together. But we also need to develop methods to address the fact that there are limited data. There are missing value data," Klemm said. "How can we develop new methods to solve this missing data issue that must be overcome to apply modern methods of machine learning and deep learning?"
One suggestion from this group was to develop gold-standard datasets against which all other datasets could be benchmarked for quality. The group also suggested the creation of a central registry of some kind that would record the existence of the various pediatric cancer databases and what information they contain, so that the entire pediatric cancer research community is aware of the resources that are available.