Sage, MSKCC Discuss Informatics Infrastructure Plans for AACR's Project GENIE


NEW YORK (GenomeWeb) – Earlier this month, the American Association for Cancer Research launched the Genomics, Evidence, Neoplasia Information Exchange (GENIE), an international data-sharing initiative focused on creating a registry of clinical next-generation sequencing results from cancer patients along with selected longitudinal information on these patients such as outcomes data.

The initiative is kicking off with a pilot project involving the seven founding member institutions of the consortium, all of which have agreed to pool and share somatic sequencing results collected in their respective centers and limited clinical information linked to those results. The founding members of the consortium are the Center for Personalized Cancer Treatment in Utrecht, the Netherlands; Dana-Farber Cancer Institute; Institut Gustave Roussy in Villejuif, France; Johns Hopkins University; Memorial Sloan Kettering Cancer Center; Princess Margaret Cancer Centre; and Vanderbilt-Ingram Cancer Center.

The consortium has tapped Sage Bionetworks and the developers of the cBioPortal for Cancer Genomics, a web-based resource for visualizing and analyzing large-scale cancer genomics datasets that was developed by researchers at MSKCC, to provide the requisite informatics infrastructure for aggregating and sharing the information collected by the project. Specifically, Sage is handling the initial data processing and cleaning using Synapse, its internally developed informatics platform. The cleaned and harmonized information will then be transferred to a private instance of cBioPortal where it will be made accessible to both consortium members and the cancer research community as a whole.

During a conference call held earlier this month to announce the project, Barrett Rollins, chief scientific officer of the Dana-Farber Cancer Institute, said that the partner institutions have agreed to share single nucleotide variants as well as insertions and deletions, and some institutions will be able to provide copy number and structural variant information from patients as well. The centers will also provide specimen-level information such as details on pathology and cancer diagnosis.

Clinical data provided along with the samples will include structured elements, such as gender, age, ethnicity, cancer, and sample type, as well as dates of diagnosis, sample collection, and sequencing, among other details. There are also plans to link longitudinal data with the NGS information, enabling researchers to explore outcomes such as clinical response to therapy, survival, and whether or not clinical interventions made based on sequencing results improved outcomes, Rollins said. "Those are the kind of results that we need to show to the outcomes community as well as to payers in order to make this sustainable."

As of the project launch date, consortium members have contributed about 17,000 genomic records including data from patients with metastatic disease as well as early-stage patients, Rollins said on the call. All of this data is going into Sage's Synapse, and over a three-month period it will be checked, cleaned, and normalized before being loaded into cBioPortal.

Under the terms of the consortium's master participation agreement, each institution will then have exclusive access to its own cleaned-up information in cBioPortal for six months. After that, all of the data will be available exclusively to members of the GENIE consortium for another six months, during which they will be able to run aggregate queries across all the contributing institutions' data in addition to their own. After that, the portal will be opened up more broadly to the cancer research community.
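The staged release described above can be sketched in a few lines of Python. This is only an illustration of the six-month and 12-month cutoffs reported in the article; the function names, the tier labels, and the assumption that the clock starts on the day data is loaded into cBioPortal are hypothetical and do not come from the consortium's agreement.

```python
from datetime import date


def months_between(start: date, today: date) -> int:
    """Whole months elapsed from start to today (0 if today precedes start)."""
    months = (today.year - start.year) * 12 + (today.month - start.month)
    if today.day < start.day:
        months -= 1  # the current month is not yet complete
    return max(months, 0)


def access_tier(loaded_on: date, today: date) -> str:
    """Return who may see a dataset, given a hypothetical load date.

    Assumption: the embargo clock starts when the data is loaded.
    """
    elapsed = months_between(loaded_on, today)
    if elapsed < 6:
        return "contributing institution only"
    if elapsed < 12:
        return "consortium members"
    return "broader research community"


if __name__ == "__main__":
    loaded = date(2015, 11, 1)  # hypothetical load date
    for day in (date(2016, 1, 15), date(2016, 5, 1), date(2016, 11, 1)):
        print(day.isoformat(), "->", access_tier(loaded, day))
```

Counting whole calendar months keeps the sketch short; a real policy would need to say exactly when each window opens and closes.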

In the next phase of GENIE, researchers interested in accessing the data will be able to propose clinically-based queries to a data committee, which will evaluate proposals in terms of the clinical significance of the question being asked, whether or not the GENIE database offers a sufficient sample size, and if there are sufficient funds available to run the experiment, Rollins said. If the proposal passes muster, then the consortium will put together a disease-specific sub-group that will define the relevant data elements needed to answer the question and reach out to the consortium members, who will extract and curate the necessary data elements, then provide the clinical data along with the sequencing results to Synapse and then on to cBioPortal.

During the call discussing the initiative, Justin Guinney, Sage's director of computational oncology, said that his firm has been tasked with providing a versioned data repository for the GENIE project, tracking sequencing workflows and data provenance, and helping with data de-identification. The company will also be responsible for harmonizing the data collected by the project, including providing consistent gene annotation across the datasets and using consistent nomenclature and ontologies, he said.

"The biggest challenge in this context is that we are not working on a unified sequencing platform," Guinney told GenomeWeb in a conversation after the call. Each institution has its own sequencing panels, software pipelines, and methodologies for calling somatic variants. Panels provided by some institutions target 300 to 400 genetic loci while others target fewer. Moreover, panels will change over time as researchers add or subtract genes from their panels, or they might begin exploring variants occurring in different regions of the genome, Guinney noted. Computational pipelines also evolve as new and improved alignment tools are developed or different genome builds come out. That makes it crucial to capture all of the metadata around the clinical centers' workflows and pipelines, he said.

Currently, the company is evaluating the different formats and annotation approaches that the different centers have used and trying to figure out the best ways of unifying the outputs from the different methodologies.

"There are some tricky things that we are trying to deal with right now," said Guinney. "For example, we have some Foundation Medicine output, and we are having to essentially do some reverse engineering to figure out how that could be reprocessed and provided to cBio portal," he said. Also, "There are a lot of inconsistencies in terms of the data format, [and] the quality of the different metrics that we are receiving."

As part of its efforts, Sage is re-annotating all of the variant call files submitted by the institutions to a common definition. The company is also currently developing repositories and dashboards for histopathology data as well as pipelines for deriving consistent mapping and annotation and for de-identifying data. Sage has also provided some basic guidelines and rudimentary workflow definitions that will help contributing institutions capture metadata including ways of describing genome builds and variant callers that they use to generate their data.
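As a rough illustration of the kind of workflow metadata described here, the sketch below records a genome build and variant caller per submission and normalizes common build names. The record fields, the alias table, and the choice of GRCh37 as the target build are all assumptions made for the example; the article does not describe Sage's actual schema or reference build.

```python
from dataclasses import dataclass

# Common aliases for human genome builds (hg19 is generally treated as
# equivalent to GRCh37, and hg38 to GRCh38). The table is illustrative.
BUILD_ALIASES = {
    "hg19": "GRCh37",
    "grch37": "GRCh37",
    "hg38": "GRCh38",
    "grch38": "GRCh38",
}


@dataclass(frozen=True)
class PipelineMetadata:
    """Hypothetical per-submission workflow record."""
    center: str
    panel_version: str
    genome_build: str
    variant_caller: str


def normalize_build(name: str) -> str:
    """Map a free-text build name onto one canonical label."""
    key = name.strip().lower()
    if key not in BUILD_ALIASES:
        raise ValueError(f"unrecognized genome build: {name!r}")
    return BUILD_ALIASES[key]


def needs_liftover(meta: PipelineMetadata, target: str = "GRCh37") -> bool:
    """True if the submission's coordinates are on a different build
    than the (assumed) common target."""
    return normalize_build(meta.genome_build) != target


if __name__ == "__main__":
    submission = PipelineMetadata("Center A", "v2", "hg38", "caller-x")
    print(normalize_build(submission.genome_build), needs_liftover(submission))
```

Recording the build and caller explicitly is what makes it possible to tell, later, which submissions have to be remapped or re-annotated before they can be pooled.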

"We are trying to be somewhat iterative in terms of how we evolve. ... Over time we expect this to expand quite significantly, but we expect to keep it pretty basic to begin with," Guinney said. "There are other things that we really haven't quite worked through yet in terms of histology and clinical terms that we are capturing [so] we're still working through those components."

Some of the centers will submit limited datasets to Synapse, which will be stripped of most personal identifiers but will contain information on dates. Once the data is uploaded to Synapse, Sage will strip the leftover identifiers before passing the data on to cBioPortal. The reason for this, Guinney explained, is that there was some concern among consortium members that having the centers do the full de-identification themselves would result in inconsistencies across datasets, and so the task was passed on to Sage to ensure that things would be done more uniformly. There are exceptions to this scenario, however, since some centers are required by their respective institutional review boards to completely de-identify their data before sending it to Synapse. Dana-Farber is an example of this.

To ensure the security of the incoming data, Sage is providing controlled-access "virtual buckets" in which each participating institution can deposit its data. Institutions can upload information via application programming interfaces provided by Sage as well as through a web interface to Synapse. Once the data is uploaded, Sage will then de-identify the data, as needed, before exporting it into cBioPortal. In cases where institutions must fully strip the data before uploading to Synapse, Sage is providing researchers at the institutions in question with the code that it is using for de-identification within Synapse to ensure that those datasets are consistent with the broader pool of data, Guinney said.
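One common way to handle the dates that remain in a limited dataset is to replace them with intervals measured from an anchor event and to drop any leftover identifier fields. The sketch below shows that idea only; it is not Sage's de-identification code, and the field names, the list of fields to drop, and the use of the diagnosis date as the anchor are assumptions for illustration.

```python
from datetime import date

# Hypothetical field names; a real limited dataset would define its own.
ANCHOR_FIELD = "diagnosis_date"
DATE_FIELDS = ("sample_collection_date", "sequencing_date")
DROP_FIELDS = ("medical_record_number",)


def deidentify(record: dict) -> dict:
    """Return a copy of record with identifiers dropped and absolute
    dates replaced by day offsets from the anchor date."""
    anchor = record[ANCHOR_FIELD]
    excluded = set(DATE_FIELDS) | set(DROP_FIELDS) | {ANCHOR_FIELD}
    cleaned = {k: v for k, v in record.items() if k not in excluded}
    for field in DATE_FIELDS:
        if field in record:
            new_name = field.replace("_date", "_days_from_diagnosis")
            cleaned[new_name] = (record[field] - anchor).days
    return cleaned


if __name__ == "__main__":
    example = {
        "medical_record_number": "000000",  # made-up value
        "tumor_type": "colorectal",
        "diagnosis_date": date(2015, 1, 10),
        "sample_collection_date": date(2015, 2, 9),
        "sequencing_date": date(2015, 3, 1),
    }
    print(deidentify(example))
```

Running the same function at every center, or centrally after upload, is what keeps the resulting intervals comparable across institutions, which is the consistency concern Guinney describes.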

Meanwhile, researchers at MSKCC are setting up a password-controlled instance of the cBioPortal software internally to host the GENIE data. The cBioPortal software provides tools for exploring, visualizing, and analyzing multidimensional cancer genomics datasets. Its query interface lets researchers explore things like frequently mutated genes, copy number alterations, or patterns of alteration, and can link these to clinical outcomes data. A public implementation of the resource provides access to datasets from the Cancer Genome Atlas, among other sources that the MSKCC development team has collected and curated over time. In addition, there are some 50 different installations of the underlying open-source software running at various institutions that are using the infrastructure in both research and clinical contexts.

The MSKCC team has begun receiving the first batches of cleaned-up data from Synapse. During the AACR conference call, Nikolaus Schultz, an associate attending computational oncologist at MSKCC and lead developer of the cBioPortal, said that the database currently contains about 6,840 tumor samples, out of the 17,000 samples that consortium members have contributed so far, from 6,533 patients diagnosed with breast, colorectal, prostate, bladder, and non-small cell lung cancers, among other tumor types.

These initial datasets offer fairly limited clinical annotation, providing basic information on things like tumor type and site, details on whether a given sample comes from the primary tumor or from a recurrence or metastasis, and patient age and gender, Schultz told GenomeWeb last week. "That's probably it for the very first iteration, but I think the project is going to move very quickly into trying to answer specific clinical questions, and then there'll be efforts at the institutions to collect clinical information that would be needed to answer those questions," he said.

The MSKCC developers are also contributing to Sage's efforts to process data prior to transferring it to the cBioPortal. Specifically, Schultz said that they are providing Sage with some of their own pipelines for tasks such as merging cohorts and for annotating variants. However, the primary task for MSKCC will be providing access to the cleaned-up data. Moreover, "while our system will always have the latest version [of the data] visible, [researchers] can go to [Synapse] and download prior versions as well," he added.

At the moment, the developers aren't creating any specific applications for the GENIE initiative other than what's already available in the broader cBioPortal infrastructure, but Schultz expects that the project will play a role in shaping the portal over the next several years, particularly as the types of samples that are sequenced expand. For example, "We are now beginning to sequence more xenografts of patients, so those have to be visualized somehow maybe in a timeline so that it's clear that a sample xenograft was derived from a particular tumor sample," he said.

It's not clear yet how the data will be made broadly available to the community following the 12-month embargo period. The GENIE datasets might continue to be maintained separately from the broader cBioPortal data, or the two could be combined. "I don't think we've decided the exact mechanism of that yet," Schultz said.