NEW YORK (GenomeWeb) – Part of the rationale that underlies efforts such as the national Precision Medicine Initiative and the American Association for Cancer Research's data-sharing initiative is that by pooling genomic and other datasets from multiple sources, scientists can build up large enough cohorts to study mutation function, personalize treatments, and stratify patients for clinical trials.
But the logistics of sharing data across institutions are not trivial because of factors such as differences in the platforms used to generate the data. Researchers at New York's Weill Cornell Medicine and Australia's Garvan Institute of Medical Research hope to address some of those challenges as part of an agreement to share de-identified germline and somatic mutation data as well as de-identified clinical data that they have collected from consenting cancer patients at their respective institutions.
According to a memorandum of intent signed by the institutions, the partnership provides an opportunity to merge datasets to increase statistical power and obtain new insights into genotype and phenotype associations. Researchers would also be able to better assess and interpret rare somatic and germline variants, Mark Rubin, director of Weill Cornell's Englander Institute of Precision Medicine, noted. "If a mutation only occurs in one percent of the patient population at Weill Cornell Medicine, having the added experience from the Garvan will help us determine if the prevalence of the mutation is real," he said in a statement.
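The statistical argument behind Rubin's example can be illustrated with a short calculation. The sketch below is only an illustration, not a method described by either institution: it assumes two cohorts of 1,000 patients each and a mutation seen in 1 percent of both (hypothetical counts), and it uses the standard Wilson score interval to show how pooling narrows the uncertainty around an observed prevalence.

```python
import math


def wilson_interval(carriers, n, z=1.96):
    """Wilson score interval (about 95% for z=1.96) for a proportion."""
    p = carriers / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts, chosen only for illustration.
single_site = wilson_interval(carriers=10, n=1000)   # one cohort alone
pooled = wilson_interval(carriers=20, n=2000)        # two cohorts merged

single_width = single_site[1] - single_site[0]
pooled_width = pooled[1] - pooled[0]

print(f"single site: {single_site[0]:.4f} to {single_site[1]:.4f}")
print(f"pooled:      {pooled[0]:.4f} to {pooled[1]:.4f}")
```

With the same observed frequency, the pooled interval is tighter than the single-site one, which is the sense in which a second cohort helps establish whether a low prevalence is real.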
Understanding rare mutations is a strong argument in favor of sharing data, according to Olivier Elemento, associate professor and associate director of the Weill Cornell Institute for Computational Biomedicine. It is hard to associate these kinds of mutations with specific clinical phenotypes, such as treatment response, if researchers only see them in a single cohort. But if a mutation shows up in multiple independent cohorts in patients with the same clinical phenotype, "suddenly it becomes much more compelling and much more relevant from a clinical point of view," he said in a conversation with GenomeWeb.
Both Garvan and Weill Cornell have mutational profiles from well over 1,000 patients that they have sequenced over the years. "That's very valuable and important data that makes sense to share," Elemento said. Moreover, each partner has access to datasets from specific cancer subtypes that could be valuable to the other. For example, Garvan sees more melanoma cases than Cornell does. "If we have access to a large melanoma cohort, then suddenly interpretation of melanoma patients' mutational profiles here would potentially be easier," he said.
The Weill Cornell-Garvan partnership also provides a forum for the participating researchers to define and establish standards and procedures for securely sharing genomic and clinical data that they could apply to future data-sharing collaborations, Elemento noted. "We are talking about massive amounts of genomic profiles and genomic information," he said. "It is actually not so easy to share that kind of data." For example, Garvan and Weill Cornell primarily generate different kinds of data — whole-genome and whole-exome respectively, he said. "We need to learn how to combine those platforms into a unique dataset and ... how to do it in an efficient way to make it very practical for people to do on both sides."
Increasingly, communities within the oncology domain are coming together to contribute resources to a common pool to help push the field forward. For example, last year, AACR launched the Genomics, Evidence, Neoplasia Information Exchange (GENIE) initiative, an international effort to build a registry of somatic sequencing results from cancer patients along with selected longitudinal information such as outcomes data. More recently, the Global Alliance for Genomics and Health released a second tier of data on 13,000 BRCA variants gleaned from various public repositories. Meanwhile, Ambry Genetics is offering aggregate allele-frequency data from 10,000 hereditary breast and ovarian cancer patients through AmbryShare.
There are a number of options for sharing data. One recent effort is the Collaborative Cancer Cloud, an Oregon Health & Science University-led initiative that aims to provide a secure platform for sharing and analyzing large quantities of oncology data. It uses Intel-developed infrastructure to help hospitals and research institutions share private genomic, imaging, and clinical datasets. Recently, Dana-Farber and the Ontario Institute for Cancer Research signed on to participate in pilot projects with OHSU.
But there are also open-source mechanisms like Beacon, which was developed by the Global Alliance for Genomics and Health (GA4GH). It lets researchers provide basic information by answering 'yes' or 'no' questions about variants in their databases without providing details about the larger dataset or moving patients' information. That's particularly important for germline datasets, which can be used to infer some additional information about contributing individuals that they may not want to share, Elemento noted. Beacon is one of the tools that his team and their Garvan collaborators plan to use as part of their data-sharing efforts. "It [provides] a step-by-step interrogation of germline genomes that preserves privacy," he said. "It is a compelling tool to use at least to get started ... query[ing] each other's germline database."
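The yes-or-no idea behind Beacon can be sketched in a few lines. The example below is a toy, in-memory illustration and not the GA4GH Beacon API or either institution's implementation: the `Variant` and `LocalBeacon` names, and the variant coordinates, are hypothetical. What it shows is the design point Elemento describes: a caller learns only whether a variant is present and never sees patient records or the rest of the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A variant identified by chromosome, position, and alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


class LocalBeacon:
    """Hypothetical stand-in for a site's variant database.

    The only question it answers is presence or absence.
    """

    def __init__(self, variants):
        # Stored privately; there is no method to list or export contents.
        self._variants = frozenset(variants)

    def has_variant(self, chrom, pos, ref, alt):
        """Return True or False and nothing else."""
        return Variant(chrom, pos, ref.upper(), alt.upper()) in self._variants


# Made-up variants standing in for two sites' germline databases.
site_a = LocalBeacon([Variant("17", 1000100, "G", "A"),
                      Variant("13", 2000200, "C", "T")])
site_b = LocalBeacon([Variant("17", 1000100, "G", "A")])

# Ask each site the same yes/no question.
query = ("17", 1000100, "g", "a")
answers = {"site_a": site_a.has_variant(*query),
           "site_b": site_b.has_variant(*query)}
print(answers)
```

A real deployment would expose this check over a network service with access controls; the sketch omits both.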
To share somatic variants, the partners plan to use cBioPortal, a portal developed and maintained by Memorial Sloan Kettering Cancer Center that provides access to data from The Cancer Genome Atlas and other studies. It offers tools for visualizing, exploring, and analyzing large-scale multidimensional cancer genomics data including somatic mutations, DNA copy-number alterations, methylation, and expression data. AACR uses this platform to aggregate and share clinical and genomic information from the GENIE initiative. For this project, the partners plan to set up mirror cBioPortal installations at their institutions that will initially operate independently, but the datasets could eventually be merged once the appropriate legal and other requirements have been put in place, Elemento said.
They are also working with researchers at MSKCC to make customizations to cBioPortal. For example, they want to add tools that will let external users access internal implementations of cBioPortal as well as mechanisms for tracking data access, Elemento said. This way, authorized researchers at Weill Cornell and Garvan will be able to access each other's implementations of the portal. They are also packaging the software and associated dependencies using Docker to make it easier to install locally, he added.
In addition to these tools, the partners expect they will have to develop some new software to help with sharing data more effectively as the collaboration progresses, and they plan to release it to the broader community, Elemento said. They also plan to release some existing pipelines that they will share as part of their partnership, including one developed at Weill Cornell for assessing whether a patient will respond to immunotherapy based on their genomic profile. As part of the project, they also hope to create a publicly available joint knowledgebase of clinical cancer variants and their annotations and interpretations.
Besides sharing the data, each partner also hopes to tap the other's expertise in other areas, Elemento said. For instance, Garvan has significant experience and infrastructure for doing germline whole-genome sequencing, he said, while Weill Cornell, for its part, has a much stronger focus on whole-exome sequencing and clinical testing primarily in cancer. "One of the ideas was to be able to learn from their expertise in the [whole-genome] space and for them to learn about what we do in terms of clinical testing in cancer," he said.