NEW YORK (GenomeWeb) – The American Association for Cancer Research's Project GENIE made its first public release of cancer genomic data yesterday, including almost 19,000 genomic records from patients treated at the eight participating institutions.
GENIE, which stands for Genomics Evidence Neoplasia Information Exchange, was first announced in late 2015. Initially, the project declared seven participating centers, but was soon joined by MD Anderson Cancer Center, making a total of eight first-phase members.
The argument for GENIE and similar efforts is that there are insights that can only be gleaned by the analysis of data from tens if not hundreds of thousands of patient results, something that is impossible if data from different institutions cannot be combined.
This embrace of sharing has begun to gain ground more broadly in the genomics field, especially with President Obama's announcement of the Precision Medicine Initiative in 2015 and the subsequent Cancer Moonshot led by Vice President Biden.
For example, a National Cancer Institute advisory board for the Moonshot earlier this year raised the idea of a National Cancer Data Ecosystem to collect and share large datasets.
In another marker of a potential sea change toward greater openness of genomic data, when AstraZeneca announced last year that it planned to analyze the genomes of 2 million patients to help inform its drug discovery research, it said that it would share all resulting variant data apart from IP related to its own drug development efforts.
The release this week places GENIE's dataset among the largest fully public repositories of this kind so far, project leaders said, and this is only the first of many planned updates as the project participants continue to amass, and then eventually release, clinical genomic records.
The structure of GENIE is fairly simple. Participating centers agree to share all of their genomic sequencing data after an initial period of exclusivity. Individual centers first have six months to use and study their own data exclusively, then all eight centers have another six months in which they can study the data. After that, it becomes freely publicly available.
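The two-stage embargo described above can be sketched in a few lines of Python. This is only an illustration, not GENIE's actual tooling: the function names are hypothetical, and the sketch assumes the clock starts on the date a record is generated, which the project's description does not specify.

```python
from datetime import date
import calendar


def add_months(d, months):
    """Return the date `months` calendar months after `d`, clamping the day
    to the end of the month when needed (e.g. Aug. 31 + 6 months -> Feb. 28)."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def access_stage(generated, today):
    """Classify who may use a record on `today`.

    Assumption: the embargo is counted from `generated`, a hypothetical
    start date chosen for this sketch.
    """
    consortium_open = add_months(generated, 6)   # end of the center's exclusive period
    public_open = add_months(generated, 12)      # end of the consortium-only period
    if today < consortium_open:
        return "contributing center only"
    if today < public_open:
        return "all consortium centers"
    return "public"


if __name__ == "__main__":
    print(access_stage(date(2016, 1, 15), date(2016, 8, 1)))
```

Under these assumptions, a record generated in mid-January 2016 would be exclusive to its own center until mid-July, open to all eight centers until mid-January 2017, and public after that.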
According to AACR, the 19,000 records in the release this week include 59 major cancer types, and there is data, for example, on nearly 3,000 patients with lung cancer, more than 2,000 patients with breast cancer, and more than 2,000 patients with colorectal cancer.
And as the descriptor suggests, the data is fully public, and available for anyone to access, as long as they agree to terms including promises to not attempt to identify or contact individual participants or subjects, to properly cite the resource in future publications, and to not redistribute any data without permission from the GENIE coordinating center.
The collection and harmonization of the GENIE data, as well as this eventual access to it, are handled by the project's two informatics partners, Sage Bionetworks and cBioPortal, which was originally developed at Memorial Sloan Kettering Cancer Center.
As the project was gearing up early last year, some of the main challenges the group was anticipating were in harmonization.
Each GENIE institution has its own sequencing panels, software pipelines, and methodologies for calling variants. Some use targeted panels covering hundreds of loci, while others might only cover tens. Some institutions might also target certain types of variants, like large rearrangements, while others focus only on single nucleotide variants.
In a guide released alongside the newly public data, GENIE reported, for example, that three of the centers used Thermo Fisher Scientific Ion Torrent sequencers, while the remaining five were on Illumina machines.
Five participants use analysis methods that cover only hotspot regions, three cover all coding exons and selected introns, and a single center, MSKCC, also collects sequence data from promoter regions.
Two centers, MSKCC and the Princess Margaret Cancer Centre University Health Network, sequence both tumor and normal DNA, while the rest sequence only tumor tissue.
According to the data guide, contributing GENIE centers provided their mutation data in Variant Call Format (VCF) or Mutation Annotation Format (MAF) files with additional fields for read counts supporting variant alleles, reference alleles, and total depth.
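To illustrate why those read-count fields matter, the following minimal Python sketch computes a variant allele fraction (variant-supporting reads divided by total depth) from a small tab-delimited table. The column names `t_ref_count`, `t_alt_count`, and `t_depth` follow a common MAF convention and are an assumption here, as are the placeholder gene names and counts; the GENIE files may label these fields differently.

```python
import csv
import io

# Made-up example rows; column names are assumed, not taken from the GENIE guide.
SAMPLE_MAF = (
    "Hugo_Symbol\tt_ref_count\tt_alt_count\tt_depth\n"
    "GENE_A\t80\t20\t100\n"
    "GENE_B\t45\t55\t100\n"
    "GENE_C\t0\t0\t0\n"
)


def variant_allele_fractions(maf_text):
    """Return (gene, VAF) pairs, where VAF = variant reads / total depth.

    Rows with zero depth get None rather than raising a division error.
    """
    reader = csv.DictReader(io.StringIO(maf_text), delimiter="\t")
    results = []
    for row in reader:
        alt = int(row["t_alt_count"])
        depth = int(row["t_depth"])
        vaf = alt / depth if depth > 0 else None
        results.append((row["Hugo_Symbol"], vaf))
    return results


if __name__ == "__main__":
    for gene, vaf in variant_allele_fractions(SAMPLE_MAF):
        print(gene, vaf)
```

Having the counts reported the same way by every center is what lets a calculation like this run unchanged across data from different sequencing panels and pipelines.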
"We were taking advantage of a grassroots model of collaboration that is very new, so we didn't even know if we could pull it off, or if there could be needed harmonization," AACR Project GENIE steering committee chairperson Charles Sawyers said in an interview this week.
"But what's very clear from today's release is that a lot of harmonization was able to happen. And more than that, there was enough overlap to come to some really important conclusions that will be revealed in a manuscript that will be published in a few weeks," he added.
While many of the important lessons learned from this first year of the project, in terms of biomarker or other discoveries that could impact patient care, will only be revealed in that upcoming paper, Sawyers said that the data also speaks for itself in terms of the project's success in tackling difficult harmonization challenges.
For example, the newly released GENIE data guide reports that one issue with data from centers that performed tumor-only sequencing was that despite their use of controls to minimize mistaking germline mutation events for somatic variants, there remained a risk that the data from these centers could still contain germline variants that could theoretically be used for patient re-identification.
To address that, the GENIE consortium developed a stringent germline filtering pipeline and uniformly applied it to all variants across all the centers, the group reported.
Another challenge, Sawyers said, was in amassing and appropriately treating the clinical data that accompanies genomic results in the dataset.
All the subjects in the GENIE collection had to have level 1A clinical data, Sawyers explained, and the group had to make sure that they could harmonize this information across sites. For example, "you can imagine a name of a rare condition might be very different in Amsterdam versus Toronto or Nashville," Sawyers said.
"The real power of the way this consortium was assembled is that theories can be made of this data and questions formulated, so for patients with a particular rare mutation, maybe you can now see that there are 200 across this dataset, but then you can also look at what happened to those patients," he said.
Moving forward, the project plans to issue new data releases on a quarterly basis. With the milestone of its first release, GENIE has also now opened a call for new participants. Sawyers said that the application process is likely to be launched by the end of the month, with decisions made in a month or two and new consortium members coming on board sometime later in the year.