CHICAGO (GenomeWeb) – Since its launch in February 2015, the Qiagen-led Allele Frequency Community has grown from 15 founding collaborators to a "few thousand" data contributors, according to Sean Scott, vice president for genomics and bioinformatics market development at Qiagen.
There are now more than 300,000 "high-quality" genome samples available to community members, including those from public data sources, Scott said. "In most cases, these are folks contributing clinically oriented data."
The mission of the Allele Frequency Community is to foster genomics data sharing to help address challenges with sequencing interpretation. "In simple terms, think of it as genomics data crowdsourcing," Scott said.
"The Qiagen Allele Frequency Community basically provides an expansive, ethnically diverse, and freely available resource for allele frequency annotations across a very large genomics data set," Scott explained. He called it an "invaluable data reference set for the analysis and interpretation of genomes."
Qiagen created the Allele Frequency Community after acquiring Ingenuity Systems, having determined that there was no widely available, accurate, ethnically diverse collection of human genomes to serve as a reference set for researchers. "[While] some of the early efforts around 1000 Genomes, Exome Variant Server, etc., were helping, they simply weren't sufficient or well-funded," Scott said.
"What we're trying to solve for is not only the broader interpretation challenge and the lack of an ethnically diverse reference set, but the scenario that we saw time and time again is that people were identifying prospective disease-causing variants that appeared to be rare in the general population or the publicly available sequence data sets, but in fact, some cases might be polymorphisms in other ethnic subpopulations," Scott explained.
The Allele Frequency Community is the second community-based initiative Qiagen has backed in the last few years. In 2013, the vendor offered free use of its Ingenuity Variant Analysis software as part of an effort called the Empowered Genome Community. That initiative was intended to provide a secure collaborative environment in which individuals who'd had their genomes sequenced could share their data with one another and access tools to interpret it.
"We've always traditionally incorporated third-party, publicly available annotations and data sets into our interpretation solutions, but we saw an opportunity to do more here," Scott recalled.
Before the advent of the Allele Frequency Community, the majority of samples were private. "Both on the public side and even on the commercial side with testing labs, there just simply wasn't good infrastructure nor incentive to foster better sharing," he said.
The public data sets also lacked the ethnic diversity that Qiagen and the other founding partners wanted in order to improve the accuracy of genomics-based diagnostics, according to Scott. "When someone's looking at a prospective disease-causing variant in a particular patient or a cohort of patients and it's deemed to be rare based on the publicly available reference set, it may not be the case in all subethnic populations," he said.
"I've spent many days wondering about the impact of a rare mutation that I see in human samples. More data is more power, and frozen data is useless data," said one founding member, Christopher Mason of the Institute for Computational Biomedicine at Weill Cornell Medical College in New York. "So, I wanted to help build and join a community that could help address these issues," Mason said via email while traveling overseas.
Weill Cornell contributed control genomes, some data sets from its Undiagnosed Disease Program, and genomes related to neglected tropical diseases. Once colleagues saw Mason working with the more diverse genomic data sets from the Allele Frequency Community, many also wanted to participate, without any prompting. "They joined spontaneously," Mason said.
Participants seem to be drawn to the community by the promise of improved accuracy, Scott said. "When people are doing true genome-level interpretation, you're dealing with literally millions of variants that are observed. The real challenge is getting down to a manageable subset that is going to be biologically or clinically relevant," Scott said.
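The filtering problem Scott describes can be illustrated with a minimal sketch. It assumes each variant carries an allele frequency per subpopulation and keeps a variant as a rare candidate only if it is rare in every subpopulation, so a variant that looks rare in one group but is a common polymorphism in another is set aside. The `Variant` class, the variant IDs, the population labels, the frequencies, and the 1 percent threshold are all hypothetical and are not drawn from Qiagen's software or from Allele Frequency Community data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Variant:
    """Hypothetical record: a variant ID plus allele frequency by subpopulation."""
    variant_id: str
    subpop_af: Dict[str, float] = field(default_factory=dict)  # values in 0.0-1.0


def max_subpop_af(variant: Variant) -> float:
    """Highest allele frequency observed in any subpopulation (0.0 if none observed)."""
    return max(variant.subpop_af.values(), default=0.0)


def filter_rare(variants: List[Variant], threshold: float = 0.01) -> List[Variant]:
    """Keep variants that are below the threshold in every subpopulation."""
    return [v for v in variants if max_subpop_af(v) < threshold]


# Illustrative, made-up data.
variants = [
    Variant("var_A", {"pop1": 0.0002, "pop2": 0.0001}),  # rare everywhere
    Variant("var_B", {"pop1": 0.0003, "pop2": 0.08}),    # rare in pop1, common in pop2
    Variant("var_C"),                                    # never observed
]

rare_candidates = filter_rare(variants)
print([v.variant_id for v in rare_candidates])  # ['var_A', 'var_C']
```

Filtering on the maximum frequency across subpopulations, rather than on a single pooled frequency, is one simple way to act on the concern raised above; a production pipeline would also need to account for sample size and data quality in each subpopulation.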
The genomics industry has nearly 20 years of experience in manually curating evidence from scientific literature. "Now that there's an ever-increasing amount of high-quality, very rich genomics data, we're finding that we can complement what we do on the literature curation side with actual data sets," Scott said.
"Now you see many companies coming to market that are focused on machine learning and artificial intelligence. What all of those things require to be successful are very large, very high-quality, very diversified data sets."
Qiagen hosts Allele Frequency Community data after structuring and modeling all the data contributions, Scott said. Anyone opting in by contributing their own sequencing data can access the entire communal data set.
"Recent enhancements have been more around improving the analytics and improving the granularity of the allele frequency data by ethnic subpopulation. We've been in parallel benchmarking how usage of the Allele Frequency Community can help with things like filtering out false positives," Scott said. He reported that Qiagen has seen the number of clinically relevant variants drop by 35 to 40 percent since the community came together.