CHICAGO (GenomeWeb) – This week, Google's AI division released an open-source update to DeepVariant, a variant caller powered by deep-learning technology for improving the accuracy of genome sequencing.
It is essentially a complete reworking of the DeepVariant that took home the PrecisionFDA Truth Challenge award for highest SNP accuracy in April 2016, though the winning algorithm remains the same.
"We really refocused on taking the software infrastructure that drove DeepVariant — which was tied to Google proprietary internal technologies — and rewriting the whole thing from scratch," said Mark DePristo, Google's head of deep learning for genetics and genomics and a member of Brain, a deep-learning project in the Google AI division. DePristo called the update a "significant improvement in the way we approach the problem."
The Brain team wrote it in TensorFlow, an open-source library for numerical computation that is popular for deep-learning applications. "By moving to TensorFlow, we were able to leverage more sophisticated data representations, more advanced deep-learning models, and we were able to train it much more rigorously than we were able to before, and that translates into a significant gain in accuracy even over what it was in PrecisionFDA," DePristo said.
Google took the open-source route in part to make it easier to receive input from researchers about the use cases they were interested in.
"We need to find good examples in the community of genomics problems that look like deep learning could play a role and then do the really heavy lift of figuring out how we apply deep learning tech in that area," DePristo said.
"That's all part of a broader effort to make the data of genomics compatible with the way deep-learning machinery works," he continued.
"People have lots of different data they really wish they could apply deep learning to, and we hope to make that process as smooth and easy as possible over time. Our view is that the only way to make that happen is to do the hard work of finding problems, make sure the infrastructure exists to turn them into the types of examples and labeled data sets that you need, and show how to train models on it and show that they're valuable."
DePristo said that DeepVariant's technology can learn to correct errors from each piece of sequencing equipment. "It learns the error process of the instrument from data alone," he said.
"What we like about DeepVariant is the ability to address all those issues by collecting labeled examples of the data and training it to be better at that type of data. It provides a flexibility and specialization opportunity for all of the different data types you have in a way that is fundamentally difficult in a classical statistical modeling approach," DePristo continued.
He said the technology works well on data from all makes of sequencers and eases the process of transitioning to new sequencers. "In a classical approach, you need to have specialized models for each instrument, so onboarding new instruments is a lot more work," DePristo said. "For us, we just grab the data set, sequence one of the common Genome in a Bottle samples, and add that data to the DeepVariant training system."
This, according to DePristo, frees bioinformaticians and statisticians from having to create models for each new piece of instrumentation. "If you can solve the variant calling problem in an automated way, you can repurpose them to many of the other problems that we don't have solutions for yet, but desperately need people to spend more time on," he said.
"There's only so much that we can do ourselves," said Ryan Poplin, a software engineer at Brain.
"I'm incredibly excited to see when people create their own training data sets what they can do with this thing," Poplin added. "We can apply it to basically any sequencing instrument, as far as we can tell. We can apply it to any species. Really, the sky is the limit at this point."
Brad Chapman, a research scientist in the bioinformatics core at the Harvard T.H. Chan School of Public Health, was a beta tester for the new, open-source version of DeepVariant.
The Chan School of Public Health studies large cohorts, including the Nurses' Health Study at Harvard and the long-running Framingham Heart Study at nearby Boston University. "These are well-phenotyped populations where you have sequencing data, so there is a lot of interest in calling pre-existing mutations in those populations to try to tease out some of the interesting things that might be associated with the phenotypes of people in the cohort," Chapman said.
"Most of the callers do well on the easier parts of the genome," Chapman noted. "When you get into the tricky parts, different ones can help you more than others."
The Chan School has been using DeepVariant as part of an ensemble of callers, alongside the Broad Institute's Genome Analysis Toolkit (GATK), FreeBayes, and Strelka2, he said.
"We added [DeepVariant] in as another caller on top of that so that we have a different methodology. That gives us some more confidence in calls, where those other methods differ, or it can give us a bit extra sensitivity, depending on how you apply the ensembling of the different individual methods," Chapman said.
Harvard has an open-source toolkit called Bcbio-nextgen, an effort to make other variant callers work better together and create ensemble sets. "Now that [DeepVariant] is open-source, which I'm psyched about, we can include it in Bcbio and make it available to people," Chapman said.
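The trade-off Chapman describes, more confidence versus more sensitivity depending on how the ensemble is applied, can be illustrated with a simple voting scheme. The Python sketch below is not Bcbio-nextgen's actual implementation: the function name `ensemble_calls`, the `min_callers` threshold, and the variant records are all hypothetical, and real pipelines work on VCF files rather than in-memory sets.

```python
from collections import Counter

def ensemble_calls(calls_by_caller, min_callers):
    """Keep variants reported by at least `min_callers` callers.

    A low threshold favors sensitivity (closer to the union of all calls);
    a high threshold favors confidence (closer to the intersection).
    """
    support = Counter()
    for variants in calls_by_caller.values():
        # Count each caller at most once per variant.
        support.update(set(variants))
    return {variant for variant, n in support.items() if n >= min_callers}

# Made-up calls: (chromosome, position, reference allele, alternate allele).
calls_by_caller = {
    "deepvariant": {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"), ("chr2", 500, "G", "A")},
    "gatk": {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")},
    "freebayes": {("chr1", 1000, "A", "G"), ("chr2", 900, "T", "C")},
    "strelka2": {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"), ("chr2", 500, "G", "A")},
}

sensitive = ensemble_calls(calls_by_caller, min_callers=1)  # any caller is enough
confident = ensemble_calls(calls_by_caller, min_callers=3)  # most callers must agree

print(len(sensitive), "variants in the sensitive set")
print(len(confident), "variants in the confident set")
```

Adding a caller built on a different methodology changes which variants clear a given threshold, which is one way an extra caller can raise either confidence or sensitivity.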
Chapman hopes to display DeepVariant results alongside those from other callers and from newer, more complex frameworks, such as one for artificial diploids. "The next step would be to wrap it and use the validation tools we already have to run [DeepVariant] alongside GATK and Strelka2 and FreeBayes, and then be able to have a side-by-side comparison," he said.
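A side-by-side comparison of this kind usually comes down to scoring each caller's output against a trusted truth set. The Python sketch below is an illustration only, not the validation tooling Chapman refers to: the function `score_caller`, the caller names, and every variant record are invented for the example, and it assumes exact matching on chromosome, position, and alleles.

```python
def score_caller(called, truth):
    """Return precision, recall, and F1 for one caller against a truth set."""
    tp = len(called & truth)   # true positives: called and in the truth set
    fp = len(called - truth)   # false positives: called but not in the truth set
    fn = len(truth - called)   # false negatives: in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up truth set and calls: (chromosome, position, ref allele, alt allele).
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr1", 300, "G", "A"),
    ("chr1", 400, "T", "C"),
}
calls = {
    "caller_x": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
                 ("chr1", 300, "G", "A"), ("chr1", 999, "A", "T")},
    "caller_y": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
}

scores = {name: score_caller(called, truth) for name, called in calls.items()}

# Print one row per caller so the results can be read side by side.
for name, s in scores.items():
    print(f"{name}: precision={s['precision']:.2f} "
          f"recall={s['recall']:.2f} f1={s['f1']:.2f}")
```

In this toy data, one caller finds more true variants but also reports a false one, while the other is more conservative; laying the metrics out per caller makes that difference visible.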
"Hopefully, what that will do is give us ideas for both how we can use it for more sensitivity and also if there's anything we can feed back to the DeepVariant team in terms of improving it," Chapman added.
Google's Brain team also believes DeepVariant has potential in nonhuman sequencing.
"The human sequencing market has pretty advanced tooling and the tools that exist today are pretty well parameterized for human data. What we've observed is that the impact of DeepVariant is actually disproportionately large as you move away from human data, where things are less well parameterized," DePristo said. He noted that the technology has shown promise in calling mouse and plant genomes.
"What is a significant but not astronomical increase in accuracy in humans becomes an enormous increase in accuracy in plants, not because DeepVariant is doing disproportionately well in plants," DePristo said. "We generalize very, very well so that our calls are similarly accurate in plants [while] existing tools are just much, much worse in these less-well-studied organisms in the genomics community. I think they will very likely get some of the earliest bang for their buck because they are just much less well-served today."