Google Touts Speed, Accuracy From Machine Learning in DeepVariant


CHICAGO (GenomeWeb) – Evidence published in the journal Nature Biotechnology in September demonstrated the efficacy of DeepVariant, Google's deep-learning-based variant caller, compared to older methods of calling genomic variants.

DeepVariant "replaces the assortment of statistical modeling components with a single deep-learning model," according to the paper, whose authors represented Google and sister company Verily.

"The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling," the authors said.

According to the researchers, DeepVariant outperformed the Broad Institute's Genome Analysis Toolkit both on the NA12878 sample from the Illumina Platinum Genomes dataset and on standard whole-genome sequencing of 35 replicates of NA12878. "DeepVariant produced more accurate results with greater consistency across a variety of quality metrics," they reported.

An early adopter of DeepVariant, Color Genomics, further validated the technology in a poster presented in San Diego this month at the annual meeting of the American Society of Human Genetics. Shortly after Google introduced an open-source update to DeepVariant in December 2017, Burlingame, California-based Color became the first clinical genetic testing lab to add this variant caller to its clinical processing pipeline.

"When they began testing DeepVariant, we began running internal tests to see how it might improve the value that we deliver to our clients," said Color Vice President of Engineering Jeremy Ginsberg.

Notably, Color looked at the 59 genes for which the American College of Medical Genetics and Genomics recommends that laboratories report secondary findings when performing clinical exome and genome sequencing tests.

The company then sequenced samples from Coriell Institute cell lines, as well as 7,000 of its own samples, and called variants with DeepVariant and other, more traditional callers.

"Looking at this dataset, we found 15 variants across the 59 ACMG genes that were not detected by any other callers," Ginsberg said. "Those 15 variants, although we didn't confirm them with external labs, visual inspection suggests that they are indeed present," he added.

According to Ginsberg, DeepVariant has extra sensitivity in regions with high guanine-cytosine content, and the ACMG-59 set has 3.4 times as many high-GC base pairs as a panel of 30 genes associated with hereditary cancer that Color also tested. However, the DeepVariant caller did not produce so many calls that researchers were overwhelmed.

"You don't want to generate so many candidate variants that internally you're drowning in false positives. We did not have that problem," Ginsberg said.

"The limited number of novel variant calls detected only by DeepVariant suggests a limited impact on our downstream workload," Color Genomics added in the ASHG poster.

Color ran its experiment with version 0.6 of DeepVariant, the third incremental release since Google introduced the open-source code as version 0.4 nearly a year ago. The research community actually forced Google's hand on the release.

The initial manuscript that Google submitted to Nature Biotechnology described the version of DeepVariant that won the PrecisionFDA Truth Challenge award for highest SNP accuracy in April 2016, according to Mark DePristo, Google's head of deep learning for genetics and genomics and a member of Brain, a deep-learning project in the Google AI division. However, journal reviewers criticized the technology for not being open-source.

"We took about a year to fully rewrite the whole thing from the ground up so that it could run open-source and inside of Google using the latest and greatest deep-learning tech that we have," DePristo said.

The Brain team programmed the revamped system in TensorFlow, a library of open-source programming code for numerical computation that is popular for deep-learning applications, and optimized it for hardware accelerators like graphics processing units and Google's own tensor processing units. Google designed TPUs specifically for neural network machine learning; DeepVariant contains what is called a convolutional neural network.

"That software architecture that we have chosen is probably the one that it will live in for quite some time," DePristo said. "It looks like a standard bioinformatics tool now. You can run on a single machine, you can run it on premise. If you run it in an environment where there are GPUs or TPUs, it can make use of those."

The Nature Biotechnology paper describes technology up to and including version 0.4, according to DePristo. The open-source iteration of DeepVariant has shown significant increases in speed and accuracy, and there have been improvements in each subsequent release.

Version 0.5 introduced "production-grade" support for exomes, DePristo said. "That was a significant improvement in the accuracy compared to 0.4."

In a blog post announcing that release in April, Google said that 0.5 produced 43 percent fewer indel errors and 22 percent fewer SNP errors in whole-exome sequencing than traditional variant callers.

Version 0.6, which is what Color relied on for its recent poster, added support for polymerase chain reaction-positive samples. "Our collaborators had noted that our performance was disproportionately worse on PCR-positive samples, so we added some new training data to the training pipeline that included PCR-positive samples. Now, that's radically more accurate," DePristo said.

For the most recent release, 0.7, Google decided to focus on speed and what DePristo called "cost optimization."

"DeepVariant is now three to four times faster than it was in 0.6," DePristo said. "If you're using TPUs for the evaluation, it's also quite a bit cheaper."

He said that computing costs for processing a whole genome at 30X on Google Cloud are about $6 with version 0.6. With 0.7, the price falls to $2.

"The turnaround times are potentially even more significant," as that is important in the clinical realm, according to DePristo.

Running the current version of DeepVariant on a traditional CPU takes about 10 hours. It's just 20 minutes with a TPU. While acquiring TPU time is more expensive, the significantly faster turnaround makes a TPU three times more cost-efficient than a CPU, he said.
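The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration built from the numbers quoted above (about 10 hours on a CPU, about 20 minutes with a TPU, a threefold cost-efficiency gain, and $6 versus $2 per 30X genome). The article gives no hourly prices, so the price ratio computed here is merely what those figures would imply if they all describe the same workload, and the function names are hypothetical.

```python
# Figures quoted in the article; everything derived from them is an
# illustration, not a published price list.
CPU_MINUTES = 600.0          # about 10 hours on a traditional CPU
TPU_MINUTES = 20.0           # about 20 minutes with a TPU
COST_EFFICIENCY_GAIN = 3.0   # TPU run described as three times more cost-efficient
COST_V06_USD = 6.0           # 30X whole genome on Google Cloud, version 0.6
COST_V07_USD = 2.0           # same workload, version 0.7


def speedup(cpu_minutes: float, tpu_minutes: float) -> float:
    """How many times faster the TPU run finishes."""
    return cpu_minutes / tpu_minutes


def implied_price_ratio(cpu_minutes: float, tpu_minutes: float,
                        efficiency_gain: float) -> float:
    """Implied ratio of TPU hourly price to CPU hourly price.

    With cost = runtime * hourly_price, requiring
    cost_cpu / cost_tpu == efficiency_gain gives
    price_tpu / price_cpu == speedup / efficiency_gain.
    """
    return speedup(cpu_minutes, tpu_minutes) / efficiency_gain


def relative_saving(old_cost: float, new_cost: float) -> float:
    """Fraction of the old cost that is saved."""
    return (old_cost - new_cost) / old_cost


if __name__ == "__main__":
    print(f"Speedup: {speedup(CPU_MINUTES, TPU_MINUTES):.0f}x")
    ratio = implied_price_ratio(CPU_MINUTES, TPU_MINUTES, COST_EFFICIENCY_GAIN)
    print(f"Implied TPU/CPU hourly price ratio: {ratio:.0f}x")
    saving = relative_saving(COST_V06_USD, COST_V07_USD)
    print(f"Cost reduction from 0.6 to 0.7: {saving:.0%}")
```

Under these assumptions, a run that is 30 times faster but only three times cheaper overall implies TPU time priced at roughly 10 times the CPU rate per hour, which is consistent with DePristo's remark that TPU time is more expensive to acquire.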

DePristo reported adoption in the last few months from clinicians as well as from laboratory professionals and researchers. "We've seen quite a lot of pickup, for instance, in a variety of agricultural areas," he said, including the International Rice Research Institute, a Philippines-based organization that has used the technology to call rice variants.

"At this stage, we don't necessarily think that any specific niche is the target. What we really want to do is work with people and understand what are the advantages in their areas" of how DeepVariant can best help them, he said.

DePristo also discussed SVAI, a collaborative community of programmers in Silicon Valley that seeks to introduce artificial intelligence into computational biology and biotech, often through hackathons. For example, a recent hackathon looked for a way to improve variant calling on cancer genomes sequenced on BGISEQ-500 instruments.

"DeepVariant team members … retrained DeepVariant on BGISEQ data as part of the hackathon and released a radically improved variant caller for BGISEQ in 24 hours," DePristo said. "It shows what is possible with something like DeepVariant, where it learns to correct for itself if you have more training data, so that was very exciting to see."

While quickly retraining algorithms may be possible with deep learning, DePristo sees an educational challenge ahead. "This is when you are confronted with new data types that you haven't explored, but this is not such a commonly thought-of capability in the community," he said.

"How do we get people to a state where they see, for instance, how best to leverage the capabilities of deep-learning tech in their genomics workflows?" DePristo wondered.

"We're trying to explain that there are all sorts of interesting things that you can do with this technology."