NEW YORK – The latest version of Illumina's Dragen genome analysis software is poised to take pangenome-based analysis mainstream, improving the ability to call all variant types, according to a new study published by the company in collaboration with researchers at Baylor College of Medicine.
Benchmarked against leading pipelines for calling SNPs and indels, namely the Broad Institute's Genome Analysis Toolkit (GATK) and Google's DeepVariant, Dragen outperformed both in analyzing the Genome in a Bottle (GIAB) sample HG002.
Specifically, Dragen posted an F-score, a combined measure of precision and recall, of 99.86 percent, with 2,553 false positives and 8,610 false negatives. DeepVariant plus the Giraffe mapping tool posted an F-score of 99.64 percent with 3,695 false positives and 24,090 false negatives, while GATK plus the Burrows-Wheeler Aligner (BWA) had an F-score of 99.13 percent with 38,622 false positives and 29,163 false negatives.
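For readers unfamiliar with the metric, the sketch below shows how an F-score is conventionally derived from true-positive, false-positive, and false-negative counts, as the harmonic mean of precision and recall. The article reports only false positives and false negatives, so the true-positive count used here (4,000,000) is a hypothetical round number chosen for illustration, not a figure from the study, and the function name is likewise an assumption.

```python
def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) from variant-call counts.

    tp: calls that match the truth set
    fp: calls that are absent from the truth set
    fn: truth-set variants that were not called
    """
    if min(tp, fp, fn) < 0:
        raise ValueError("counts must be non-negative")
    if tp == 0:
        # With no true positives, all three metrics are defined as zero here.
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # The F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# False positives and false negatives are the Dragen figures quoted above;
# the true-positive count is a HYPOTHETICAL placeholder, not study data.
HYPOTHETICAL_TP = 4_000_000
precision, recall, f_score = precision_recall_f(HYPOTHETICAL_TP, fp=2_553, fn=8_610)
print(f"precision={precision:.4%} recall={recall:.4%} F={f_score:.4%}")
```

Because false negatives lower recall while false positives lower precision, a caller with far fewer of both, as reported for Dragen, ends up with a higher F-score even when the percentages differ only in the second decimal place.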
Moreover, Dragen was able to call other types of variants, including short tandem repeats (STRs), structural variants (SVs), and copy number variants (CNVs). For insertions larger than 50 bp, Dragen achieved an F-score of 76.9 percent, compared to 34.9 percent for Manta, an SV caller for short reads developed by Illumina. And for CNVs between 1 kb and 10 kb, Dragen performed better than the short-read CNV caller CNVnator, though performance for larger CNVs was more similar.
"I'm a big fan of comprehensive genomics," said Fritz Sedlazeck, a bioinformatician at BCM and a senior author of the paper, published last week in Nature Biotechnology. "This is an important milestone in bringing STR, SV, and CNV calling to a broader audience and to scale it in population studies or trios to enhance our understanding of these regions in diseases and different phenotypes."
The study also presented analysis results for over 3,200 samples from the 1000 Genomes Project, where Dragen identified 116.3 million SNVs and 25 million indels. Performance on known SNVs and common indels was comparable to GATK; however, Dragen found millions more singletons and rare indels.
"This is a nice demonstration of the pangenome and a preview of things to come," said Michael Schatz, a bioinformatician at Johns Hopkins University, who was not involved with the study. "Suddenly, everything gets better," he said.
"I don't see it as a major threat to long reads, but it does help close that gap," he said, noting that the HG002 genome "is not a whole-genome benchmark," given that it focuses more on high-confidence regions and "leaves out some of the trickier parts" of the human genome that harbor repeats and SVs.
Sedlazeck, a former postdoc in Schatz’s lab, disclosed that Illumina provided computing credits for the study and that he has received funding from sequencing competitors Oxford Nanopore Technologies and Pacific Biosciences.
The study made use of Illumina's Dragen version 4.2, released in 2023 and updated to v4.3 in June of this year, a pipeline that is available for onboard computing with some Illumina instruments including the NovaSeq X Series and the NextSeq 1000 and 2000 systems. Illumina acquired the Dragen platform in 2018 when it bought Edico Genome. The platform uses specialized hardware called field-programmable gate arrays (FPGAs) to speed up analysis.
"Similar to how a graphics processing unit (GPU) can accelerate the numerical processing for machine learning, an FPGA is much more efficient for data intensive parallel computing than a standard CPU, allowing them to cut the runtime by manyfold," Schatz said. Dragen was able to identify all the variants from raw data at 30X coverage in only 30 minutes, he noted, compared to about 24 hours for GATK on a server.
However, the reliance on FPGAs could also limit adoption. They're not commonly available on computing servers, Schatz said, and while they are available on the cloud, "not everyone will be willing or able to use cloud computing for their research."
Illumina also requires users to obtain a license to run Dragen. Sedlazeck said that as part of this study, Illumina agreed to make Dragen available to academic institutions under a special license. In an email, an Illumina spokesperson said the firm will offer a free trial license to Dragen that allows academic researchers to process 2,500 GB of sequencing data to reproduce results from the publication and to demo the software on their own projects. After that, labs would need to purchase a license. A 30X human genome could be processed for approximately $8, including license and cloud fees, when accessed through Illumina's managed cloud, she said.
In 2019, Illumina and the Broad Institute partnered to integrate GATK with Dragen. In July, a Broad blog post suggested it was still working on an official release of a "unified Dragen-GATK pipeline." However, Illumina said that "new features that were added after [Dragen version] 3.7.8 will not be integrated into external tools such as GATK."
Earlier this month, Broad Clinical Labs announced that its whole-genome sequencing-based laboratory-developed tests, which use Dragen for analysis, had been approved by the New York State Clinical Laboratory Evaluation Program.
Using Dragen "makes enough of a difference to justify it taking over," Sedlazeck said. "Our center is using it more and more across different studies and experiments."
Both Dragen and DeepVariant use pangenomes "in their best-performing workflows," said Benedict Paten, a computational biologist at the University of California, Santa Cruz and a leader in the Human Pangenome Reference Consortium. "With the second release of the pangenome now forthcoming, we can expect further improvements in these widely used tools."
For most users, the shift to pangenome-based methods will be invisible. Already, hundreds of thousands of genomes are being analyzed with Dragen through the UK Biobank and All of Us projects. But for anyone who was still hesitant about its use, "this will be another stamp of approval," Schatz said.