NEW YORK – The results of a major, core facilities-driven benchmarking study for next-generation sequencing platforms are in, and just about every major player in the field can claim a victory of some sort. The data support longstanding advantages touted by market leader Illumina, while also providing a sneak peek at the near future, with a strong showing from newer technology like the Genapsys platform and Oxford Nanopore's Flongle.
The multiyear Association of Biomolecular Resource Facilities (ABRF) NGS study, published Thursday in Nature Biotechnology, had multiple labs sequence the same human and microbial samples across multiple platforms from Illumina, Pacific Biosciences, Thermo Fisher Scientific, BGI, Oxford Nanopore Technologies, and Genapsys.
"We didn't want to get into people comparing one system to another," said Scott Tighe, technical director of the University of Vermont Advanced Genomics Lab and co-first author of the study. "For us, that’s the wrong intent of this data. It's more so that you can benchmark yourself against these datasets."
That said, the study afforded the authors the ability to proclaim leaders across several metrics. "Among short-read instruments, [Illumina's] HiSeq 4000 and X10 provided the most consistent, highest genome coverage, while BGI/MGISEQ provided the lowest sequencing error rates," the authors wrote. PacBio's HiFi sequencing "had the highest reference-based mapping rate and lowest non-mapping rate," and along with Oxford Nanopore, "showed the best sequence mapping in repeat-rich areas and across homopolymers." They added that NovaSeq's 2 x 250 bp chemistry "was the most robust instrument for capturing known insertion/deletion events."
Other positives from the study were the performance of the Genapsys platform, which is still not fully commercially available, and Oxford Nanopore's Flongle, an adapter device for the MinIon and GridIon sequencers that enables the use of lower-throughput, and cheaper, flow cells. "Flongle knocked it out of the park," said Chris Mason, a sequencing expert at Weill Cornell Medicine and a senior author of the study.
Overall, the study provides tons of data for any number of comparisons that could be useful for researchers. New methods, including bioinformatics tools, could be validated against these data, and there's even information on how much technical variation was introduced by using different methods, or the same method in different labs.
"As we enter new regulatory paradigms for molecular pathology, we have to get control of these simple things first, before we attack biological complexity," said Don Baldwin, a professor of pathology at the Fox Chase Cancer Center and a senior author of the paper. "The good news is that a lot of these things are highly consistent."
The recently published paper is phase two of the ABRF study and follows a benchmarking project of RNA sequencing methods published in 2014.
Here, researchers at approximately 30 sites performed whole-genome sequencing of a sample from the HapMap project (a son who is part of a family trio) using the Illumina NovaSeq and three HiSeq instruments; BGI's BGISEQ-500 and MGISEQ-2000; PacBio's HiFi protocol; Oxford Nanopore Technologies' PromethIon, MinIon, and Flongle; and the Genapsys platform with version one chemistry. They also performed whole-exome sequencing on the Thermo Fisher Scientific Ion Proton and S5 instruments. In addition, the study performed WGS for the mother and father in the trio using the Illumina, BGI, and PacBio platforms.
For bacterial genomes, study participants sequenced a metagenomic sample using the Illumina MiSeq, Thermo Fisher Ion PGM and S5, Oxford Nanopore MinIon and Flongle, and Genapsys.
"Illumina is committed to helping the ABRF community to characterize and promote the utility of genomics," Gary Schroth, VP at Illumina and one of two study authors affiliated with a sequencing manufacturer, said in a statement. "This was a very strong study led by Chris Mason and a team of wonderful researchers, and we were very happy to support and participate."
BGI was the only other NGS provider that had an employee contribute as a study coauthor. "While all companies provided advice and reagents, the listed co-authors actually did enough analysis and/or planning and logistics to warrant authorship," Mason noted.
"This comparative analysis also shows that MGI platform (library preparation and sequencing) provides top quality WGS, both in sensitivity and specificity of variant detection. These advantages are very important for genetic research and diagnostic applications," a BGI spokesperson said in an email. They added that the WGS data in the paper are based on PE150 reads but that the MGISEQ already has the capability to do 200 bp reads and that a 300 bp kit is in development.
PacBio, Oxford Nanopore, Thermo Fisher, and Genapsys did not immediately respond to requests for comment.
The study also provides some of the first public data on the performance of the Genapsys instrument. Tighe noted that he "kind of nudged them" to participate in the study and cautioned that it used the first version of the Genapsys sequencing chemistry, which by now has been replaced at least twice, leading to a lower depth of coverage than the system would deliver today. He said that in the future, the Genapsys chemistry is expected to be faster and more automated, and to generate clusters of DNA on the instrument.
"When I look at the Genapsys data, I'm pleased at how similar it seems to be to the MiSeq data and S5 data," Tighe said. "They're all very different read technologies, but they all seem pretty consistent."
"I'd agree, it was very similar," Mason said. "The indel profile was slightly different. Not more [indels], just different."
"If anything, it makes it seem like with a lot of the platforms, if you needed to, you could pick any of them," he said.
The study leaders also claimed that they had published the largest Flongle dataset to date. Tighe noted that the Flongle flow cells used were obtained under a "two-for-one" promotion run by Oxford Nanopore after users experienced quality control issues that led to lower-than-advertised yield.
Compared to a MinIon 9.4 flow cell from Oxford Nanopore, "their bacterial species profile is remarkably close," though the Q scores were slightly lower. "We see Q scores of around 11 on those and see Q13 on the 9.4 [flow cell]," Tighe said.
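For readers less familiar with Q scores: they are Phred-scaled, meaning a score Q corresponds to a per-base error probability of 10^(-Q/10). The short Python sketch below converts the two figures Tighe cited into approximate error rates. It is an illustration only; the function name is hypothetical and not taken from the study's analysis code.

```python
def phred_to_error_rate(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability.

    Uses the standard Phred definition: Q = -10 * log10(p).
    """
    return 10 ** (-q / 10)


# The two average Q scores quoted for Flongle and the MinIon 9.4 flow cell.
for q in (11, 13):
    rate = phred_to_error_rate(q)
    print(f"Q{q}: roughly {rate:.1%} chance that a given base call is wrong")
```

By this definition, Q11 works out to roughly a 7.9 percent per-base error rate and Q13 to roughly 5.0 percent, which gives a sense of the size of the gap between the two flow cells.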
Mason also suggested that the study was the first demonstration of 2 x 250 bp chemistry on the Illumina NovaSeq at high throughput. "The data looked really good for structural variants," he said. "That's not surprising, though — with longer reads, we can see those better."
In addition to sequencing platforms, the study compared several different variant callers, including Google's DeepVariant, the Genome Analysis Toolkit (GATK) HaplotypeCaller, and Sentieon Haplotyper, a commercial GATK emulator.
"I don't want to use the word 'winner,' but DeepVariant had the best precision and sensitivity," said Jonathan Foox, first author of the study and a postdoc in Mason's lab at Weill Cornell Medicine. The caveat is that it's a machine learning-based algorithm and was trained on the cell lines sequenced in the study, he cautioned. He added that Sentieon's performance was close to the others and that it had good computational efficiency.
The study also supports some conventional wisdom in the sequencing field, such as the belief that long- and short-read approaches will ultimately complement each other in a clinical setting. For example, Baldwin said that high-depth Illumina reads could be polished with small panels of nanopore-based long reads.
Going forward, the study leaders stressed, they hope others will find ways to compare new sequencing data to theirs.
"These data and protocols can be used as a roadmap to benchmark any new platform," Tighe said. Mason added that because of the trio sequencing approach and a focus on well-validated genomic variants, the data could also be used to validate new bioinformatics tools.
Baldwin suggested that the metagenomic data could help develop clinical metagenomics and even inform other uses of clinical sequencing. "Just knowing how platforms perform in different GC contexts is important for human sequencing," he said. Using that data to improve the quality control of clinical sequencing could also help minimize costs.
"We're always trying to get as close to a real ground truth as we possibly can when studying the human genome and its composition," Foox said. "Although this doesn't solve anything directly, it will help create a robust and reliable baseline that we can compare any individual genome against."