NEW YORK – During beta testing for Element Biosciences' new sequencer last year, one of the customers quickly ran into a problem when trying it out with 10x Genomics' single-cell assays.
10x's Cell Ranger software, used for single-cell sequencing data analysis, was aborting runs and spitting out error messages. The reason? Element's data were, astonishingly, too accurate, with Phred quality scores at Q40 — a basecalling error rate of 1 in 10,000. That's an order of magnitude better than the enduring standard of Q30 accuracy, or an error rate of 1 in 1,000.
"At the time, no on-market compatible sequencer produced a QV [quality value] score higher than 40, so one of the quality control parameters in Cell Ranger was to consider QV scores higher than this as a probable error," Nigel Delaney, 10x's senior director of computational biology, said in an email. "This process treated such data as corrupt, ended the run and returned an error message."
10x quickly issued a one-line code fix, but the scenario highlights the potential for turbulence as companies like Element, Pacific Biosciences, and others ascend to new heights on the Phred scale with improved sequencing chemistries. While the instruments may be able to deliver lots of bases above and beyond Q40, the larger sequencing ecosystem may not yet be ready.
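For illustration, here is a minimal, hypothetical sketch of the kind of quality-value guardrail involved and the one-line style of fix that relaxes it. The function, constant names, and threshold are assumptions for illustration only, not 10x's actual Cell Ranger code.

```python
# Hypothetical sketch of a QC guardrail that treats unexpectedly high quality
# values as corrupt data. Names and thresholds are illustrative assumptions,
# not 10x's actual Cell Ranger code.

MAX_EXPECTED_QV = 40  # ceiling chosen when no compatible sequencer exceeded Q40


def validate_quality_string(qual: str, offset: int = 33) -> None:
    """Raise if any Phred+33-encoded quality value falls outside the expected range."""
    for ch in qual:
        qv = ord(ch) - offset
        if qv < 0 or qv > MAX_EXPECTED_QV:
            # Old behavior: treat the record as corrupt and abort the run.
            raise ValueError(f"Unexpected quality value {qv}; input may be corrupt")


# The one-line style of fix: raise (or drop) the ceiling so Q40+ bases pass QC.
MAX_EXPECTED_QV = 60
```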
Higher accuracy will help unlock new uses for NGS — or at least help lower the cost of certain applications, such as rare variant detection in cancer — but from bioinformatics to sample preparation, Q40 data are posing — or exposing — problems. Software programs aren't optimized to use these data, and library preparation methods may even be introducing errors that, until now, were considered background noise.
"We've gotten accustomed to looking at things through one lens," said Shawn Levy, chief scientific officer at Element. "That's true for library prep, that's true for software, and that's true for [estimating sequencing biases]." Taking an optimistic view, he said that means "we now have a chance to re-look at our tools and the opportunity to evolve those tools along with sequencing quality."
But just as these technologies are pushing the envelope on accuracy, some researchers are pursuing a new definition of the concept in sequencing, one based on variant calling performance, rather than per-base statistics.
"A lot of folks will talk about Q40 data as if it's the be all and end all," Element CTO and Cofounder Michael Previte said. "We know and have done the science to understand what that means and that it isn't the only thing that matters when talking about accuracy."
40 is the new 30
For years, Q30 accuracy has been a mythic threshold in sequencing. By definition, it represents a 99.9 percent probability that a particular base call is correct. More generally, this so-called Phred score is defined by the equation Q = -10 log10(P), where P is the probability that a base call is wrong. Thus, Q30 means a per-base error rate of 1 in 1,000. To be ultra-precise, sequencing instrument spec sheets list the percentage of bases expected at or above a certain Phred score. Illumina says its NovaSeq 6000, the paragon of short-read sequencing, can deliver more than 85 percent of bases at Q30 or higher on an S2 flow cell in 2x150 bp runs. (Paired-end sequencing is, on its own, another way to improve sequencing accuracy.)
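In code, the conversion between Q scores and error rates is a one-liner in each direction; a quick sketch:

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Per-base error probability P for a Phred score Q, i.e. P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred score Q for a per-base error probability P, i.e. Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 error in {round(1 / p):,} bases ({100 * (1 - p):.3f}% accuracy)")

# Q30: 1 error in 1,000 bases (99.900% accuracy)
# Q40: 1 error in 10,000 bases (99.990% accuracy)
# Q50: 1 error in 100,000 bases (99.999% accuracy)
```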
High accuracy is important for most sequencing applications: for whole-genome sequencing, it enables high-quality assemblies, while for targeted, needle-in-a-haystack use cases, it boosts the confidence with which researchers can say they've found something different or important.
"In general, accurate is almost always better," said Chris Mason, a sequencing expert at Weill Cornell Medicine. "If you want to look at really rare alleles or minimal residual disease, you need either lots of reads or really high accuracy, so if you see it once or twice you can be sure it was real."
Getting Q40 data can help with mapping cancer-causing variants or with early cancer detection, as well as with human leukocyte antigen (HLA) phasing for use in immunology research or organ transplantation in the clinic.
Levy suggested that highly accurate reads are also valuable for nascent "low-pass sequencing" applications, where the coverage isn't always available to provide a strong consensus. "That's where you start seeing the value of Q40 reads," he said.
New approaches to short-read sequencing, namely Element's "sequencing by avidity" and PacBio's "sequencing by binding," part of its $800 million Omniome acquisition in 2021, are offering a new paradigm of sequencing accuracy.
Element's November preprint on its method showed data from a run where 96 percent of base calls were above Q30, with 85 percent above Q40, topping out at Q44. And last month, PacBio showed data from two internal testing runs of its new short-read sequencing platform Onso that suggested 90 percent of bases had quality scores well above Q40, or 99.99 percent accuracy.
Some Onso beta customers, like Mason, are even seeing a majority of bases over Q50, or 1 error in 100,000, over the length of a 100 bp read. Recent unreviewed data from Mason's lab suggest greater than 85 percent of bases are in the range of Q50 to Q55.
Illumina's new XLeap-SBS chemistry may also be able to achieve these types of Q scores. However, at its launch in September the NovaSeq X — the first instrument to run the new chemistry — had the same accuracy specs as its predecessor.
Even long-read sequencing methods, once maligned as error prone, can deliver reads above Q40, though not with the same consistency as short-read platforms. At last month's Advances in Genome Biology and Technology annual meeting, University of California, Santa Cruz researcher Karen Miga presented accuracy plots from reads generated using PacBio's HiFi and Oxford Nanopore Technologies' Duplex sequencing methods, each with peaks near Q40, though the centers of the distributions were closer to Q30.
Does the sequencing field place too much emphasis on Q scores? "Yes and no," said Adam Phillippy, head of the Genome Informatics Section at NHGRI. "There's not too much importance placed on quality, but the devil is in the details."
Q scores are an average, he noted, and the distribution of scores can hide biases toward certain errors. These new technologies, in general, have some weaknesses sequencing longer repeats, Justin Zook, of the National Institute of Standards and Technology and co-leader of the Genome in a Bottle (GIAB) consortium, said in an email. Understanding where the new methods aren't as accurate will be important, he added.
But the new methods also make different errors than previous ones, which is potentially helpful in showing bioinformaticians where blind spots had been lurking. Long homopolymers, stretches of the genome in which the same base is repeated many times, are one such area where the new methods could be helpful, Zook said.
Upstream and downstream
Simply using reads with higher Q scores won't change genomics instantaneously, however. Beyond the Cell Ranger instance, Element officials say that bioinformatics pipelines need to adjust for researchers to get the most out of better data. "We were curious as to why we weren't seeing as much significant benefit as we expected," Element's Previte said.
The tricks bioinformaticians have used to deal with errors, such as soft-clipping — masking bases that do not align to a reference — or read filtering are based on assumptions that were useful when dealing with Q30 data, but maybe not with Q40. "There's even assumptions about the read depth needed to call a certain variant," Levy said.
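As one small example, a typical mean-quality read filter hard-codes the kind of threshold Levy is describing. The cutoff below is an illustrative Q30-era rule of thumb, not any specific pipeline's default.

```python
def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred quality of a FASTQ quality string (Phred+33 encoding)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def passes_filter(qual: str, min_mean_q: float = 20.0) -> bool:
    """Keep a read only if its mean quality clears the threshold.
    A cutoff like 20 made sense for Q30-era data; with Q40+ reads it removes
    almost nothing, and the useful signal shifts to other metrics."""
    return mean_quality(qual) >= min_mean_q
```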
On the other end of the sequencing workflow, better accuracy is exposing errors introduced by sample preparation, which had been masked by amplifying DNA in the SBS process.
"As soon as you introduce PCR, the Q scores start to deteriorate," Previte said. Element was close to specifying that Q40 data would only be available with PCR-free library preparations, he said, although he suggested that the firm found a sample prep solution to be confident in making its accuracy claims.
Sample prep chemistry developers are aware of the issues. "We do see opportunities for improvement based on accuracy and other quality metrics," a New England Biolabs spokesperson said in an email. The company has been collaborating with several sequencing firms to ensure compatibility between the instruments and sample prep products and sees the advent of Q40 sequencing as a way to put the spotlight on high-fidelity polymerases.
How long it will take for upstream and downstream technologies to adjust remains to be seen. Phillippy suggested that any software issues would be easy fixes. "Higher accuracy makes everything easier," he said. "Basecallers like [Google's] DeepVariant would have to be retrained on the new data types, but if it's just higher accuracy it'll just work better."
A wonderful cycle
While companies are pushing the technical performance of their sequencers, some researchers are rethinking what it means to call sequencing "accurate."
According to Element's Previte and Levy, Q scores are only one facet of accuracy. In internal experiments, Element scientists simulated perfect reads to run through analysis pipelines. "It allowed us to calibrate our expectations of Q40 data," Levy said. "The most important lesson was that accuracy can't stand in isolation."
Not only did library prep and alignment affect data quality, so too did other sequencing metrics like insert size (the length of DNA between sequencing adapters), whether a read was paired-end or single-end, and consistency across the length of a read, they said. (Q scores can often dip towards the end of a read.)
Separately, Phillippy is embarking on a new project that he hopes will reframe ideas about accuracy. Building on his work assembling reference genomes with the human pangenome and Telomere-to-Telomere consortia, both of which he co-leads with UCSC's Miga, Phillippy has over the last few months begun talking about another "aspirational" project: the Q100 genome.
Aside from being a nice round number, it represents one error in 10 billion bases, or the level of accuracy needed to reasonably ensure a perfect 6 Gb diploid human genome. A perfect genome simply isn't possible right now, Phillippy conceded, noting that the ribosomal DNA repeat arrays are outside the read length of even the longest nanopore reads. But working towards a perfect genome could help create what he conceives as the first "comprehensive benchmark" for genome sequencing.
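The arithmetic behind that target is simple; a back-of-the-envelope sketch of the expected number of errors across a roughly 6 Gb diploid genome at a given consensus accuracy:

```python
# Expected genome-wide errors at a given consensus quality, for a ~6 Gb
# diploid human genome (back-of-the-envelope, assuming a uniform error rate).

DIPLOID_GENOME_BASES = 6e9

for q in (30, 40, 50, 100):
    expected_errors = DIPLOID_GENOME_BASES * 10 ** (-q / 10)
    print(f"Q{q}: ~{expected_errors:,.1f} expected errors")

# Q30: ~6,000,000.0 expected errors
# Q40: ~600,000.0 expected errors
# Q50: ~60,000.0 expected errors
# Q100: ~0.6 expected errors  (well under one across the whole genome)
```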
"My goal for Q100 is to make the benchmark the complete diploid genome," rather than a set of variants within a genome, he said. "Benchmarking to date has been based on unique, easy-to-call regions of genome. As long as that's the standard, that's what drives progress. If you're only tested against the easy parts, it's easy to get a good score."
Specifically, Phillippy wants to make GIAB sample HG002 the first perfectly accurate human genome. Already, he has begun discussions with Zook about the project. An early return could be a model for the tradeoff between read length and accuracy, Phillippy said. And "including the nasty bits," such as centromeres, will push technology developers to create better sequencing methods that can call everything in the human genome, including structural variants.
The benchmarks will improve the sequencing technology, which in turn will help polish the benchmarks. "It's all a wonderful cycle," he said. "Until everything is perfect."