Researchers at the Broad Institute have developed new methods designed to more accurately define quality scores for next-generation sequencers, and to detect SNPs using next-gen data.
The scientists applied their quality score determination method, which could be used for several new sequencing platforms, to 454 Life Sciences’ sequencing system and improved the quality scores provided by the vendor. Better quality values, they and others argue, could improve the accuracy of results gained from even low-coverage data, and might eventually help users decide which next-gen platform to use for a certain application.
Quality values, or quality scores, express the uncertainty of the data, or the likelihood that a base call is incorrect. For example, the phred algorithm assigns a quality value to each base in a Sanger read, with larger numbers designating smaller error probabilities. A Q20 value corresponds to a 1-in-100 error probability, and a Q30 value to a 1-in-1,000 error rate.
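The phred scale is a simple logarithmic transform of the error probability, which the following sketch illustrates (function names are ours, chosen for clarity):

```python
import math

def phred_quality(error_prob):
    """Convert a base-call error probability to a phred quality score."""
    return -10 * math.log10(error_prob)

def error_probability(q):
    """Invert a phred quality score back to an error probability."""
    return 10 ** (-q / 10)

phred_quality(0.01)     # a 1-in-100 error chance maps to Q20
phred_quality(0.001)    # a 1-in-1,000 error chance maps to Q30
error_probability(20)   # and Q20 maps back to a 0.01 error probability
```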
The reason quality values need to be derived differently for each next-gen sequencing platform is that each platform has a different error profile, according to Todd Smith, CEO of Geospiza, a sequencing software company. “Even though we get Gs, As, Cs, and Ts at the end, the instruments are fundamentally going about that process differently,” he explained.
For some next-gen sequencers, such as Illumina’s Genome Analyzer, the interpretation of the signal and the meaning of the quality values are relatively close to those of capillary sequencers; for others, like 454’s Genome Sequencer, ABI’s SOLiD, and Helicos’ sequencer, they differ substantially from Sanger sequencing, according to Gabor Marth, an assistant professor of biology at Boston College. His group recently published a new base-calling program, called Pyrobayes, that produces more confident base calls than 454’s own program.
Phred-based quality scores, he said, are “just not really descriptive enough” for some of the next-gen systems. For example, it is difficult for them to take into account deletions and insertions, which are the dominant types of sequencing error on the 454 platform.
Marth said one possible solution would be to assign several base quality scores instead of just one. However, many software programs that analyze sequence reads only take into account a single base quality value, so it would be “a waste” to use several, he added.
The Broad researchers used the phred algorithm to combine different error predictors for the 454 platform into a single quality score, and applied it to large training data sets for which the true DNA sequence was known. They then compared the predicted base qualities with the actual ones.
They published their method last week online in Genome Research.
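The underlying idea — bin bases by their observed error predictors, count empirical errors against a known reference, and assign each bin a phred score — can be sketched as follows. The predictor choices, binning, and smoothing here are illustrative assumptions, not the Broad’s published implementation:

```python
import math
from collections import defaultdict

def train_quality_table(bases, truth):
    """Learn an empirical quality score for each combination of error
    predictors.  `bases` is a list of (predictor_bin, called_base) pairs
    from reads of a sample whose true sequence is known; `truth` gives the
    correct base for each entry."""
    counts = defaultdict(lambda: [0, 0])  # predictor bin -> [errors, total]
    for (predictors, called), correct in zip(bases, truth):
        bin_counts = counts[predictors]
        bin_counts[1] += 1
        if called != correct:
            bin_counts[0] += 1
    table = {}
    for predictors, (errors, total) in counts.items():
        # Add-one smoothing keeps bins with no observed errors finite.
        p_err = (errors + 1) / (total + 1)
        table[predictors] = round(-10 * math.log10(p_err))
    return table

# Toy training set: predictor bins are (homopolymer_length, read_half),
# two hypothetical predictors chosen for illustration only.
training = [(((1, 0), "A"), "A"), (((1, 0), "C"), "C"),
            (((1, 0), "G"), "C"),              # one error in this bin
            (((4, 1), "T"), "T")] * 25         # 100 observations total
bases = [b for b, _ in training]
truth = [t for _, t in training]
table = train_quality_table(bases, truth)
```

A new run is then scored by looking up each base’s predictor bin in the trained table; retraining on fresh data is all that an updated chemistry requires.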
Compared with the quality values provided by 454’s own software, the Broad’s scores are more accurate and yield more high-quality bases, according to Jared Maguire, a computational biologist who leads the subgroup for new sequencing technologies within the Broad’s group for computational R&D, which published the method.
454 incorporated the Broad’s quality scores as its default scores in its latest software update, which debuted in early February, according to a company spokesman. 454 also uses the quality scores in its mapping and assembly software, which is included with the instrument, resulting in “better assemblies,” he said.
The reason that vendor-supplied quality scores — at least those from 454 and Illumina — are less accurate than those determined by the Broad is that vendors typically don’t have access to the same amount and variety of training data that the Broad is able to produce, Maguire said. “From the point of view of the vendor, they have so much on their plate, they are really just trying to get a working system out the door,” he said.
But not all next-generation sequencers will likely benefit from the Broad’s recently published method. For instance, the program “might not necessarily be the ideal algorithm” for Illumina’s platform, Maguire said. Therefore, he and his colleagues have also developed a new algorithm that they have applied to Illumina sequencing data.
The program, which Maguire presented at last month’s Advances in Genome Biology and Technology meeting in Marco Island, Fla., uses a different model “that can handle larger amounts of data [and] more varied data,” he said.
According to a description of the program in the conference abstract, the software accounts for many features of the Illumina platform, such as signal level and purity, read position on the sequencing array, local image quality, readout from spiked internal controls, and sequence context.
Both Broad methods could also be applied to other systems, such as ABI’s SOLiD or Helicos’ HeliScope, according to Maguire. “We try to stay pretty agnostic about the [sequencing] system,” he said.
“I imagine each platform will have a slightly different algorithm, [but] the methods for developing that algorithm will follow a common path,” Geospiza’s Smith said. “You develop many test sets, you develop a panel of quality values, and then you compare them to the truth you are observing.”
The Broad’s methods can also be used to adjust quality scores after updates to the hardware or chemistry of existing next-gen sequencing platforms. “It just means that you have to collect new data on the new hardware and retrain,” Maguire explained.
After analyzing data from Illumina’s platform, Marth’s group found that the system’s error rate goes up as the number of sequencing cycles increases. Overall, he said, the company's base quality values “are appropriate for” the Illumina platform, but he proposed recalibrating them to make them more accurate.
Since quality values are “really close to the machine,” he said, vendors “should focus on this more than they are focusing on now.”
Illumina has its own internal development program “to increase the accuracy of error estimates, based on a number of different run statistics,” according to Jordan Stockton, Illumina’s market manager for computational biology. The company, he said, “is leveraging its recent expertise in human whole-genome sequencing to identify the greatest opportunities to improve quality scoring.”
Illumina also has several collaborators who are “actively looking at what types of experimental factors contribute to the accuracy of a given sequence read,” Stockton told In Sequence by e-mail this week. He said Illumina provides them with predictive tables that the company uses to make quality estimates and gives them the tools to build and implement their own predictor tables.
ABI, for its SOLiD system, determines a quality score for each color call, each of which encodes two adjacent bases, following the phred convention, according to Michael Rhodes, ABI’s senior manager of product applications for the SOLiD. It also assigns a score to each read, based on the average quality values.
Although users convert the color-call quality scores into phred scores for individual bases by averaging the two error probabilities, the company recommends that users stay in “color space” until they generate a consensus sequence. “Then a variety of methods can be used to calculate the quality value,” Rhodes said.
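The averaging conversion mentioned above can be sketched as follows. This is a simplification users apply for convenience, not an ABI-endorsed method; a single color error actually propagates differently in color space:

```python
import math

def color_q_to_base_q(q_left, q_right):
    """Approximate a base quality from the two adjacent color-call
    qualities.  In two-base color encoding each base is covered by two
    color calls, so one rough conversion is to average the two error
    probabilities and re-encode the result on the phred scale."""
    p_left = 10 ** (-q_left / 10)
    p_right = 10 ** (-q_right / 10)
    p_base = (p_left + p_right) / 2
    return -10 * math.log10(p_base)

color_q_to_base_q(20, 20)  # two Q20 color calls yield a Q20 base
color_q_to_base_q(10, 30)  # one weak call dominates: roughly Q13
```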
Rhodes told In Sequence that ABI is presently unaware of outside groups working on base callers for the SOLiD system but said that the four intensity values for the color calls are available to users who want to work with them.
Experts agree that accurate base quality values are more important for some applications than for others.
Quality values matter especially for determining rare base variants, for example, or for calling SNPs with confidence when the sequence coverage is low. “To say that I have a sequence difference at a particular spot, and this is occurring at a very low frequency in a population … I need to be pretty certain that what I am seeing in my data is true,” said Smith.
Obtaining accurate results at lower sequence coverage could lower the cost of sequencing. “You don’t need the same kind of depth of coverage because you can draw the same conclusions just as accurately for much less data” using accurate base quality scores, Marth explained. “You know which reads to trust and which ones not.”
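A toy calculation shows why calibrated scores let users draw conclusions from less data: under an independent-error assumption, the chance that several mismatching reads are all sequencing errors is the product of their per-base error probabilities. This is an illustration, not a real SNP caller:

```python
import math

def all_errors_prob(qualities):
    """Probability that every one of these mismatching base calls is a
    sequencing error, assuming errors are independent (a toy model)."""
    return math.prod(10 ** (-q / 10) for q in qualities)

# Three reads disagree with the reference at one site.  With trustworthy
# Q30 calls the all-error explanation is vanishingly unlikely, so even 3x
# coverage supports a variant; with Q10 calls it does not.
all_errors_prob([30, 30, 30])  # ≈ 1e-9
all_errors_prob([10, 10, 10])  # ≈ 1e-3
```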
Also, quality values are important in de novo sequencing applications “for eliminating junk” where no reference is available, according to Maguire.
On the other hand, so-called counting applications, like ChIP-sequencing or digital gene expression, are less reliant on quality values, as long as the reads can be aligned to the correct location in the genome.
Finally, a “common language” to evaluate the quality of sequencing data “is essential to compare results from different systems and to make sensible decisions about which sequencing method is suitable for each application,” the Broad scientists write in their article. However, Maguire cautioned that “we are not at that stage yet” with the new technologies.