NEW YORK, Jan. 23 (GenomeWeb News) - When Applied Biosystems debuted its first automated Sanger sequencer in the 1980s, researchers had to wait 10 years before the base-calling program Phred emerged on the scene to assign quality scores to sequence data. But that was then, and this is now.
Emerging next-generation sequencing companies are well aware that they need a reliable program to assess their data. So, what should the program look like and who should develop it?
According to Jeff Schloss, program director of technology development at the National Human Genome Research Institute, one thing next-generation sequencing vendors need to focus on is being Phred-friendly.
"Phred has become the de facto standard for sequence quality," Schloss said. "Each new instrument will need to develop a standard that users can understand in the context of Phred quality scores."
Each new sequencing company will have to develop its own quality standards for its instruments. "To the extent that signals from various of the new platforms are similar, the quality measures may be similar," Schloss said.
Initially released in 1998, Phred assigns each base call a quality score on a logarithmic scale. The score reflects the estimated probability that the call is wrong, which tells users how confident they can be that the base identified by the software is biologically accurate.
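The logarithmic scale can be made concrete with the standard Phred definition, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The Python sketch below is illustrative only; the function names are hypothetical and are not part of Phred itself.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert an estimated error probability into a Phred-scaled quality score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Convert a Phred-scaled quality score back into an error probability."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(f"P(error) = {p} gives Q = {phred_quality(p):.1f}")
```

On this scale a score of 20 corresponds to a 1-in-100 chance that the call is wrong, and a score of 30 to a 1-in-1,000 chance.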
454 Life Sciences is one company that decided to follow Phred's lead when developing its own software. For 454, a Phred-like scale not only provides a sense of familiarity but also allows researchers to more easily combine data from its instrument with data from Sanger-based sequencers. "It was imperative that we give researchers a way of assessing the quality on the same scale," said Marcel Margulies, vice president of engineering at 454.
Margulies said 454 realized early on that quality standards had to be a priority, and wrapped up development of its own software, called Quality Score, a year ago. The company then spent several more months validating its quality scores by comparing predicted versus observed quality across a large variety of genomes.
Like Phred, Quality Score uses a logarithmic scale, but since the chemistries and error models are different, the algorithms are different, too. The confidence of a particular base call from a 454 instrument depends on whether it is part of a homopolymer. 454's sequencer uses polymerase-induced synthesis to build a complementary sequence, adding one nucleotide at a time and monitoring pyrophosphate release to determine when a specific base gets taken up. If, for example, there are two adenines in a row, researchers will see twice as much signal as they would from a single adenine. Errors can creep in when homopolymers are long, because the difference in signal between adjacent lengths becomes harder to resolve.
"If there are 10 A's in the sequence in the next extension step, it is hard for the instrument to tell whether there are 10 A's or 9 A's," said Gene Myers, who is on the scientific advisory board at 454 and a group leader at Janelia Farm Research Campus. "There is an internal model built on the probability of seeing a signal of a certain level based on there being 9 A's versus 10 A's, and then the Phred-like number expresses that in a natural mathematical way."
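The idea Myers describes can be sketched with a toy model. The Python below is not 454's algorithm: it assumes, purely for illustration, that the signal for a run of n identical bases is Gaussian with mean n and a noise level that grows in proportion to n, and that all run lengths are equally likely beforehand. The function name, the noise_fraction parameter and the cap on the score are all hypothetical.

```python
import math


def homopolymer_call(signal: float, noise_fraction: float = 0.05,
                     max_length: int = 15) -> tuple[int, float]:
    """Return (most likely run length, Phred-like quality) for one signal.

    Toy assumption: a run of n bases yields a Gaussian signal with mean n
    and standard deviation noise_fraction * max(n, 1).
    """
    likelihoods = []
    for n in range(max_length + 1):
        sigma = noise_fraction * max(n, 1)
        z = (signal - n) / sigma
        # Gaussian density up to a constant factor shared by every n.
        likelihoods.append(math.exp(-0.5 * z * z) / sigma)

    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("signal is too far from every modelled run length")

    # Uniform prior over run lengths, so the posterior is the normalized likelihood.
    posteriors = [lik / total for lik in likelihoods]
    best = max(range(max_length + 1), key=lambda n: posteriors[n])

    # Probability that the chosen length is wrong, floored so the score is capped at 60.
    p_error = max(1.0 - posteriors[best], 1e-6)
    return best, -10.0 * math.log10(p_error)


if __name__ == "__main__":
    for s in (1.0, 10.0):
        length, quality = homopolymer_call(s)
        print(f"signal {s}: call {length} bases, quality {quality:.1f}")
```

Under these assumptions a signal of 1.0 is called as a single base with a very high score, while a signal of 10.0 is still called as 10 bases but with a much lower score, because runs of 9 and 11 remain plausible.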
Other sequencing companies have their own challenges. Because Solexa's sequencer works in a completely different way from 454's, its program for measuring quality, which it calls Bustard, is dramatically different. Solexa developed a Phred-like base-call scoring scheme for its raw read data, "except that for each base we assign a score to each of the four possibilities ... rather than one," said Clive Brown, director of computational biology and IT at Solexa. "This gives more information for error rate estimation, correct consensus calling, etc. This set of four numbers can be easily converted into exactly the same quality scoring system that Phred uses."
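One way such a conversion could work is sketched below in Python. Brown does not spell out how the four scores are defined or converted, so this is a guess at the general shape rather than Solexa's method: it assumes the four numbers are probabilities for A, C, G and T that sum to 1, calls the most probable base, and treats the combined probability of the other three as the error probability. The function name and input format are hypothetical.

```python
import math


def four_scores_to_phred(base_probabilities: dict[str, float]) -> tuple[str, float]:
    """Collapse per-base probabilities into (called base, Phred-scaled quality).

    Assumption for illustration: the input maps each of A, C, G, T to a
    probability and the four values sum to 1.
    """
    if set(base_probabilities) != {"A", "C", "G", "T"}:
        raise ValueError("expected probabilities for exactly A, C, G and T")
    if abs(sum(base_probabilities.values()) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")

    called = max(base_probabilities, key=base_probabilities.get)
    # The call is wrong if any of the other three bases is the true one.
    p_error = sum(p for base, p in base_probabilities.items() if base != called)
    p_error = max(p_error, 1e-6)  # floor so the score is capped at 60
    return called, -10.0 * math.log10(p_error)


if __name__ == "__main__":
    base, quality = four_scores_to_phred({"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01})
    print(f"called {base} with quality {quality:.1f}")
```

Keeping all four values until this final step is what preserves the extra information Brown mentions; the single Phred-style number discards which alternative base was the runner-up.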
Helicos Biosciences, on track to be third in line to release an instrument, is also working on a quality strategy but is keeping it under wraps. "It's our policy not to discuss product specifics prior to release," said Stanley Lapidus, CEO of Helicos Biosciences.
Public or Private?
While everyone agrees that quality standards for next-generation sequencing data are essential for scientists to embrace the new instruments, not everyone agrees these assessment tools should come from the vendors. Chad Nusbaum, co-director of the Genome Sequencing and Analysis program at the Broad Institute, pointed to the fact that Phred didn't emerge from ABI, but from the research community -- specifically scientists from the
"My hope is that [quality measures] are going to come out of the community the same way [they] did with Phred," Nusbaum said. "It's best if these things grow out of the user community. I think for any kind of quality scores to have the confidence of the community, they have to be an academic enterprise."
Nusbaum is interested in coming up with quality scores in conjunction with other researchers. "People are thinking about it in a number of places, and I think we just have to put our heads together," he said.
But is this duplication of efforts really necessary?
Green said Phred came out of the research community because ABI, back in the 1990s, didn't have a data-quality program - there was a huge gap that needed to be filled.
"There are a lot of competent people at these [next-generation sequencing] companies and it could well be that they have a group that can develop appropriate quality measures, and they're good enough that there is no reason for anybody else to work on it," said Green, who is also a Howard Hughes Medical Institute investigator. He believes, however, that if companies develop their own quality measures, they should allow scientists access to the raw data. "I think the research community should have the opportunity to develop quality measures, and to do that, they need to get access to the raw data," he said.
So far, it seems emerging companies are trying to be as open as possible. 454 said all raw data is available to users, and Solexa said it consulted with the research community when it developed Bustard.
"We are developing this in conjunction with public domain and academic experts in the field in order to ensure that we have the best, most openly and widely accepted data possible," said Solexa's Brown. "In many ways we seek to encourage the processes that gave rise to the Phred phenomenon."
Kate O'Rourke covers the next-generation genome-sequencing market for GenomeWeb News.