Researchers from the Marine Biological Laboratory at Woods Hole have devised a quality-control method for weeding out and discarding bad 454 sequence reads, thus increasing the accuracy of the data the platform produces.
According to the researchers, the method will be particularly useful for microbial diversity studies, where accuracy is especially important because researchers may need to interpret single reads without building consensus sequences.
“We think, actually, that 454, if we are not greedy, is remarkably robust as a sequencing technology,” said Mitchell Sogin, director of the Josephine Bay Paul Center at MBL and a co-author on a paper describing the method that was published last month in Genome Biology.
The MBL scientists found that even the raw read accuracy of the Genome Sequencer 20 was higher in their hands, at 99.5 percent, than the 96 percent that 454 reported for a sequencing project in its original study, published in Nature in 2005. Using their QC method, the MBL team increased the read accuracy to more than 99.8 percent.
The researchers started out by determining the accuracy of the 454 system in their own laboratory, prompted by a study last year in which they analyzed a liter of seawater using 454’s GS20. Comparing sequences of the V6 hypervariable region of rRNA, they found on the order of 25,000 different bacterial species, about ten times more than anybody had reported before.
However, the scientists, who published their results in PNAS a year ago, could not be sure whether this great diversity was real or a result of sequencing errors introduced by the 454 platform, Sogin said. “So we decided to do a more rigorous type investigation about what was the real error rate of 454.”
A low sequencing error rate is especially important for the microbial diversity projects that his lab focuses on, he said, which range from marine biology projects to analyses of microbial populations in mice and chickens. These studies differ from whole-genome sequencing projects, where scientists sequence the same regions many times over and build more accurate consensus sequences, to which they can compare individual reads. “We don’t have that option in an environmental sample,” Sogin said. “Every read means something.”
For the Genome Biology study, the researchers chose 43 reference templates from divergent bacteria, which they had sequenced several times in both directions using Sanger sequencing. “We knew exactly what the sequence was for each of those templates,” Sogin said. They then sequenced PCR amplicons from these templates using 454’s GS20 and compared the 454 reads to the known sequences.
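As a rough illustration of what comparing reads to known sequences can look like in practice, the Python sketch below counts the differences between each read and its reference template and divides by the number of reference bases. The article does not say how the MBL team counted errors, so the use of edit distance, the choice of denominator, and the function names `edit_distance` and `error_rate` are assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of substitutions,
    insertions, and deletions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(pairs):
    """pairs: iterable of (read, reference) strings.

    Returns total edit-distance errors divided by total reference bases
    (an assumed definition, not necessarily the one used in the study).
    """
    errors = 0
    bases = 0
    for read, ref in pairs:
        errors += edit_distance(read, ref)
        bases += len(ref)
    return errors / bases if bases else 0.0
```

Under this definition, one substitution across two four-base templates gives a rate of 1 in 8; real reads would first need to be trimmed and matched to the correct template.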
The raw read error rate was 0.5 percent, better than the 4 percent reported by 454 in its 2005 Nature publication. Eighty-six percent of the reads had no errors at all. Sogin said the improvement is likely due to updates in 454’s base-calling algorithm and image-processing code, and may also relate to changes in the system’s chemistry.
However, despite the improvement, even an error rate of 0.5 percent “could be significant in an environmental metagenomics study,” Sogin said.
Sogin and his colleagues then looked for ways to identify “bad” 454 reads. They found several:
First, some reads were longer than expected, and tended to contain a lot of errors. These long reads most likely result from multiple templates attaching to the same bead, Sogin said. While most researchers want to push the read length as much as they can, he cautioned that there might be a risk in including the longest reads. “If you are getting a really nice long read, it might be full of errors and you won’t know,” he said.
Secondly, reads that were shorter than expected were also error-prone. These might be aborted reads or those that the 454 software trimmed at the end because they did not pass its quality standard. Researchers are well advised to exclude such short reads, too, Sogin said.
Finally, reads that included one or more ambiguous, or uncalled, nucleotides — indicated by an “N” in the sequence — within the sequence window where the instrument performs well also had a high error rate. “It turns out that that is a very important discriminator,” Sogin said.
Ns are introduced when the software cannot decide which base to call, because the signal intensities are all the same. “It knows there should be something there, but each of the bases came out negative,” explained Susan Huse, a research associate in Sogin’s lab who led the study. The reason for uncalled bases could be mixed templates, poor priming or elongation, or other sequencing problems.
After removing all reads containing at least one N — in the MBL researchers’ case, 6 percent of all reads — the error rate of the remaining reads decreased to 0.25 percent. After further removing 1 percent of all reads that were particularly long or short, or had inexact matches to the primer, the error rate dropped to less than 0.2 percent.
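The filtering steps described here can be sketched in a few lines of Python. This is a minimal illustration, not the MBL team's code: the article gives no length cut-offs or primer sequence, so `min_len`, `max_len`, and `primer` are hypothetical parameters, and the sketch assumes the primer sits at the start of each read and must match exactly.

```python
def passes_qc(read: str, primer: str, min_len: int, max_len: int) -> bool:
    """Return True if a read survives the three checks described above."""
    seq = read.upper()
    # 1. Discard reads with any uncalled base.
    if "N" in seq:
        return False
    # 2. Discard reads that are unusually short or long
    #    (thresholds are hypothetical, chosen by the user).
    if not (min_len <= len(seq) <= max_len):
        return False
    # 3. Discard reads whose start does not match the primer exactly
    #    (assumes the primer is at the 5' end of the read).
    if not seq.startswith(primer.upper()):
        return False
    return True


def filter_reads(reads, primer, min_len, max_len):
    """Keep only the reads that pass every QC check."""
    return [r for r in reads if passes_qc(r, primer, min_len, max_len)]


if __name__ == "__main__":
    example = ["ACGTTTAA", "ACGTNTAA", "ACGT", "ACGTTTAAGGCCTT", "AGGTTTAA"]
    print(filter_reads(example, primer="ACG", min_len=6, max_len=10))
```

In the toy input, only the first read is kept: the others fail on an N, on length, or on the primer match, in that order.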
The results, Sogin said, surpass the quality of Sanger reads, which he said typically ranges from 98 percent to 99.5 percent accuracy, mostly depending on the quality of the DNA template. What distinguishes Sanger from 454 reads, though, is the ability to statistically determine the quality of each individual base call in a Sanger read, he said. 454, on the other hand, cannot currently assess the quality of every base call, “although it does attempt to assign quality scores to stretches of nucleotides that reside within homopolymer stretches,” Sogin said.
While the quality of Sanger sequencing could also be improved by eliminating low-quality reads, that would be very costly, he said. Because the cost of 454 sequencing is so low, “it’s not a big deal to throw data away,” he said. “The trick is to recognize which reads to throw away.”
Sogin recommends that 454 users eliminate any reads that contain uncalled bases, or reads that deviate a lot from the average read length, even if that means losing 10 percent of the data. “You should not be uncomfortable with the idea of throwing away any reads that have one or more Ns present,” he said.
Jonathan Eisen, an evolutionary biologist and a professor in the Genome Center at the University of California, Davis, who was not involved in the study, said in an e-mail message that his group, like Sogin’s, has used 454 sequencing for microbial diversity studies but was “quite worried” about the quality of the data. That changed when he heard Sogin give a talk about his new method to improve the accuracy.
“Their work convinced me that 454 can be much more useful for diversity studies than we had imagined,” Eisen said. However, he believes that Sanger sequencing will still have its place because of its longer reads, the ability to obtain paired reads more easily, and the fact that the clones required for Sanger sequencing can be used for other purposes as well.
454 is not planning to incorporate Sogin’s method into its own software, according to a Roche spokesman, who noted that the company gives GS FLX users access to source code files for many of its applications, so they can integrate and develop their own tools. The company does not provide access to the instrument control software or other software needed to run the instrument, though.
Sogin said 454’s software lets users configure the number of uncalled bases they are willing to tolerate in a read, and to adjust what the minimum or maximum read lengths should be. However, it takes some effort to customize these parameters. “Only users who are really into the system [can] find out where those configuration files are,” he said.
Sogin’s lab received a GS FLX as an upgrade to the GS20 several weeks ago. He said he does not yet know whether the FLX’s longer reads will improve the accuracy of the system for his projects, but he guessed that the number of reads that pass the quality control step will increase.