This article has been updated from a previous version to clarify details about the raw error rate of the SOLiD system as opposed to color-corrected data from the system.
To help researchers handle the wealth of short-read data from second-generation sequencers, and the Applied Biosystems SOLiD sequencer in particular, several academic groups have built new alignment tools and are adding features to more mature software.
Both ABI and Illumina have been encouraging third-party software development for their second-generation sequencers. Illumina, for example, earlier this year extended its Illumina Connect partnership program for third-party bioinformatics providers to support software development for its Genome Analyzer [BioInform 02-22-08], and Applied Biosystems last year launched a website to support software development for the SOLiD sequencing platform and to offer its own tools for the system [BioInform 09-07-07].
In July, ABI released the first open source software tool for SOLiD that it had not developed internally: SHRiMP, the Short Read Mapping Package, developed at the University of Toronto by graduate student Stephen Rumble and computer scientist Michael Brudno in collaboration with Arend Sidow and his lab at Stanford University. Brudno is currently adding new features to the tool.
This month, two new tools joined the site: SOCS, Short Oligonucleotides in Color Space; and BFAST, for BLAT-like Fast Accurate Search Tool.
Not a Shrimp
Brudno told BioInform that since July there have been 401 downloads of all SHRiMP versions, and 171 downloads of the latest version, 1.1.0.
At the moment he and his colleagues are working to enhance the algorithm.
“SHRiMP is expanding in several directions: we are making the underlying algorithm better, trying to speed it up, make it more accessible to a larger array of users,” he said. For now you have to have “significant computational power if you want to run SHRiMP for human genome datasets,” he added.
Brudno said he also wants the tool to work for RNA sequencing, which “is going to be really big.”
SHRiMP is made up of several programs that search for alignment, analyze alignment statistics, and offer the user visual presentations of the alignment. As outlined in the software’s readme file, the algorithm starts with a k-mer hashing step to locate areas that are similar between the reads and the reference genome. One feature of this algorithm is that it indexes the reads rather than the reference genome, thus creating a tool for which memory use is independent of total genome size.
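The read-indexing idea described above can be illustrated with a minimal Python sketch. It is not SHRiMP's code: it assumes contiguous, exactly matching k-mers, and the function names `index_reads` and `find_candidates` are hypothetical. The point it shows is that the hash table is built from the reads, so its size depends on the read set while the genome is only streamed past it.

```python
from collections import defaultdict

def index_reads(reads, k):
    """Map each k-mer to the (read_id, offset) pairs where it occurs in a read."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].append((read_id, offset))
    return index

def find_candidates(genome, reads, k):
    """Scan the genome once and collect candidate (read_id, genome_start) pairs.

    genome_start is where the read would begin if the shared k-mer lines up.
    Candidates still need a full alignment step to be confirmed.
    """
    index = index_reads(reads, k)
    candidates = set()
    for pos in range(len(genome) - k + 1):
        # .get() avoids inserting genome k-mers into the read index
        for read_id, offset in index.get(genome[pos:pos + k], ()):
            start = pos - offset
            if 0 <= start <= len(genome) - len(reads[read_id]):
                candidates.add((read_id, start))
    return candidates

# Toy data, invented for illustration only
genome = "ACGTTGCAAGGCTTACGATCG"
reads = ["TGCAAGGC", "TTACGTTC"]
hits = find_candidates(genome, reads, k=4)
```

Here the first read is placed at genome position 4 and the second, which contains errors near its end, is still short-listed at position 12 because its leading k-mers match.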
Brudno has a compute cluster of 50 machines at his disposal, which facilitates genome-wide studies. “But we want to make [the software] accessible to someone who has 10 machines,” he said.
Other SHRiMP changes underway include paired-end mapping and making the algorithm “splice-site aware,” he said.
First on Brudno’s to-do list is the algorithm improvement stage, propelled mainly by head developer Rumble’s imminent graduation and move to Stanford University for his PhD, “so he needs to finish up his SHRiMP work in the next couple of months,” Brudno said.
Other programmers are working on the mRNA mapping functionality. “Stage 1 will be the final algorithmic version of SHRiMP and then we will tweak to get RNA sequencing going,” he said.
Brudno said SHRiMP is “quite mature” and that he has been “talking to a few companies interested in building some extensions” on top of it.
He did not disclose the names of those companies, however, noting that plans have not yet been finalized.
Pulling up SOCS
In the October 7 issue of Bioinformatics, Nicholas Bergman and his colleagues at the Georgia Institute of Technology School of Biology and the Electro-Optical Systems Laboratory at Georgia Tech Research Institute described SOCS as a way to map SOLiD sequence data to a reference genome.
SOLiD’s two-base encoding scheme means that data is first collected in “color space,” in which the color provides information about two adjacent bases that must then be decoded into sequence data. As the Georgia Tech team indicates in the paper, the fact that each base is interrogated twice “helps in discriminating between sequencing errors and true polymorphisms.”
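As a rough illustration of how one color can carry information about two adjacent bases, the sketch below uses the commonly described two-bit scheme in which A, C, G, and T are coded 0 to 3 and each color is the XOR of neighboring base codes. The article does not spell out the mapping, so treat the table and the helper names `encode` and `decode` as assumptions made for this example.

```python
# Two-bit base codes; the color for a pair of adjacent bases is their XOR.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def encode(seq):
    """Return (first_base, colors) for a DNA string of length >= 1."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode(first_base, colors):
    """Rebuild the DNA string from a known first base and a list of colors."""
    bases = [first_base]
    for color in colors:
        # XOR is its own inverse, so the previous base plus the color
        # determines the next base.
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)

first, colors = encode("ATGGC")   # colors == [3, 1, 0, 3]
roundtrip = decode(first, colors) # "ATGGC"
```

Decoding needs a known starting base, and every base except the ends contributes to two colors, which is the sense in which each base is interrogated twice.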
They also point out that the raw error rate of SOLiD data prior to color-correction is “significantly higher” than other sequencing platforms, which leads to read-mapping challenges since current tools are not mismatch-tolerant beyond three mismatches, and “also leaves a sizable fraction of each data set unused.”
An ABI spokesperson noted that the SOLiD data subsequently undergoes color-correction, using two-base encoding, which interrogates each base twice and is able to achieve 99.97 percent accuracy. The spokesperson added that the fact that SOLiD delivers unfiltered data accounts for the increases in the proportion of mappable data as compared to Illumina.
Bergman and his colleagues decided to create a tool that was more mismatch-tolerant and that would “maximize the number of usable sequences” in a data set.
SOCS is written in C++ and built on a variation of the Rabin-Karp string search algorithm, which uses hashing to accelerate the matching to the reference genome, and is similar to the algorithm used to analyze Illumina Genome Analyzer data, the scientists wrote.
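For readers unfamiliar with Rabin-Karp, the following Python sketch shows the textbook exact-match form of the algorithm: a rolling hash over each window of the reference is compared with the hash of the read, and only hash hits are checked character by character. SOCS uses a variation that tolerates mismatches and is written in C++, so this is background for the idea of hashing to accelerate matching, not a rendering of SOCS itself. The function name and the `base` and `mod` values are choices made for this example.

```python
def rabin_karp(text, pattern, base=4, mod=1_000_000_007):
    """Return every start position where pattern occurs exactly in text."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leftmost symbol in a window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + codes[pattern[i]]) % mod
        t_hash = (t_hash * base + codes[text[i]]) % mod
    matches = []
    for i in range(n - m + 1):
        # Verify on hash equality to rule out collisions
        if t_hash == p_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m]
            t_hash = ((t_hash - codes[text[i]] * high) * base
                      + codes[text[i + m]]) % mod
    return matches

positions = rabin_karp("ACGTACGTGACG", "ACG")  # [0, 4, 9]
```

Updating the hash costs constant time per position, which is why the approach scales to long references.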
According to the study’s authors, the software can use multiple processors and be implemented in a cluster. In the paper, they report mapping in 17 hours a 32 million read data set to a reference with a tolerance of four mismatches. With SOCS, users specify the mismatch tolerance, but it maps the lower tolerances first to reduce the data that must be mapped at higher tolerances.
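The strategy of mapping at lower tolerances first can be sketched as a loop that raises the allowed mismatch count one step at a time and passes only the still-unmapped reads to the next, more expensive round. This is a toy illustration under stated assumptions: it uses a brute-force Hamming-distance scan in place of SOCS's hashing, and `map_by_increasing_tolerance` is a hypothetical name.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_by_increasing_tolerance(reads, reference, max_mismatches):
    """Map reads at tolerance 0, then 1, and so on up to max_mismatches.

    Returns (placements, unmapped), where placements maps a read index to
    (tolerance_at_which_it_mapped, list_of_start_positions).
    """
    placements = {}
    unmapped = list(range(len(reads)))
    for tolerance in range(max_mismatches + 1):
        still_unmapped = []
        for read_id in unmapped:
            read = reads[read_id]
            hits = [pos for pos in range(len(reference) - len(read) + 1)
                    if hamming(read, reference[pos:pos + len(read)]) <= tolerance]
            if hits:
                placements[read_id] = (tolerance, hits)
            else:
                still_unmapped.append(read_id)
        unmapped = still_unmapped  # only these reads face the next tolerance
    return placements, unmapped

# Toy data, invented for illustration only
reference = "ACGTTGCAAGGCTTACGATCG"
reads = ["TGCAAGGC", "TGCATGGC", "CCCCCCCC"]
placements, unmapped = map_by_increasing_tolerance(reads, reference, 2)
```

In this toy run the first read maps exactly, the second maps once one mismatch is allowed, and the third remains unmapped at the user-specified limit of two.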
In an e-mail to BioInform, Bergman explained that SHRiMP is “a lot more customizable” than SOCS, “but using the options properly requires knowledge of how the alignment works. There is a somewhat complex relationship between its sensitivity and speed that depends on various seeding parameters.”
SHRiMP also has more features than SOCS. For example, it can handle insertions and deletions, which wasn’t as important to the Georgia Tech researchers, Bergman said.
“SOCS will work well ‘out of the box’ without knowing what it's doing under the hood,” he said. “It was originally designed for transcriptome mapping, where it makes sense to map as many of the reads as possible in a reasonable amount of time.”
Bergman added that SOCS is “straightforward” to use because users set a threshold for the number of mismatches to the reference genome that a read is allowed to have, “and it will map everything with that many or fewer; in other words, 100 percent sensitivity within the mismatch tolerance.”
“One thing we emphasized in developing SOCS was making it easy to get the most out of your system,” he said. “You can tell it how much RAM and how many processors to use and it will configure itself to run as fast as possible within those constraints.” It also provides users with maps of sequence census and mismatch census.
Being specific to the SOLiD platform, SOCS also takes into account the quality scores generated by a SOLiD run — when a read has more than one potential match within the tolerance, the quality of each color call is used in determining the optimal match, Bergman said.
Nils Homer, a graduate student in Stanley Nelson’s lab at the University of California, Los Angeles, has developed another short-read alignment tool called BFAST that is based on Jim Kent’s BLAST-Like Alignment Tool, BLAT.
Homer’s lab was working on a large-scale sequencing and re-sequencing project and wanted to map reads in which up to 10 percent of the reads might contain errors. However, he said, “We found we couldn’t really find variants while we were using BLAT.”
The team developed an algorithm based on BLAT that offers greater accuracy, he said. “We also found a way to increase the speed by an order of magnitude as well.”
On a 20-node cluster, he said, he and his colleagues can map an entire human genome in a day. He said papers on BFAST are currently under review.
Like SHRiMP, BFAST can handle insertions and deletions in the SOLiD platform. “A lot of people are worried about SNP detection, [but] we’re also worried about insertions and deletions,” Homer said.
BFAST was first programmed for the Illumina Genome Analyzer and has been adapted for SOLiD’s two-base encoding scheme, he said. “We have developed a little method to simultaneously decode that and align it,” he said.
Both BFAST and SHRiMP generate a “fingerprint” or “key” of the read. BFAST finds all possible locations where this fingerprint occurs in the reference genome and creates an index of that information. “You can use that index multiple times, as many times as you are going to align to that genome,” he said. “So you only need to do that once.”
SHRiMP, in contrast, generates keys from the reads and indexes the reads, he said. “The look-up in the reference genome index is a speed difference between BFAST and SHRiMP,” he said. BFAST short-lists possible locations for a read and then analyzes which one is most likely to be accurate. “We don’t have time to look at the whole genome for each read,” Homer said.
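The reference-indexing approach Homer describes can be sketched in Python as follows: the index is built once from the reference, each read's keys are looked up to short-list candidate locations, and the short list is then ranked. This is a simplified illustration, not BFAST's code: it assumes contiguous k-mer keys and ranks candidates by mismatch count, and the function names are hypothetical.

```python
from collections import defaultdict

def build_reference_index(reference, k):
    """Build the k-mer index once; it can be reused for every read set."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def best_location(read, reference, index, k):
    """Short-list candidate starts from the index, then pick the best one.

    Returns (start, mismatches), or None if no k-mer of the read is indexed.
    """
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    if not candidates:
        return None

    def mismatches(start):
        window = reference[start:start + len(read)]
        return sum(a != b for a, b in zip(read, window))

    # Fewest mismatches wins; ties go to the leftmost position
    best = min(candidates, key=lambda s: (mismatches(s), s))
    return best, mismatches(best)

# Toy data, invented for illustration only
reference = "ACGTTGCAAGGCTTACGATCG"
index = build_reference_index(reference, k=4)   # built once
first_hit = best_location("TGCATGGC", reference, index, k=4)
second_hit = best_location("TTACGATC", reference, index, k=4)
```

Only the short-listed positions are ever compared with the read, which reflects the point that there is no time to look at the whole genome for each read.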
The differences between the two tools, he said, may play out in whole-genome resequencing, which involves around 20x coverage, or 20 times the number of bases of the reference genome. “SHRiMP is indexing the larger dataset and BFAST is indexing the smaller dataset. Since either program must sort the two lists to generate an index or hash, BFAST will be faster in the creation step,” he said. Since BFAST’s index is smaller, the index lookups will be faster, he said.
BFAST uses the standard Smith-Waterman local alignment algorithm for its final sequence comparison. “This is the optimal way to perform local sequence alignment and can handle insertions, deletions, and mismatches. We have also developed a novel extension of the algorithm to perform color space alignment where the color space read is simultaneously decoded and aligned,” he said.
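For reference, a compact Python version of the standard Smith-Waterman recurrence with traceback is given below. It works in ordinary letter space with a linear gap penalty and does not attempt the color-space extension Homer mentions. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary choices for this example, not BFAST's parameters.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start anywhere (local alignment)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = H[i][j]
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")  # gap in b (deletion)
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])  # gap in a (insertion)
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

score, aligned_a, aligned_b = smith_waterman("TTACGTTT", "GGACGGG")
```

With these scores the best local alignment of the two toy strings is the shared `ACG`, for a score of 6. The table costs time proportional to the product of the two lengths, which is why tools apply it only to short-listed locations.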
Comparing alignment tools to benchmark them, he said, is commonly based on the percentage of mapped reads in a given alignment. “I don’t agree with that at all,” he said. “You care if they align, but you care that they align to the right place.”
Only after looking at the accuracy does he choose to look at algorithm speed. BFAST, he said, has a built-in time-accuracy trade-off. “If 20 percent of your bases are errors … then it is going to take longer because you need more and more indices.”
Others in the Shed
Brudno said that SHRiMP supports both Illumina’s Genome Analyzer and ABI’s SOLiD. “In general there are a number of great tools out there for Illumina,” he said. “There is Casava, there is Maq, which quite a few people use and are extremely happy with.”
Software development for the SOLiD is still catching up. For color space, there was “really nothing,” he said, noting that even ABI’s internally developed SOLiD software tool had its disadvantages at first, but “it has gotten a lot better.”
Maq, or Mapping and Assembly with Qualities, maps short reads to reference sequences and can also accomplish variant calling. It gives a probability score for each alignment that informs users about the quality of their mapping, and, for example, estimates the error probability of each read alignment.
Brudno noted that any comparisons between software tools are platform dependent. “When it comes to Illumina data, the Maq people are ahead of us; they have really thought about the problem carefully,” he said. “SHRiMP is not meant to be nearly as fast, however it is much more sensitive.”
In Maq and most other tools, he explained, the entire read is needed in order to perform mapping. “In SHRiMP, we are perfectly happy to match half the read. Especially as the reads get longer, SHRiMP will be more applicable to things like reads from RNA data where you can expect to have some reads split at the exon boundary,” he said.
Maq has implemented a number of features that SHRiMP lacks, he said, for example mate-pair support. “Maq is also a lot faster; that is something which we concede completely.”
SHRiMP is not meant to be fast, he said. “It is meant to be sensitive.” That has to do with both the algorithm and that “we are trying to detect even partial matches,” which is “a lot more difficult from an algorithmic perspective.”
Brudno said he sees his role less as developing the SHRiMP tool than as developing algorithms to understand color space, which he finds interesting “from a mathematical perspective.”
“Given equally accurate data from letter space and color space, color space is better,” he said. “Unfortunately for now they are not equally accurate” when comparing data sets that are treated at different stages of analysis, he said, noting that the raw error rate per color of the SOLiD platform is around 4 percent, while Solexa filtered data is at 1 percent.
Brudno, like Bergman, emphasized the advantage of color space in differentiating SNPs from sequencing errors.
“For example, if a read maps to a particular part of the reference genome with one mismatch, researchers will wonder, ‘Is that position a sequencing error or maybe it is a SNP?’” he said.
Sequencing errors in color space, however, will be apparent, because two sequences differing in one color are going to differ greatly in terms of actual DNA sequence. “So you can reduce the effect of sequencing errors,” he said.
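A small Python experiment makes this concrete. It assumes the commonly described XOR-based two-base code (A, C, G, T coded 0 to 3), which the article does not spell out, and uses a made-up eight-base sequence. A true single-base change alters two adjacent colors, while a single miscalled color changes every decoded base downstream of it.

```python
BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"

def encode(seq):
    """Colors for each adjacent base pair (XOR of two-bit base codes)."""
    return [BITS[a] ^ BITS[b] for a, b in zip(seq, seq[1:])]

def decode(first_base, colors):
    """Rebuild bases from a known first base and a list of colors."""
    bases = [first_base]
    for color in colors:
        bases.append(LETTERS[BITS[bases[-1]] ^ color])
    return "".join(bases)

def diff_positions(x, y):
    """Indices at which two equal-length sequences differ."""
    return [i for i, (p, q) in enumerate(zip(x, y)) if p != q]

reference = "ACGTACGT"
ref_colors = encode(reference)            # [1, 3, 1, 3, 1, 3, 1]

# Case 1: a real SNP (T -> A at base 3) changes two adjacent colors
snp_colors = encode("ACGAACGT")
snp_color_diffs = diff_positions(ref_colors, snp_colors)    # [2, 3]

# Case 2: one miscalled color (index 2 read as 2 instead of 1)
error_colors = list(ref_colors)
error_colors[2] = 2
error_decoded = decode("A", error_colors)                   # "ACGATGCA"
error_base_diffs = diff_positions(reference, error_decoded) # [3, 4, 5, 6, 7]
```

Because an isolated color change is inconsistent with any single-base substitution, an aligner working in color space can flag it as a likely sequencing error instead of a candidate SNP.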
Putting the Tools to the Test
Bioinformatician Quang Trinh at the Ontario Institute for Cancer Research and his colleagues are currently testing both the Genome Analyzer and SOLiD platforms. They are comparing mapping performance and accuracy with various reads, as well as different alignment programs, by, for example, testing many alignment software tools against different reference genomes.
When the data comes off the instrument, Trinh runs it through the respective vendor pipeline software, followed by SHRiMP and Maq. “We are still in the process of comparing the output generated from SHRiMP and Maq and to see what the differences are to decide which one to use,” Trinh told BioInform.
Speed is part of the equation, he said. “But accuracy is more important.” In his experience, “SHRiMP handles color [space] better,” he said. “SHRiMP doesn’t do SNP calling and SNP calling is something we are interested in as well.”
Paired-end mapping is another feature he would like SHRiMP to have, he said.
As far as SOCS and BFAST are concerned, Trinh said he is open to “giving them a try,” too.
“At some point we will have to stop and take the one that worked the best for us, the one we can trust,” he said. At the same time, a software tool may lend itself to a specific platform or a specific scientific question. “The way I look at it, there is data coming off the instrument, we run the vendor’s pipeline, and after that there will be a long pipe of post-processing analysis that depends on scientific questions we want answers to.”
“Short reads are a craze at this stage,” Brudno said. “It’s a problem for today but not for all ages.”
Eventually, he said, perhaps in around three years, short read tools will “go away” as read length increases. “When we were developing SHRiMP, the goal was to map 25[-mer] long reads and today even AB SOLiD puts out 50[-mer] long reads. Give it another two years and there may be 200[-mer] long reads.”
While short-read problems may wane, the technology has delivered important lessons, he said, for example “dealing with high coverage datasets, dealing with massive RNA sequencing, and large datasets in general.”
“Technology moves; 10 years ago people were [saying], ‘Oh my, we have a whole genome’s worth of data, how are we going to analyze it?’ Today, people say, ‘I have 500 human genomes, what do I do with that?’”