One of the biggest challenges these days for next-gen sequencing is informatics, and that ranges from primary analysis to secondary alignment and assembly. While many researchers continue to take advantage of vendor-supplied pipelines, a slew of third-party software packages is becoming available as users try to make the most of what the new instruments can do. One of the most challenging tasks confronting scientists is choosing the best algorithm to use. While many programs exist for mapping short reads, there's still a paucity of good tools for emerging applications such as transcriptome analysis and detection of structural variation like CNVs, indels, and inversions.
"The most important factor to me is whether an algorithm is practical," says Heng Li, a postdoc in Richard Durbin's lab at the Sanger Institute and author of the popular alignment tool, Maq. "Being practical, an algorithm has to achieve a good balance between accuracy, features, speed, and memory because no single algorithm can outperform the others in all these aspects."
The University of Toronto's Michael Brudno, whose alignment algorithm, SHRiMP, is one of the few capable of mapping the color read data from Applied Biosystems' (now Life Technologies') SOLiD, says that at the moment there are no exceptions to the rule that speed and sensitivity come at each other's expense. While some programs are faster, others are more accurate. "Alignment is a tradeoff between sensitivity and speed," he says. "If you align against a very small region, you can be much more sensitive." Although Smith-Waterman is still the best alignment algorithm, it is better suited to Sanger reads and is too slow to map millions of short reads back to a genome, especially a large mammalian one.
While local alignment using the Smith-Waterman algorithm is part of many of these new algorithms, "it's more of a question of how do you hack [Smith-Waterman] to work fast with this type of data," says David Craig, a researcher at the Translational Genomics Research Institute in Arizona. "Some aligners are very good at identifying substitutions, some aligners are better at identifying indels. So it really depends on the question and what the research is focused on that helps us choose which alignment algorithm [to use]."
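To make concrete what is being "hacked," here is a minimal Python sketch of textbook Smith-Waterman local alignment with a linear gap penalty. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not settings taken from any aligner mentioned in this article. The nested loops fill a table whose size is the read length times the reference length, which suggests why the unmodified algorithm is impractical for millions of reads against a whole genome.

```python
def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_read, aligned_ref) for one best local alignment.

    Scoring values are illustrative assumptions, not those of any real aligner.
    """
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming table; a cell never drops below zero,
    # which is what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            diag = score[i - 1][j - 1] + sub   # match or substitution
            up = score[i - 1][j] + gap         # extra base in the read (insertion)
            left = score[i][j - 1] + gap       # base missing from the read (deletion)
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best-scoring cell until the score reaches zero.
    i, j = best_pos
    aligned_read, aligned_ref = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if read[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_read.append(read[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_read.append(read[i - 1])
            aligned_ref.append("-")
            i -= 1
        else:
            aligned_read.append("-")
            aligned_ref.append(ref[j - 1])
            j -= 1

    return best, "".join(reversed(aligned_read)), "".join(reversed(aligned_ref))


# A read missing one reference base is aligned with a gap (a small deletion).
print(smith_waterman("AAAACCCC", "AAAAGCCCC"))
# -> (14, 'AAAA-CCCC', 'AAAAGCCCC')
```

Because the gap is found during the table fill itself, this style of alignment can report small indels directly. The cost is the quadratic table, which is what the faster aligners try to sidestep.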
Getting picky with aligners
While there's no shortage of third-party aligners out there, many users opt first for vendor-supplied pipeline software and then branch out to open-source code. Roche's Genome Sequencer FLX comes with an aligner called Newbler, and most users choose it, especially when nothing else is available and they're "kinda locked in," says Bruce Roe, whose University of Oklahoma lab has done a lot of beta testing on the FLX. Others, like the University of Arizona's Rod Wing and JGI's Feng Chen, also default to Newbler, but they acknowledge having to explore their options when it comes to large amounts of data and shorter reads from Illumina and SOLiD. Gabor Marth at Boston College says the choice boils down to ease of use and speed, and at the moment there really just aren't that many aligners to choose from. His own tool, MOSAIK, is one of the few wide-spectrum aligners: it can align reads from capillary sequencing as well as from Roche, Illumina, and Life Technologies instruments. Importantly, it also does gapped alignment, which detects indels. Gapped alignment is necessary on the Roche platform, Marth says, because that platform is known to exhibit more insertion and deletion errors than Illumina.
Different software is needed to handle shorter reads from Illumina and SOLiD. Eland is the default software that comes with the Illumina sequencer, and the package is "really an amazing software," says Li. "Although the core part has not been modified for quite some time, Eland is still generally better than many new aligners." Illumina just launched GenomeStudio, an improved version of its data analysis package for the Genome Analyzer. The software offers new alignment modules for finding SNPs and performing transcriptome analysis.
Eland is not known for aligning longer reads, meaning those greater than 32 base pairs. However, Li says, embedded Perl scripts do allow the software to align short reads and mate pairs efficiently. A lot of people use Li's own Maq, written early last year, because it is better for paired-end reads and longer reads. Maq, or Mapping and Assembly with Qualities, does both mapping and variant calling, and it attaches a quality score to its output. Michael Dorschner, director of the University of Washington's high-throughput genomics unit, says he uses Eland, which runs fast on the types of short reads he aligns, including those from tag-based assays like ChIP-seq, DNase-seq, and RNA-seq. Charles Nicolet at the University of California, Davis, says, "A lot of people around here use Maq and Velvet [for assembly], because … they're free, relatively straightforward, and designed to handle the kind of output generated by the instruments."
Exhaustive aligners, like Maq and Novocraft's Novoalign, come into play when accuracy is more important than speed, such as for SNP and indel detection. Li says that a lot of end users like the rich feature set of Maq, which includes not only the quality score, but also its ability to do gapped alignment for paired-end reads, its SNP calling function, and additional tools for downstream analyses. "However, Maq is much slower than Eland. I know sometimes people may like to use Eland when the computing resource is really limited," Li says.
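Quality scores of this kind are conventionally reported on the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). Maq's mapping qualities follow this convention. The short Python sketch below shows the conversion in both directions; the function names are hypothetical and are not part of Maq.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to the probability of an error."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)


# A score of 30 means roughly a 1-in-1,000 chance that the call is wrong.
print(phred_to_error_prob(30))
```

On this scale, every 10 points of quality corresponds to a tenfold drop in the chance of error, which is why downstream tools such as SNP callers can filter or weight alignments by it.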
Finding software that handles a specific research goal is important. "The number one decider is what it is we're doing, and one of the key things that helps decide that is, are we trying to find indels or not," says TGen's Craig, who uses a variety of aligners for finding small insertions and deletions, including Eland, Maq, and SOAP. Like Novoalign, SOAP is optimized for exhaustive, whole genome alignment of short reads. Another aligner that Craig relies on is BFAST, developed by Nils Homer for SNP detection. Because he and his team do targeted sequencing — for example, sequencing candidate genes for autism that might put their alignment search around 200,000 bases instead of the entire genome — "we can explore more options," he says.
The SOLiD technology is different because its reads are reported in color space, which requires either the default Mapreads aligner or another tool that can convert color space to letter space. For assays performed on the SOLiD, Dorschner has been using Mapreads as well as Maq for alignments. "Maq is faster for performing alignments to the whole genome," he says. For resequencing efforts, he uses SOLiD's own software since it incorporates a SNP caller as part of the standard analysis pipeline.
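As a rough illustration of what converting color space to letter space involves, the Python sketch below uses the standard SOLiD two-base encoding, in which each color (0 to 3) describes the transition between two adjacent bases. The function names are hypothetical, and the code is a simplified sketch rather than a description of how Mapreads, Maq, or any other tool is implemented.

```python
# Two-bit base codes chosen so that XOR of adjacent bases gives the SOLiD color:
# same base -> 0, A<->C or G<->T -> 1, A<->G or C<->T -> 2, A<->T or C<->G -> 3.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def encode_colors(primer_base, seq):
    """Encode a base sequence as colors, starting from a known primer base."""
    prev = _CODE[primer_base]
    colors = []
    for base in seq:
        cur = _CODE[base]
        colors.append(prev ^ cur)
        prev = cur
    return colors


def decode_colors(primer_base, colors):
    """Decode colors back to bases; each base depends on the one before it."""
    prev = _CODE[primer_base]
    bases = []
    for color in colors:
        prev ^= color
        bases.append(_BASE[prev])
    return "".join(bases)


colors = encode_colors("T", "ACGT")
print(colors)                            # -> [3, 1, 3, 1]
print(decode_colors("T", colors))        # -> ACGT
# One miscalled color (second position) changes every base from there on.
print(decode_colors("T", [3, 0, 3, 1]))  # -> AATG
```

The last line hints at why naive conversion is fragile: because each decoded base depends on the previous one, a single color error corrupts the rest of the read. That makes it attractive to align in color space directly instead of converting first.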
David Craig uses BFAST for his SOLiD runs, since it's one of the few algorithms besides Mapreads that can handle color reads. "[Nils] has been working on detection of indels with ABI — that's not really standard yet — and so he's spent a lot of time focusing on that. He's made advancements that they just haven't had time to," Craig says.
The two other aligners most commonly used as alternatives to Mapreads are Maq and SHRiMP. SHRiMP, which stands for Short Read Mapping Package, was developed by Michael Brudno's lab in Toronto. "For regular DNA data, it just does the Smith-Waterman algorithm, [but] for color space, we've developed a version of the Smith-Waterman algorithm that does simultaneous alignment and handling of color space," Brudno says.
One of the main reasons that a user would choose SHRiMP over Mapreads is that it can do local alignments, says Francisco de la Vega, distinguished scientific fellow at Life Technologies. It can also detect small insertions and deletions right away, whereas Mapreads does a quick global alignment and then goes back with post-processing scripts to find gaps. "SHRiMP can do that from the beginning because essentially it's dynamic programming, but the penalty you pay is that SHRiMP is quite slow and requires more memory than Mapreads," he says.
Looking for base callers
While vendors typically connect users to open source alternatives for secondary analysis, they haven't yet opened up base callers to the public. UW's Dorschner says that he uses the vendor-supplied base callers for both Illumina and SOLiD "primarily because they are already integrated into the analytical pipelines. We haven't spent much time looking at alternative base callers as these would be stand-alone applications that would require integration with the analysis pipeline."
Nicolet at UC Davis adds that "the pros are it's wired into the instrument analysis package, so there are few things the operator can do to mess up. That's also a con, because there is less flexibility and you're stuck using the parameters [the vendor] thinks are important." For certain applications, like very long reads or genomes with distorted base ratios, this really does hurt analysis, he says.
Out-of-the-box base callers are typically not open source, which is one of the reasons people have started writing their own. UK developer Nava Whiteford has recently made his SWIFT software, which processes image data and does base calling, publicly available for beta testing. Another base caller is Alta-Cyclic from Greg Hannon's lab at Cold Spring Harbor Laboratory. Alta-Cyclic uses machine learning algorithms to increase the number of accurate reads at lengths of up to 78 bases.
According to recent posts at the Solexa Google users group, many users are concerned about whether to archive run data, and how much of it. While many choose either not to back up reads at all or to delete data after a short period of time, the consensus seems to be that, in anticipation of improved and more accurate third-party base callers, users should instead consider saving at least the run data so they can re-run their sequence using a different base caller. However, BC's Gabor Marth thinks that there will be less and less need for additional base callers as vendors improve their own. "I think we'll see that the base callers become better out of the box, which was not the case to begin with — at least they will be good enough for [common] applications," he says.
Assembly and emerging applications
While multiple options exist for mapping and alignment, software that handles de novo assembly is scarce. "De novo assembly is a much harder informatics problem," Marth says. Velvet, developed by Ewan Birney's group at the European Bioinformatics Institute, is one of the most commonly used assembly programs. Among others are the Broad's Allpaths and Pavel Pevzner's Euler-USR. "I think there are a lot of people working on assembly, but I'm not seeing the same flurry of four or five or six different programs out there," says Craig at TGen. "Whenever I ask people, 'What are you using?' the answer always seems to be Velvet. And that's what we're using."
Marth says that it's still unclear what the capabilities are for current assemblers and whether it's practical to use them for large genomes. "I'm optimistic that maybe in the next six months we'll be seeing some large genomes de novo assembled with maybe a combination of longer and shorter reads," he says.
The ability to handle mate pairs from paired-end sequencing runs is also an area that many users say needs work, especially because it can go a long way toward improving de novo assembly. While all four second-gen vendors (Roche, Life Technologies, Illumina, and Helicos) now support paired-end sequencing protocols, there are only a handful of applications that can assemble paired reads. Up next for SHRiMP are improvements in aligning mate pairs, Brudno says: "We're improving mate-pair support."
Software has to catch up with emerging sequencing applications, too. As it stands, Marth sees many users adapting existing mapping algorithms to accommodate newer applications such as transcriptome analysis and bisulfite sequencing. For each, one needs an aligner that is flexible enough to, say, align across intron-exon boundaries or deal with artificially mutated DNA, Marth says.
"A lot of people are looking at chromosomal abnormalities, and there aren't really good solutions yet," Craig says. "How do you find a balanced translocation with short reads and paired ends?"
Marth thinks that callers for SNPs and structural variation are "in flux," but that good software tools for both will be available within the next six months. Life Technologies expects to release tools for finding inversions and translocations and for performing whole transcriptome analysis next year, and also plans to start working on methylation analysis.
Heng Li hopes simply to make things easier for users who, at the end of the day, are the ones facing tough choices without all that much to go on. His new alignment program, BWA, will be "much more efficient than Maq," he says. He's also working on a generic alignment format called SAM, which is a collaborative product of the 1000 Genomes Project and "will become the only format when the alignment is released," he says.