As new DNA sequencing instruments with shorter reads and higher data output gain traction, software companies that specialize in assembling and analyzing sequence data are adapting their offerings.
One early mover has been DNAStar, a 24-year-old bioinformatics company based in Madison, Wis., which is making a concerted effort to stay ahead of the next-generation sequencing curve. Last fall, the company launched a new version of its flagship bioinformatics software suite, Lasergene 7.1, with enhanced capabilities for assembling large numbers of short sequence reads, such as those from 454 Life Sciences’ platform. Now, the company is working on an improved version of the assembler in order to keep pace with more recent advancements in the field, such as paired-end reads.
“It’s clearly a growing area,” Tom Schwei, the company’s vice president and general manager, told In Sequence. Schwei said that the company has seen “phenomenal” interest from customers in assembling next-generation sequencing data, “and we think it will only continue to grow.”
Although even the previous version of the company’s software was able to assemble short reads, Schwei said that it wasn’t optimized for heavy-duty usage until Lasergene 7.1.
“You want to make sure you optimize parameters that best help you assemble a genome with reads that short,” Schwei said. Also, shorter reads require greater coverage, and thus more reads, to obtain the same assembly quality as from Sanger reads. “You have to make sure that you are optimizing the way you use memory and other things to handle [this amount of data] well,” Schwei added.
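The arithmetic behind that point can be sketched in a few lines. This is only a rough illustration, not DNAStar's method: the genome size, read lengths, and coverage depths below are assumed round numbers chosen for the example, and `reads_needed` is a hypothetical helper that applies the standard relation coverage = reads × read length / genome size.

```python
import math

def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Smallest read count such that reads * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)

# All figures below are illustrative assumptions, not values from the article.
genome = 2_000_000                            # hypothetical 2 Mb microbial genome
sanger_reads = reads_needed(genome, 800, 8)   # assumed ~800 bp reads at 8x coverage
short_reads = reads_needed(genome, 100, 20)   # assumed ~100 bp reads at 20x coverage

print(sanger_reads, short_reads)
```

Under these assumptions, the short-read project needs 20 times as many reads as the Sanger project, which is why memory use becomes a concern for a desktop assembler.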
In response, Lasergene 7.1 included a new assembly engine, SeqMan Pro, which can handle large numbers of both Sanger and 454 reads. That engine, which runs on a desktop computer, can handle at least 10 times more sequence reads than the previous version of the SeqMan contig assembler, which could handle 20,000 to 30,000 Sanger reads. It also contains a new algorithm that speeds up the assembly process. According to the company, SeqMan Pro has been shown to be capable of assembling over 900,000 reads. Customers have reported using it to assemble 300,000 to 400,000 reads from 454’s GS 20 in one afternoon.
The company is now working on an improved version, SeqMan Genome Assembler, slated to come out later this year. This assembler will be able to handle paired-end reads and “has some repeat-handling capability that is above and beyond what we have been able to put into SeqMan Pro,” Schwei said. The company is planning to develop this engine further in the future so it can assemble an entire human genome, but “it will take a little while to get there,” he added.
Although it is possible now to use SeqMan Pro to assemble an entire microbial genome from scratch using only 454 data, Schwei said that DNAStar is seeing the greatest use of its software for mixed assemblies that use a combination of Sanger reads and 454 reads to fill gaps and finish the genome.
One of the challenges of this hybrid approach is that ABI’s capillary electrophoresis sequencers and 454’s instrument use different quality scores, he said. While DNAStar has developed its own algorithm to create quality scores for ABI instruments, “we have not attempted to do that yet for 454,” Schwei said. The reason, he said, is that 454’s own quality scores may change in the future, “so I think until we see where that settles out, we would not want to put too much effort and energy [into it].”
The same is true for other next-generation sequencers that are coming online now, he said, such as Illumina’s Solexa Genome Analyzer or ABI’s Agencourt Advanced Genetic Analysis platform, which might require additional changes to SeqMan Pro. “As we start to see output from those machines and work with a few partners who have those machines, we will be testing it,” Schwei said. “At this point, it’s a little early for us to say whether there are changes needed.”
SeqMan Pro competes with other academic and commercial assembly software packages that can handle a mix of Sanger and 454 reads. For example, two recent eukaryotic genome sequencing projects that used data from both sequencing platforms utilized, respectively, the FORGE assembler developed by Darren Platt at the Department of Energy’s Joint Genome Institute, and an assembler developed by researchers at Myriad Genetics (see In Sequence, 01/23/2007). 454’s own Newbler assembler currently handles only 454 reads, but at a meeting last fall a 454 executive said a future version will be able to incorporate both data types.
DNAStar believes that its tools will appeal to customers even if other assemblers become available, since Lasergene has capabilities that go beyond the assembler. “We expect next generation sequencing will further stimulate the entire sequence analysis market,” Bob Steinhauser, the company’s director of marketing, told In Sequence by e-mail.