SALT LAKE CITY — Emerging next-generation sequencing technologies are reviving interest in commercial bioinformatics tools, according to some bioinformatics firms that showcased their software at the annual Association for Biomolecular Facilities conference here this week.
A number of software vendors at ABRF 2008 were touting new solutions for analyzing or managing data generated by next-generation sequencing instruments, and several reported an upsurge in interest over recent years — a change of fortune that vendors welcome after limping through the dry spell that settled on the market after the hype from the Human Genome Project faded in the early 2000s.
“Next-generation technologies have gotten people enthused about sequencing again,” Todd Smith, CEO of Geospiza, told BioInform during the meeting.
His company was previewing a version of its FinchLab laboratory information management system that it is designing specifically for next-gen sequencing data. Geospiza plans to launch the system, called FinchLab Next Gen Edition, in April.
SoftGenetics was also demonstrating an upcoming product that is targeted at the high-throughput sequencing market. The software, called GeneNext, can quickly perform de novo assembly and mutation analysis on a laptop using data from 454 Life Sciences’ GS FLX, Illumina’s Genome Analyzer, and Applied Biosystems’ SOLiD system. It will be released in early March.
John Fosnacht, vice president of sales and marketing for SoftGenetics, said that the company has exhibited at a handful of previous ABRF conferences, and that this year’s meeting “was the best ABRF conference we’ve had.” He said that interest among attendees was “predominantly in the next-generation sequencing software.”
Fosnacht estimated that out of the 400 or so academic attendees at the conference, the company generated “in excess of 70 leads” from labs that “have or are putting in next-generation systems and have no way of analyzing the data.” He said that many of these groups — particularly small to mid-sized core labs — “don’t have the funding, the time, or the expertise to get things running fast enough on their own.”
Core lab managers at the conference supported these observations. “Most labs really don’t have a concept of what it takes to analyze this data,” said Peter Schweitzer, director of the DNA Sequencing and Genotyping Core Facility at Cornell University.
Schweitzer made his remarks during a presentation in which he described his lab’s experience with Illumina’s Genome Analyzer, and said that the system generates around 700 gigabytes of data, which can quickly add up to considerable storage requirements. In addition, he said, “you must move the data several times” in order to analyze it, which quickly leads to data-transfer bottlenecks. He estimated that it takes around six hours per run to transfer data from the instrument to external hard drives.
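As a rough back-of-the-envelope check, the figures Schweitzer cited imply a sustained transfer rate of a little over 30 megabytes per second. The sketch below assumes that the 700 gigabytes is a per-run total, which the presentation as reported does not state explicitly, and it uses decimal units; the function name is purely illustrative.

```python
def transfer_rate_mb_per_s(data_gb: float, hours: float) -> float:
    """Average throughput in decimal megabytes per second."""
    return (data_gb * 1000) / (hours * 3600)

# Assumption: roughly 700 GB moved in roughly six hours, per run.
rate = transfer_rate_mb_per_s(700, 6)
print(f"{rate:.1f} MB/s")
```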
Dick McCombie, a professor at Cold Spring Harbor Laboratory, also discussed the informatics hurdles associated with the Illumina system. He cited the six-hour data-transfer time per run as one obstacle, and added that these files then take a full day to analyze on a 10-Xeon cluster.
McCombie’s group at CSHL currently has five Illumina sequencers and plans to purchase two more, but he noted that the lab is already “struggling” with the 6 terabytes of data it generates each week.
He also noted that the system generates very large text files that are difficult to analyze without a graphical front end. “The output is virtually unusable unless you’re a programmer,” he said.
CSHL is working with the consulting firm BioTeam to adapt its wikiLIMS technology to its sequencing workflow [BioInform 02-01-08]. McCombie noted in his presentation that BioTeam has developed a graphics package that translates the Genome Analyzer text files into bar charts, graphs, and other visual representations of the data. He said that his group at CSHL is currently using these tools to analyze quality control data from the Illumina instruments.
Research groups are also facing informatics challenges with 454’s next-gen sequencing technology. Savita Shanker, scientific director for sequencing at the University of Florida’s Interdisciplinary Center for Biotechnology Research, said that her group purchased 6 terabytes of storage capacity when it installed its 454 system in 2005, and that it is currently upgrading that to 60 terabytes because it has recently installed an ABI SOLiD. The lab also has an 80-CPU cluster that it uses to analyze data from the instruments.
Shanker said core labs considering buying a next-gen instrument should ensure they have “sufficient bioinformatics expertise and computational capacity.” She said that her team includes four full-time data-analysis staffers who focus primarily on annotation.
Ken Dewar, an assistant professor at the Genome Quebec Innovation Center at McGill University, said that it typically takes around six hours to capture an image on the 454 GS FLX, around seven hours for image analysis, three hours to transfer the data off the instrument, and then only around 10 minutes for assembly.
While the image files are around 14 gigabytes per run, Dewar noted that this is condensed during the analysis process so that the final assembly file is only around 4 megabytes per run. The image files should be temporary, he said, but noted that “we’re not always confident that we got [the analysis] right, so we keep them so that we can go back to them.”
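Tallying the approximate figures Dewar cited shows where the time goes in a single GS FLX run: the whole process takes about 16 hours, of which assembly is a small fraction, and the data shrinks roughly 3,500-fold from image files to final assembly. The short sketch below simply does that arithmetic; it assumes the stages run back to back and uses decimal units.

```python
# Approximate per-run stage durations in minutes, as cited by Dewar.
stages = {
    "image capture": 6 * 60,
    "image analysis": 7 * 60,
    "data transfer": 3 * 60,
    "assembly": 10,
}

# Assumption: the stages run sequentially, with no overlap.
total_min = sum(stages.values())
for name, minutes in stages.items():
    print(f"{name:15s} {minutes:4d} min  {minutes / total_min:5.1%}")
print(f"total: {total_min // 60} h {total_min % 60} min")

# Image files (about 14 GB) versus final assembly (about 4 MB).
reduction = (14 * 1000) / 4
print(f"reduction: about {reduction:.0f}x")
```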
Dewar said that the center currently employs two data analysts, but acknowledged that it’s “not enough” for the amount of next-gen data they are handling right now.
Vendors to the Rescue?
Software firms like Geospiza, SoftGenetics, and others are marketing their tools as a way to help these labs bridge the next-gen sequencing bioinformatics gap.
Fosnacht said that SoftGenetics’ GeneNext “fills a very large void” in the market. Core labs “are buying these machines, but then they wonder what to do with the data,” he said. He noted that ABRF attendees were particularly interested in analysis tools for de novo assembly, target assembly, and mutation discovery — capabilities that the initial version of GeneNext will offer.
SoftGenetics approaches software development as an “integrative process” in which it first develops “the basic engine” for analyzing data and then outlines the next steps for future versions of the software. In the case of GeneNext, Fosnacht said future versions will likely address reporting and data-management requirements for next-gen sequencing workflows, as well as emerging applications.
Another bioinformatics firm, DNAStar, also highlighted its next-gen sequence-analysis software at ABRF. The software, called SeqMan Genome Assembler, is designed to assemble Illumina, 454, and Sanger sequencing data, and to visualize it within a single environment.
DNAStar representatives at the conference said that they expect this capability to be of particular interest to the next-gen sequencing market because many labs are acquiring instruments from several vendors and would like to view the resulting data together in a single program. The software is integrated with the company’s Lasergene software for post-assembly analysis.
On the data-management side, dnaTools and BioTeam both marketed customized LIMS platforms that handle next-generation sequencing data, while Geospiza showcased FinchLab Next Gen Edition.
Smith told BioInform that Geospiza is taking a “more comprehensive approach” to next-gen data management than some of the custom solutions. Customers may find that custom LIMS tools “fill an early hole, but they’ll graduate into IT pretty quick,” he said.
In response to that anticipated demand, Geospiza recently signed an original equipment manufacturer agreement with Isilon Systems to integrate Isilon’s clustered storage technology with FinchLab Next Gen Edition.
The combined solution will include Isilon’s X-Series storage system. It will come with 7 terabytes of initial storage capacity, which is expected to scale to 1.6 petabytes.
Geospiza is also continuing a project it began several years ago to adapt HDF, a data format designed for large-scale scientific data sets, to sequence data [BioInform 10-31-05].
Smith said that Geospiza has completed feasibility studies for the new format, called BioHDF, primarily in the area of genotype analysis, and has developed a software tool called HDF View that allows users to view and edit HDF files. The company is currently working on an extended version of the standard that will support next-gen sequencing data.
HDF should be particularly applicable to high-throughput sequencing data, Smith said, because it is structured so that it compresses data “in chunks.” Rather than uncompressing the data from an entire run at once and re-analyzing it, he said that BioHDF should allow bioinformaticists to perform “incremental processing” on extremely large data sets.
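The idea of compressing “in chunks” can be illustrated with a minimal sketch. It does not use the HDF or BioHDF APIs; it uses Python's standard zlib module as a stand-in, and the function names, chunk size, and sample reads are all hypothetical. Each chunk is compressed independently, so retrieving one record requires decompressing only the chunk that contains it.

```python
import zlib

CHUNK = 1000  # records per chunk (arbitrary value, for illustration only)

def compress_in_chunks(records, chunk_size=CHUNK):
    """Compress a list of byte strings as independently compressed chunks."""
    chunks = []
    for start in range(0, len(records), chunk_size):
        block = b"\n".join(records[start:start + chunk_size])
        chunks.append(zlib.compress(block))
    return chunks

def read_record(chunks, index, chunk_size=CHUNK):
    """Fetch one record by decompressing only the chunk that holds it."""
    block = zlib.decompress(chunks[index // chunk_size])
    return block.split(b"\n")[index % chunk_size]

# Hypothetical stand-in for sequence reads (no real instrument format).
reads = [f"read{i}:ACGT".encode() for i in range(10_000)]
store = compress_in_chunks(reads)

print(len(store))                # number of compressed chunks
print(read_record(store, 4321))  # touches one chunk, not the whole data set
```

With a single compressed stream, by contrast, reaching a record near the end would generally mean decompressing everything before it, which is the cost that chunked layouts are meant to avoid.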
Smith said he expects BioHDF to complement other emerging formats for next-generation sequencing data, such as SRF, and noted that many of the concepts that are embedded in SRF could be implemented in BioHDF.