Second-generation sequencers have been on the market for several years now, but labs that have adopted these instruments are still facing traffic jams as they seek to move and share data, participants said at last week's Bio-IT World Conference and Expo in Boston.
Sequencing centers that need to submit large datasets to public databases or provide clients with their results are finding that the files are too large to transfer over most networks. As a result, most groups are saving the data to disk and shipping it via FedEx or UPS, or literally walking it down the hall to a collaborator, an approach some have dubbed "sneakernet."
As Matt Trunnell, the Broad Institute's manager of research computing, pointed out in a panel on next-generation sequencing data management and analysis, the "800-pound gorilla" at the institute is sequence data, making up 80 percent of the data he needs to manage, move, and store.
The Broad delivers all of its sequence data to the National Center for Biotechnology Information, but at the current pace of scale-up and technology development, Trunnell estimated that the institute might be submitting 10-20 terabytes per day by the end of the year.
The Internet doesn't work anymore for that kind of data trafficking, he said. "This model is starting to collapse under its own weight."
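To put that volume in perspective, a rough back-of-envelope calculation, using assumed link speeds and an assumed effective throughput rather than any figures Trunnell cited, shows why daily terabyte-scale submissions outrun a typical network connection.

```python
# Illustrative only: estimate how long a day's worth of submissions would take
# to move over a network link. The link speeds and the 70 percent effective-
# throughput figure are assumptions for the sake of the example.

def transfer_hours(terabytes, link_gbps, efficiency=0.7):
    """Hours needed to move `terabytes` over a `link_gbps` link at the given efficiency."""
    bits = terabytes * 8 * 1e12                      # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)  # effective bits per second
    return seconds / 3600

for tb in (10, 20):
    for gbps in (1, 10):
        print(f"{tb} TB over {gbps} Gbps: ~{transfer_hours(tb, gbps):.0f} hours")

# At these assumptions, 10 TB over a 1 Gbps link takes more than a full day,
# so submissions at that rate can never catch up without a much bigger pipe.
```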
Harvard University's George Church said in a separate panel discussion that sequencing technology had been following Moore's Law-like growth curves, but that since 2005, when second-generation sequencers first entered the market, it has been improving so quickly that its growth rate "could get ahead of computing."
Echoing these challenges, Trunnell said that while the Broad has established a large working store for data production and initial analysis, he believes that downstream analysis presents "a much larger problem" in terms of data management.
"We don't understand the dissemination strategy of the data," Trunnell said. In two years, there may be a new type of analysis that requires a different data management strategy, but IT groups have no way of accounting for that in the current data life cycle.
Co-panelist Melissa Kramer, scientific informatics analyst at Cold Spring Harbor Laboratory's Woodbury Genome Center, said that her colleagues sometimes return to older data, and that as new software tools are developed, it may become more common for researchers to revisit data and analyze it in new ways.
Triage My Data
The difficulty of triaging data came up repeatedly at the conference. As Trunnell explained, the Broad used to have one tier of storage that was consistently protected and backed up, but the "fast growth" of data makes that practice "virtually impossible" now.
He and his colleagues are now working on tools to catalog data, "assign dollar signs to the data," and clarify governance. With more than 1,000 file systems and 26 million folders, it is difficult to "assign ownership to that," and no single set of business rules can be applied to all data, he said.
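Trunnell did not describe the Broad's tooling in detail, but the kind of catalog record such an effort implies might look something like the sketch below; the fields, storage tiers, and dollar figures are illustrative assumptions, not the institute's actual schema.

```python
# Hypothetical catalog entry for one sequencing run: the point is to attach an
# owner, a cost, and a retention rule to each dataset so that business rules
# can vary per dataset. Fields and figures are invented for illustration.
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    path: str                # location on one of the many file systems
    owner: str               # group responsible for triage decisions
    size_tb: float
    tier: str                # e.g. "replicated", "single-copy", "scratch"
    monthly_cost_usd: float  # the "dollar sign" assigned to the data
    retention_policy: str

run = DatasetRecord(
    path="/seq/runs/2009-04/flowcell_42",
    owner="production-sequencing",
    size_tb=1.8,
    tier="single-copy",
    monthly_cost_usd=1.8 * 120,   # assumed cost per TB per month, illustrative
    retention_policy="delete raw intensities after 90 days",
)
print(f"{run.path}: {run.size_tb} TB, ~${run.monthly_cost_usd:.0f}/month, owner={run.owner}")
```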
Kramer said that CSHL, while smaller than the Broad, faces similar sequence data management challenges.
Her team has not been able to automate that triaging process and needs to build "flexibility" into its system to be able to "get scientists their data now and not three weeks from now," she said.
In collaborations, data exchange becomes a challenge, said Trunnell, forcing his team to resort to the "sneakernet" model of shipping disks to collaborators. That process does not mesh well with the approach he and his team are pursuing, which is to bring computing closer to researchers' sequence data.
As Chris Dagdigian, founding partner and director of technology at the BioTeam, pointed out in his keynote address, data triage discussions, which were formerly scientific "heresy" and limited to cost-sensitive industries, are on their way to becoming accepted practice in research. Scientists and IT staff must reach those decisions together, he recommended.
Triage might be less painful if scientists were to cut down on elements they don't really need, said Indresh Singh, manager of core informatics at the J. Craig Venter Institute, who spoke during the second-generation sequence data management panel.
He pointed out that some trafficking and storage issues with second-generation sequence data would vanish if labs did not keep image files from the sequencers. When queried by BioInform at the event, Singh said that step alone could cut the data "in half."
Trunnell agreed that the "field needs to grow up" in terms of data management, since keeping large amounts of orphaned data and unstructured data doesn't scale. However, many scientists still believe that they must store every piece of data associated with a sequencing experiment.
Complete Genomics' vice president of software Bruce Martin said on the panel that institutes should enforce data-triage policies that would help users define their needs.
At the same time, he said, storage vendors might also need to develop a "richer metadata file system" to help users manage their data. Trunnell agreed, noting that file system management for large-scale sequence data is "still wanting tools."
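None of the panelists pointed to a specific implementation, but one minimal way to attach richer metadata to files on an ordinary Linux file system is through extended attributes; the attribute names and values below are invented for illustration and are not a description of any vendor's product.

```python
# Sketch of tagging sequence files with searchable metadata using Linux
# extended attributes (the user.* namespace). Requires a file system with
# xattr support; attribute names and values here are purely illustrative.
import os

def tag(path, **attrs):
    """Attach user.* extended attributes to a file."""
    for key, value in attrs.items():
        os.setxattr(path, f"user.{key}", str(value).encode())

def read_tags(path):
    """Return all user.* attributes on a file as a dict."""
    return {name: os.getxattr(path, name).decode()
            for name in os.listxattr(path) if name.startswith("user.")}

tag("/data/runs/lane3.fastq",
    project="resequencing-pilot", owner="smith_lab",
    instrument="GA-II", retention="raw-reads-1yr")
print(read_tags("/data/runs/lane3.fastq"))
```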
Chris Blessington, Isilon's senior director of marketing and communications, told BioInform at the conference that his firm doesn't yet have an answer for this challenge, although he admitted it is "a crying need in this industry."
That functionality might evolve through partnerships with third-party developers, John Gallagher, director of product marketing at Isilon, told BioInform. "Over the last couple of years we have been pulling together very close relationships," Gallagher said, referring to the company's partnerships with Aspera, which develops wide-area file-transfer technology, and content reduction technology developer Ocarina.
Data in a Box
Complete Genomics' Martin said his firm, which offers human genome sequencing services, is currently generating over one petabyte of sequence data per year. Moving that data around and distilling it down "ends up being a lot of computation, a lot of network traffic, and bottlenecks can be anywhere from the network to the computer to the disk drives," depending on the system architecture.
Martin's team has developed an automated pipeline that moves data from one stage to the next within the company, but that changes once the data reaches the research world.
Complete Genomics delivers data to customers by packing boxes full of USB drives and shipping them off. He said the firm relies on a customer's ability to "ingest" the data, store it, and compute on it, though he added that the company is interested in helping the community develop data-management strategies for second-gen sequencing data. While there might be "some business opportunities" to store customer data, that is not part of the Complete Genomics business model, he said.
"There are really only two mechanisms we've identified" for delivering data, he said. "There's nothing brilliant here, there's electronic delivery via the net and burning physical media and using FedEx or UPS." Right now, he said, his firm burns hard drives for its early-access customers, performing quality control during analysis and also during data transfer using check-sums or hash coding.
Delivering via FedEx or UPS is "incredibly good" in terms of cost and throughput, he said. Electronic delivery will be "the ultimate winner" as networking costs drop, but for now it is not practical unless two computers in the same data center can "talk to each other over a very big pipe."
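Martin did not spell out how those transfer checks work, but checksum verification of a shipped dataset generally looks like the sketch below; the file paths, manifest format, and choice of MD5 are assumptions made for illustration.

```python
# Sketch of checksum-based verification of a shipped dataset: the sender
# records a digest for every file before it goes out the door, and the
# recipient recomputes the digests and compares. Paths and the use of MD5
# are illustrative assumptions.
import hashlib

def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Compute a digest in chunks so large sequence files never sit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest):
    """manifest maps file path -> digest recorded before shipping."""
    for path, expected in manifest.items():
        status = "OK" if file_digest(path) == expected else "CORRUPTED"
        print(f"{path}: {status}")

verify({"/delivery/genome_001/reads.fastq": "d41d8cd98f00b204e9800998ecf8427e"})
```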
The company plans to sequence 1,000 genomes in the second half of 2009 at a price of $5,000 per genome. "There [are] I/O bottlenecks all the way up and down [the process]," he said. Since the firm does not deliver the images associated with the sequence output, the bottlenecks are internal to the company, he said.
Down the line, as the instruments and their error rates improve, Martin said he believes customers will not want raw reads but only the assemblies, the information about genomic-level variation, and the genome structure.
"Frankly it's far more efficient in the long run to keep the DNA as a backup and redo the sequencing," he said, referring to cases in which there is enough sample to preserve. "It's the densest backup.
"If it's only a few thousand dollars, it's cheaper to redo it than to park all that data on hard drives for years," he added.
When Martin voiced this view on the panel, however, the Broad's Trunnell said that when scientists do need older data, it might only "seem" as if resequencing would be easier than storing the data, because in reality that option would likely run into scheduling conflicts, especially when a lab's sequencers are already running 24/7 on new sequencing jobs.
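Whether the trade-off favors resequencing or storage depends entirely on the numbers; the sketch below works through Martin's comparison with illustrative figures, and the per-genome data volume and storage cost are assumptions rather than anything cited on the panel.

```python
# Back-of-envelope comparison of resequencing a stored sample versus keeping
# the data on disk. The $5,000 genome matches the price cited for the
# company's 2009 program; the retained data volume and the fully loaded
# storage cost per TB per year are assumptions for illustration only.
RESEQUENCE_COST_USD = 5000
DATA_PER_GENOME_TB = 1.0          # assumed reads plus intermediates retained
STORAGE_COST_PER_TB_YEAR = 1500   # assumed hardware, power, admin, and backup

def storage_cost(years):
    return DATA_PER_GENOME_TB * STORAGE_COST_PER_TB_YEAR * years

for years in (1, 3, 5):
    cost = storage_cost(years)
    cheaper = "keep the data" if cost < RESEQUENCE_COST_USD else "resequence"
    print(f"after {years} yr: storage ~${cost:,.0f} vs ${RESEQUENCE_COST_USD:,} -> {cheaper}")

# Under these assumptions the costs cross over after roughly three years,
# which is the kind of break-even the "redo the sequencing" argument rests on.
```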
Scale-out Software
Software vendors, too, are exploring ways to address data bottlenecks in second-generation sequencing projects. As projects scale, software offerings will begin to differentiate themselves, Ron Ranauro, president and CEO of GenomeQuest, told BioInform at the conference.
Scientists in second-generation sequencing projects start out with data that "is surface-level information" for which data warehouses must be built, he said. "You quickly run into a scalability bottleneck."
One answer is to distribute the database, he said. GenomeQuest claims to offer a scalable solution to remedy data-trafficking bottlenecks because it combines storage, algorithms, and data and redistributes them across a cluster.
With early-access customers in academic labs, GenomeQuest is exploring a pay-as-you-go business model that avoids "the big upfront investment" in IT and database management, Ranauro said. Customers are charged per sequencing run, which allows users to "interrogate the data without having to re-program the pipeline," he said. They can sort, filter, and group the data based on alignment properties and "really get inside the assemblies."
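Ranauro did not describe the platform's internals, but the kind of query he described, filtering and grouping reads by their alignment properties instead of re-running a pipeline, can be sketched in a few lines; the record fields and thresholds below are invented for illustration and do not reflect GenomeQuest's actual data model.

```python
# Illustrative sketch of interrogating alignment records by their properties
# rather than re-programming an analysis pipeline. Fields and thresholds are
# invented for the example; this is not GenomeQuest's data model.
from collections import defaultdict

alignments = [
    {"read": "r1", "reference": "chr7",  "identity": 99.1, "length": 210, "gene": "BRAF"},
    {"read": "r2", "reference": "chr7",  "identity": 87.4, "length": 198, "gene": "BRAF"},
    {"read": "r3", "reference": "chr17", "identity": 98.2, "length": 230, "gene": "TP53"},
]

# Filter on an alignment property, then group the survivors by candidate gene.
high_confidence = [a for a in alignments if a["identity"] >= 95.0]
by_gene = defaultdict(list)
for a in high_confidence:
    by_gene[a["gene"]].append(a["read"])

for gene, reads in sorted(by_gene.items()):
    print(f"{gene}: {len(reads)} high-identity reads ({', '.join(reads)})")
```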
The GenomeQuest platform lets scientists do all-against-all genome comparisons without needing companion programmers and IT staff, he said. "You have to be able to demonstrate you can do it on a large scale, a whole genome, not variation detection on one gene but on 300 candidate genes," he said.
The tests with early-access customers have been running for about a year, Ranauro said. Labs can deploy the GenomeQuest platform locally, as a number of undisclosed customers do, or they can use it to deliver services to their own customers, as a core facility might.
One GenomeQuest customer is a research team at the plant pathology department at the University of California, Davis, studying healthy and diseased Syrah plants from California vineyards.
In a paper published in the online version of Virology on March 23, the UC Davis researchers describe how they sequenced 67 megabases of RNA with the Roche/454 instrument and applied a number of bioinformatics tools, including GenomeQuest's platform, to perform all-against-all comparisons against the genomes of known organisms.
Ranauro said he believes the approach the scientists took applies to human, plant, and other sequence data, and he hopes that many scientists will choose this option as sequencing prices drop and data volumes rise. "We're sort of betting, along with the rest of the industry, that this is an emerging technology," he said.
He acknowledged, however, that there is considerable competition in the software market for next-generation sequence analysis. "We all look like we're solving the same problem," he said.