By Vivien Marx
This article has been updated from a version posted April 21 to include additional comments from DNAnexus and outside commenters.
BOSTON — Aiming to meet the growing demand for low-cost, user-friendly informatics options for next-generation sequence analysis, Stanford University spin-out DNAnexus this week launched a cloud-based data management and functional genomics analysis service for users of Illumina and SOLiD sequencers.
The company announced the new offering, which uses the Amazon cloud infrastructure to analyze and store data, at the annual Bio-IT World Conference and Expo amid continuing debate on the pros and cons of cloud-based bioinformatics options. At a next-generation sequencing and data-management workshop at the conference, a number of researchers said that they are experimenting with cloud-based options, though several reported hiccups along the way.
For example, Andi Broka, Linux system administrator at Boston University School of Medicine, pointed out that he and his colleagues have had some negative experiences with the cloud, such as not being able to access the necessary number of CPUs in an acceptable timeframe. In another case, he said, a large job "died" because the cloud computing provider was overloaded.
Andreas Sundquist, co-founder and CEO of DNAnexus, acknowledged that Amazon "sometimes" takes a long time to bring up nodes, but stressed that these experiences were the exception rather than the rule. He said that cloud computing offers scientists many advantages, particularly those facing second-gen sequencing bottlenecks.
The DNAnexus platform offers researchers computational resources only when they need them, he said in a presentation outlining the service. The "elasticity of the cloud" enables a scaled offering, he said. Scientists do not need an account with a cloud computing provider; instead, they can use their DNAnexus account and upload their data directly through the company.
The pricing is $95 per Illumina GA lane or $30 per gigabase. The price covers data transfer, analysis, visualization, export, and sharing, as well as indefinite storage, he said.
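Which of the two quoted prices works out cheaper depends on how much data a lane yields. A trivial illustration of the break-even point, derived only from the two figures above and not from DNAnexus's published terms:

```python
# Illustrative only: compares the two prices quoted in the article
# ($95 per lane vs. $30 per gigabase); not DNAnexus's published terms.
def cheaper_scheme(gigabases_per_lane):
    per_gb_cost = 30 * gigabases_per_lane
    if per_gb_cost > 95:
        return "per-lane ($95)"
    return f"per-gigabase (${per_gb_cost})"

# The break-even point is 95 / 30, about 3.2 Gb per lane.
print(cheaper_scheme(2))  # → per-gigabase ($60)
print(cheaper_scheme(4))  # → per-lane ($95)
```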
The platform is targeted at scientists without in-house bioinformatics resources and allows researchers to handle second-generation data without drawing on their own hardware, software, tools, or visualization methods. This allows researchers to work at a higher level of abstraction, away from the raw data, taking them to "the level of biology," Sundquist said.
The company offers a tool that allows users to plug their sequencers directly into an Ethernet connection in order to send data directly to DNAnexus when the run is complete. The DNAnexus platform performs all the analysis automatically and then notifies the user when the job is completed, Sundquist said.
The company is targeting users in core facilities or small labs with a few samples. Although these users could easily sign up for their own individual accounts on the Amazon Elastic Compute Cloud, they must still download tools and build a pipeline, and then tell Amazon where to store and fetch the data, Sundquist said.
By comparison, the DNAnexus infrastructure is built on top of EC2 and serves as a form of middleware between the user and the cloud, he said.
DNAnexus has developed its own algorithms and tools for direct sequence upload, read-mapping, visualization, and analysis. The current version of the platform includes tools for quantifying enrichment in genomic regions, ChIP-seq methods to help with transcription factor binding site discovery, and tools for mRNA-seq-based expression profiling.
In addition to Sundquist, DNAnexus co-founders include Stanford researchers Arend Sidow and Serafim Batzoglou. Sundquist said the company grew out of analysis challenges that the three of them were facing in their own work with second-generation sequencing data.
Sundquist told BioInform that most of the company's in-house tools are based on methods developed in the Sidow lab, such as the ChIP-seq method QuEST, or quantitative enrichment of sequence tags.
Sundquist said that DNAnexus decided to focus on the tasks that "we felt are the most computationally intensive," but the company is not trying to offer all the analysis that might eventually go into a journal submission. Scientists have many tools to choose from to perform that downstream analysis, he said. "We as a small company can't do all of that."
If researchers have a "favorite methodology," then DNAnexus might not be the right platform for them right now, he said. But he noted that he and his colleagues are looking to add more methods.
Sundquist said that many researchers are currently turning from Maq as their mapper of choice to Bowtie and BWA because they are a lot faster, but he noted that there are tradeoffs to reach that speed. "One of the things we have built into our read-mapping methodology is the ability to weigh all these mappings against one another," he said.
For example, Bowtie's default setting will return "the best mapping," he said, whereas the DNAnexus method looks "at all the places a read could potentially map," weighs them using a Bayesian technique, and gives the user a probability score for every mapping. This approach can help avoid calling spurious signatures in highly repetitive regions, for example.
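DNAnexus has not published the details of this method, but the general idea Sundquist describes can be sketched as follows, assuming a uniform prior over candidate locations and independent per-base sequencing errors; the error rate and read length below are illustrative assumptions, not company parameters:

```python
# Hypothetical sketch of weighing all candidate mappings for one read;
# not DNAnexus code. Assumes a uniform prior over candidate locations
# and independent per-base sequencing errors with rate `err`.

def mapping_probabilities(mismatch_counts, read_len=36, err=0.01):
    """Return a posterior probability for each candidate mapping location.

    mismatch_counts: number of mismatches at each candidate location.
    A location with k mismatches has likelihood
    err**k * (1 - err)**(read_len - k); normalizing across all
    candidates yields a probability score per mapping.
    """
    likelihoods = [
        (err ** k) * ((1 - err) ** (read_len - k)) for k in mismatch_counts
    ]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]

# A read that matches one locus perfectly and a repeat copy with two
# mismatches: nearly all probability mass lands on the perfect hit.
print([round(p, 4) for p in mapping_probabilities([0, 2])])  # → [0.9999, 0.0001]
```

Reads falling in repeats, where several locations have similar mismatch counts, would instead receive diluted scores, which is how such a scheme can flag ambiguous signatures.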
Initially the company is offering functional genomics analysis for uploaded datasets from Illumina's Genome Analyzer and HiSeq systems and Life Technologies' SOLiD.
Right now the platform does not support de novo assembly, but the firm is developing tools for SNP calling, miRNA detection, and structural variant analysis. Plans to support other platforms beyond Illumina and SOLiD are also underway.
There are other cloud-based platforms that offer scientists access to analytical tools, such as GenomeQuest's commercial product and the open source Galaxy. Sundquist said that systems like Galaxy are a good example of what can be accomplished using cloud computing, but noted that these approaches differ from that of DNAnexus because they are targeted more at "bioinformaticians at some level" with knowledge about which modules they need to put together to create a workflow.
"Our angle is to make things as simple as possible," he said. Another distinguishing factor for the DNAnexus offering is a focus on second-gen sequencing with "push-button analyses," he said. "We have built it for people who want to do billions of reads."
However, Anton Nekrutenko, associate professor of biochemistry and molecular biology at Penn State University and co-PI for Galaxy, questioned whether DNAnexus is offering ease of use at the expense of transparency.
Nekrutenko told BioInform via e-mail that software tools "need to enable transparent science" and questioned the value of a black-box offering like the DNAnexus system.
"The fact that it is not exactly clear how analyses are done is truly concerning," he said.
Nekrutenko acknowledged that many biologists face considerable challenges in analyzing second-generation sequencing data. A researcher with a dataset ready to go faces the dilemma of whether to "pay $90 to DNAnexus or spend a few days exploring existing open source tools, [in which case] you will probably choose to pay $90," he said. "But is this the appropriate way to do science?"
DNAnexus aims to make the Amazon cloud infrastructure accessible for biologists, but some early adopters of cloud computing remarked during the workshop that they've had mixed results with the approach.
For example, Giles Day, senior director of biotherapeutics informatics at Pfizer, said that his team is "investigating" cloud computing for second-generation sequence analysis, but noted that the input/output bottleneck remains a "big issue." One option he finds attractive is the concept of a "virtual private cloud," which can provide a virtual private network on a separate computational infrastructure.
Day and his colleagues face data-sharing challenges because some biotherapeutics R&D is in South San Francisco and some is in Cambridge, Mass. Rather than shipping hard drives with sequence data, the cloud-based strategy could help with "bucket-to-bucket" transfer, Day said.
Brent Richter, director of enterprise research for Partners Healthcare Systems, said that he and his colleagues have also experimented with "development work" on the Amazon cloud. He noted, however, that patient data cannot be handled and analyzed in that environment and must remain behind a firewall, in line with institutional review board approval.
He mentioned that he and his colleagues are exploring how to import Amazon machine images — pre-configured compute resources on the cloud that contain applications or data — into virtualization software.
Paul Rutherford, chief technology officer of storage vendor Isilon, said in his talk that for most scientists the "cloud will cost you more" than an in-house storage system. With transfer costs of approximately 10 cents per gigabyte, plus storage costs of 10 cents per gigabyte per month, costs can quickly mount into "serious" dimensions, he said.
Although organizations with "random peaks" of processing requirements might be able to leverage cloud computing to their advantage, second-gen sequencing is causing data growth that reaches beyond random peaks — particularly when it comes to storing the data, he said.
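Rutherford's arithmetic is easy to reproduce. A back-of-the-envelope sketch using the roughly 10-cent figures he cited; the monthly sequencer output is an assumed example, and real provider pricing varies:

```python
# Back-of-the-envelope cumulative cloud cost using the roughly 10-cent
# per-gigabyte figures Rutherford cited; real provider pricing varies.
TRANSFER_PER_GB = 0.10        # one-time charge per gigabyte moved in
STORAGE_PER_GB_MONTH = 0.10   # recurring charge per gigabyte per month

def cumulative_cost(gb_per_month, months):
    """Total cost when `gb_per_month` new gigabytes arrive every month
    and everything is kept for the whole horizon."""
    transfer = TRANSFER_PER_GB * gb_per_month * months
    # The batch ingested in month i is stored for months - i + 1 months,
    # so total storage is months * (months + 1) / 2 gigabyte-months
    # per gigabyte ingested monthly.
    storage = STORAGE_PER_GB_MONTH * gb_per_month * months * (months + 1) / 2
    return transfer + storage

# One instrument producing 500 GB a month: after two years the
# recurring storage term dwarfs the one-time transfer cost.
print(cumulative_cost(500, 24))  # → 16200.0
```

The storage term grows quadratically with time when data is never deleted, which is the sense in which steady sequencer output differs from the "random peaks" that suit the cloud.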
Rutherford cautioned that although cloud computing is an "option," it is "not utopia." Nevertheless, he acknowledged that "eventually" computing will occur in combined environments in which the user may not know if a given analysis is occurring locally or in the cloud.
James Cuff, director of research computing for life sciences at Harvard University, noted in his talk that as sequencers make increasing headway in labs, "server creep" can occur in which multiple labs each establish a sequencer and an attached server — a scenario that might make cloud computing a lower-cost option.
Cuff agreed with the approach to move jobs rather than data, but also cautioned that "queue creep" can cause bottlenecks for both in-house and cloud-based systems.
In his talk, Sundquist said that the cloud lets scientists access as many nodes as a given task might require, which could range from under ten nodes to a few hundred.
Sundquist said that there is a "misconception" that moving data around is difficult with cloud computing. Once data is on the cloud it can be moved "quickly" on high-speed interconnects, infrastructure that currently only large sequencing centers have at their disposal, he said. In this sense, the firm expects that its offering can help solve the next-gen sequence analysis bottleneck.
Scientists can also use the cloud-based system to manage and share data with colleagues, which is often a challenge with large datasets, he said. Users can upload their data from, for example, a file server, although they cannot stream it directly from a sequencing instrument, he said.