By Julia Karow
As users of high-throughput sequencing platforms continue to struggle with managing and analyzing large amounts of sequence data, DNAnexus, a new Stanford University spinout, promises relief.
The Palo Alto, Calif.-based firm, which uses a cloud-based infrastructure to store customer data, recently launched web-based data management and functional genomics analysis services for users of Illumina and SOLiD sequencers. It hopes to attract customers based on both ease of use and pricing.
In the future, the company plans to expand its analysis types and to support additional sequencing platforms.
Andreas Sundquist, co-founder and CEO of DNAnexus, told In Sequence that the company is targeting individual researchers, as well as small and mid-sized next-gen sequencing facilities that find the IT and informatics associated with the data challenging.
"We know that there are a lot of people out there that have a whole bunch of data sitting around that they are struggling to analyze. Just having the ability to go to a website, upload it, and start doing useful things with it immediately would be great to get up as soon as possible," he said.
At many sequencing core facilities, data storage and file serving are "far from streamlined," he said. "We hope we can help a lot of the small and medium-sized sequencing centers offload all their informatics and IT worries onto DNAnexus."
Sundquist, who has a PhD in computer science from Stanford, and two Stanford professors — Serafim Batzoglou, an associate professor of computer science and his former advisor, and Arend Sidow, an associate professor of pathology and genetics — founded the company in early 2009 after realizing "that there is a big unaddressed need in sequence analysis" that could be addressed by a cloud-based infrastructure.
Last summer, the company, which currently has six full-time employees, raised $1.55 million in venture capital funding from lead investor First Round Capital, as well as K9 Ventures and SoftTech VC.
Customers upload their sequence data to the company, which uses Amazon's compute and storage infrastructure and a set of in-house algorithms to automatically analyze the data. After the reads are mapped, customers can view them in a Flash-based web browser and download results they are interested in — for example, tables of genes and their expression levels, or ChIP-seq peaks for a certain area on a chromosome — for further analysis. Since it's web-based, the service allows users in different locations to share data easily.
Uploading a typical FASTQ file from an Illumina sequencer, about 1 to 2 gigabytes in size, takes between 15 and 30 minutes "on a reasonable connection," according to Sundquist.
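As a rough check on what "a reasonable connection" means here, the quoted file sizes and upload times can be turned into an implied average upload speed. The sketch below is only back-of-the-envelope arithmetic; it assumes decimal gigabytes and ignores protocol overhead, and the function name is illustrative rather than anything DNAnexus provides.

```python
def implied_mbps(size_gb: float, minutes: float) -> float:
    """Average throughput in megabits per second needed to move
    size_gb (decimal gigabytes, an assumption) in the given minutes."""
    bits = size_gb * 1e9 * 8      # gigabytes -> bits
    seconds = minutes * 60
    return bits / seconds / 1e6   # bits per second -> Mbit/s


# Corner cases of the quoted ranges: 1 to 2 GB in 15 to 30 minutes.
for size_gb, minutes in [(1, 30), (1, 15), (2, 30), (2, 15)]:
    rate = implied_mbps(size_gb, minutes)
    print(f"{size_gb} GB in {minutes} min: {rate:.1f} Mbit/s")
```

Under these assumptions, the quoted figures correspond to sustained upload rates of roughly 4 to 18 megabits per second.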
The company recently started offering quality-control and functional genomics analyses of sequence data from Illumina and SOLiD platforms, including ChIP-seq, RNA-seq, 3'-end RNA sequencing, and DNase hypersensitive and restriction sites. Other analysis types will be added over time, with genomic variation discovery a top priority. Genome de novo assembly, on the other hand, is further down the list, since it is still challenging to automate entirely, Sundquist said.
All analysis tools and algorithms the company currently uses were developed either in house or at Stanford, and are described in white papers that are available to trial users from the company's website. "I think people will find that our methods are competitive with the methods that are out there," Sundquist said.
Some users might prefer to use other analysis tools, though, and the firm's long-term plan is to enable its platform to integrate third-party methods. "But we have to weigh that against the more integrated and seamless experience, where everything just works together," he added.
DNAnexus hopes to compete with companies offering bioinformatics tools for next-generation sequencing data — such as Geospiza, CLC Bio, GenomeQuest, and SoftGenetics — both on ease of use and price.
Visualizing the data is an "integral part" of the analysis, Sundquist explained, enabling users to assess the data quality and to zoom in on loci of interest in the context of the genome. "We are one of the first companies to do that in a fully web-based and cloud-based fashion. There is nothing you have to install, nothing that locally runs on your computer, there is no data that's transferred other than the data you are viewing in your browser."
The company charges for its service either per sample or per sequencing machine, offering volume discounts. For a single Illumina Genome Analyzer lane, prices currently range from $55 per lane to $95 per lane, and for a quarter SOLiD slide or a lane on Illumina's HiSeq2000, from $20 to $30 per gigabase. This includes data transfer, analysis, and indefinite storage using Amazon's storage service, though older data may be archived and not be as readily available for visualization. Customers using the machine-based pricing option can upload the data directly from their sequencing instrument to the company.
These prices, Sundquist said, represent "a small fraction" of the reagent costs to generate the data. "If you look at what more traditional companies have been charging for this type of service, I think you will find that this has a low barrier to entry," he said, because it does not require customers to purchase any hardware or software licenses.
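To put the quoted price ranges in concrete terms, the following sketch multiplies them out for two example workloads. It is an illustration only: it assumes prices scale linearly, does not model the volume discounts (whose structure is not described), and the 25-gigabase figure is a hypothetical run size, not a number from the company.

```python
def cost_range(units: float, low_price: float, high_price: float) -> tuple:
    """Return (lowest, highest) total cost for a number of billable units,
    assuming simple linear pricing with no volume discount."""
    return (units * low_price, units * high_price)


# A full eight-lane Genome Analyzer flow cell at $55 to $95 per lane.
ga_low, ga_high = cost_range(8, 55, 95)
print(f"8 GA lanes: ${ga_low} to ${ga_high}")

# A hypothetical 25-gigabase dataset at $20 to $30 per gigabase.
gb_low, gb_high = cost_range(25, 20, 30)
print(f"25 gigabases: ${gb_low} to ${gb_high}")
```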
The company decided to support Illumina and SOLiD sequencers first because they make up a large segment of the next-gen sequencing market and provide the highest throughput of sequence data. The need for analysis services seemed less urgent for 454's sequencing platform, which "has a much better tool chain," Sundquist said. In addition, many 454 customers use their instruments for microbial sequencing and metagenomics, he said, two applications that DNAnexus currently does not support.
So far, the company has no co-marketing agreements with Illumina or Life Technologies, but is in discussion with these vendors.
It also plans to support additional sequencing platforms in the future based on how many users they have. Ion Torrent Systems, for example, seems "a great fit," Sundquist said, because that system's anticipated low purchase price of around $50,000 matches DNAnexus' idea of low analysis costs.
Currently, DNAnexus can analyze sequence data from seven different organisms — human, chimpanzee, mouse, Drosophila, C. elegans, Arabidopsis, and baker's yeast — but plans to "constantly add new genomes," Sundquist said.
Bob Steen, director of the biopolymers core facility at Harvard Medical School, which currently has two Illumina GAII sequencers, said that several of his users — mostly those with little bioinformatics support — have tried out DNAnexus' service on an introductory basis and "have actually really enjoyed it."
He said his users fall into three categories: those who can work with raw sequence reads; those who need help setting up bioinformatics tools but can process the data themselves; and those who have no informatics resource available to them and need a lot of assistance to analyze the data. "That's the community that benefits the most from these web-based tools like DNAnexus," he said.
Daniela Kenzelmann Broz, a postdoctoral researcher in Laura Attardi's lab at Stanford, is one such researcher. Her project involves both ChIP-seq and 3'-RNA-seq experiments — with sequence data generated by the Sidow lab — but her group has little expertise in bioinformatics and IT. "Initially I thought I would have to form a collaboration for the data analysis," she told In Sequence. Also, the datasets were too large for her to store on her laptop computer.
DNAnexus has enabled her to analyze the data on her own. For example, "it allows me to try different parameters and see how it affects the results without having to work on the command line and know a lot about programming," she said. In the end, "you still need to know what you actually want to do with the data, but I guess that's why we are scientists."
Features worth adding, she said, would be an option to import published datasets into the browser for comparisons, and the display of additional information, such as conserved genome regions.
For a core facility, DNAnexus' service also translates into cost savings, both because the core does not need to expand its computing hardware, and because the need for bioinformaticians is reduced. "It's a huge savings in terms of infrastructure, if this is successful, and also personnel-wise," Steen said.
Besides DNAnexus, some of Steen's users have tried a web-based analysis service offered by GenomeQuest. The main difference between the two is that GenomeQuest enables customers to build their own data analysis workflows, but, unlike DNAnexus, it does not visualize the data in a browser.
While GenomeQuest may be more for "people who feel more comfortable wanting more control over the data analysis and the tools used in the analysis," with DNAnexus, users "get this nice analyzed dataset in a nice browser" but "you don't necessarily know the details of the software that's doing the various pieces," said Andrew Gagne, a software engineer at Harvard Medical School who works with Steen and his core facility users.
Steen said he will start recommending DNAnexus as one option for data storage and analysis to users of his core. But he is not yet ready to offload all the sequence data from his facility to the company. "It's something we are thinking about, and that could be of value, especially to small cores," he said. Uploading data from the sequencers straight to DNAnexus "would make things even easier," he said, but "you really get into issues of privacy and confidentiality."
But undoubtedly, services like DNAnexus' will help core facilities like his own. "All of these cloud-based and web-based initiatives and tools, from the point of view of a core lab director, are wonderful," Steen said.