NEW YORK (GenomeWeb) – Curoverse has launched a public beta program to test the first commercial products that it has created based on Arvados, an open-source software infrastructure platform for managing, processing, and sharing genomic and biomedical data that was initially developed by researchers at Harvard Medical School.
Curoverse has developed both cloud-based and on-premise implementations of the Arvados platform and it plans to launch these commercially in the second half of 2015, most likely in the third quarter of the year, according to CEO Adam Berrey. The company is already testing both products in private pilots in several medical and research institutions in the US and Europe, but it is now opening the platform up to testing by a much wider pool of potential customers through the public beta.
Both the cloud and on-premise implementations offer the same data management and processing capabilities, interfaces, and pipelines. For the beta, Curoverse will work with clients interested in trying out the local implementation of the platform to deploy and run the Arvados appliance on their internal clusters and servers. The company is working with hardware partner Intel, which is supporting the Arvados on-premise pilots, including by funding equipment that is being deployed at a number of participating institutions.
Meanwhile, participants interested in trying out the Curoverse cloud can sign up for free accounts on the company's cloud infrastructure. These free accounts will offer access to public datasets in the Curoverse infrastructure as well as to pipelines created by the company or by existing users. Beta testers will be able to upload and analyze their own datasets using the system as well as build, run, and share custom pipelines through the cloud.
A standard cloud beta account will come with one terabyte of storage and 100 hours of compute time per month for at least six months, regardless of when the platform is launched, the company said. Curoverse is also willing to work with clients who want to run larger pilots as part of the beta, Berrey said. However, larger pilots could incur costs depending on the size of the datasets and the compute power required. The exact costs will be determined on a case-by-case basis.
When both products go on the market later this year, cloud customers will have the option to pay monthly or annual subscription fees for access to the cloud-based version, which will cover storage, compute capacity, and other costs. Customers of the company's on-premise appliance, called Arvados Cluster, will also pay an annual subscription that will cover the support and operation costs of running the Arvados appliance. They'll also have the option to either install the system on their existing hardware, in which case they pay only the subscription costs, or to purchase a bundled hardware-software offering from Curoverse that will include the Arvados appliance preinstalled and optimized to run on Intel hardware. Customers of the bundled offering will pay for the initial hardware sale and then be charged for annual subscriptions.
Curoverse is not disclosing the exact costs for subscriptions to either of its commercial options.
Arvados was originally developed by HMS researchers to manage genomic and biomedical data being collected for research projects such as the Personal Genome Project. Curoverse first announced its plans to develop commercial products based on Arvados in late 2013 using funds from a $1.5 million seed round. It has spent the last year and a half engineering the platform and working with early customers to test and refine the solution, Berrey said. In developing the solution, "we stayed very consistent and true to our central strategy, which is to build infrastructure software for precision medicine, run it in the cloud and on premises, be open source, [and] enable sharing."
Arvados provides data management and processing tools that help users organize, manage, verify, and track terabytes to petabytes of data and run complex analytical workflows on elastic computing infrastructure in a consistently reproducible fashion, Jonathan Sheffi, VP, customer and business development at Curoverse, explained to GenomeWeb. It makes tools like the Broad's Genome Analysis Toolkit and other analysis pipelines easier and faster to use and manage, he said.
Arvados' data processing system, called Crunch, makes it easy to define pipelines and run jobs on distributed computing infrastructure, Berrey further explained. Crunch uses Docker containers to hold the various components of a given pipeline and define the run environments for each component, and it then provisions compute resources as needed to schedule and run compute jobs.
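To make the idea concrete, here is a minimal Python sketch of a pipeline whose components each name a container image and the components they depend on. It is an illustration only, not Crunch's actual pipeline format or API: the Component class, the image names, and the commands are all hypothetical, and the sketch only orders the steps and builds a standard "docker run" command line for each one. It does not provision compute resources.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Component:
    name: str
    image: str                      # hypothetical container image name
    command: list                   # command to run inside the container
    needs: list = field(default_factory=list)  # names of upstream components


def run_order(components):
    """Order components so each one runs after the components it depends on."""
    graph = {c.name: set(c.needs) for c in components}
    return list(TopologicalSorter(graph).static_order())


def docker_argv(component):
    """Build the command line that would run one component in its container."""
    return ["docker", "run", "--rm", component.image, *component.command]


# Hypothetical three-step pipeline, deliberately listed out of order.
pipeline = [
    Component("call_variants", "example/caller:1.0", ["call", "aligned.bam"], needs=["align"]),
    Component("align", "example/aligner:2.3", ["align", "reads.fastq"]),
    Component("report", "example/report:0.4", ["summarize", "calls.vcf"], needs=["call_variants"]),
]

order = run_order(pipeline)
by_name = {c.name: c for c in pipeline}
for name in order:
    print(" ".join(docker_argv(by_name[name])))
```

Pinning an image tag per component is what fixes the run environment for that step; a real scheduler would additionally decide where each container runs.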
Users can easily set up and run new projects within the system and they can also track active pipelines as well as view completed jobs. Datasets within Arvados are assigned unique identifiers for easy reference, and users can select and analyze subsets of larger datasets of interest independently without duplicating any data on disk, Sheffi said. For each analysis run, Arvados tracks and keeps records of all of the inputs, outputs, code versions, and the parameters used for each step of the pipeline, and it also keeps track of any changes that users make to the pipeline with each run — if they use a different parameter for one of the pipeline components for example — so analyses are reproducible.
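One common way to get unique dataset identifiers and subsets that do not duplicate data is content addressing, in which a block of data is stored once under a hash of its contents and a dataset is just a list of references to blocks. The Python sketch below illustrates that general technique under stated assumptions; the article does not describe Arvados' storage internals, and the BlockStore class, file names, and data here are hypothetical.

```python
import hashlib
import json


class BlockStore:
    """Toy content-addressed store: each block is kept once, keyed by its hash."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blocks[digest] = data  # storing identical bytes again changes nothing
        return digest


def collection_id(manifest: dict) -> str:
    """Identifier derived from the manifest (file name -> block hash)."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def subset(manifest: dict, names) -> dict:
    """New manifest that points at existing blocks; no data is copied."""
    return {name: manifest[name] for name in names}


store = BlockStore()
full = {
    "sample_a.fastq": store.put(b"ACGT" * 4),
    "sample_b.fastq": store.put(b"TTGA" * 4),
    "sample_c.fastq": store.put(b"GGCC" * 4),
}
part = subset(full, ["sample_a.fastq", "sample_c.fastq"])

print(collection_id(full))
print(collection_id(part))
print(len(store.blocks))  # still three blocks after creating the subset
```

Because an identifier is computed from content, recording it alongside the code version and parameters of a run pins down exactly which inputs that run used.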
The system comes with preinstalled pipelines that can be customized as needed, but users can also build their own pipelines from the different components available in the system. Arvados also generates a provenance graph at the end of analysis runs that lets users compare the different instances of the pipelines that they've run, highlighting the different parameters used as well as the different outputs.
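The comparison between pipeline instances can be pictured as a diff over recorded parameters. The following Python sketch assumes each run is recorded as a mapping from component name to that component's parameters; the record layout, component names, and values are hypothetical and are not taken from Arvados.

```python
def diff_params(run_a: dict, run_b: dict) -> dict:
    """Return {"component.param": (value_in_a, value_in_b)} for values that differ."""
    changes = {}
    for comp in sorted(set(run_a) | set(run_b)):
        params_a = run_a.get(comp, {})
        params_b = run_b.get(comp, {})
        for key in sorted(set(params_a) | set(params_b)):
            if params_a.get(key) != params_b.get(key):
                changes[f"{comp}.{key}"] = (params_a.get(key), params_b.get(key))
    return changes


# Two hypothetical runs of the same pipeline with one parameter changed.
run_1 = {"align": {"threads": 8, "seed_length": 19}, "call_variants": {"min_quality": 30}}
run_2 = {"align": {"threads": 8, "seed_length": 23}, "call_variants": {"min_quality": 30}}

changed = diff_params(run_1, run_2)
print(changed)
```

A parameter that exists in only one run shows up with None on the other side, so added or removed settings are reported as well as changed ones.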
In addition, the system provides mechanisms for securely sharing data and analytical pipelines within and between labs, and also for making them publicly available on the internet for anyone to download and use, even if they don't have accounts on the Curoverse cloud. That last option has been used to publicly share tools and data from the recent study of the microbiome of the New York City subway system run by Christopher Mason, an assistant professor in the department of physiology and biophysics at Weill Cornell Medical College's Institute for Computational Biomedicine. The company believes that this ability to publicly share data on the internet is one of the factors that distinguishes its offering from those of companies such as DNAnexus and Seven Bridges, which also offer cloud-based tools and resources, Berrey said.
Besides the public beta, Curoverse has been testing and evaluating Arvados in private pilots run at institutions such as Johns Hopkins University, Harvard Medical School, and the Wellcome Trust Sanger Institute (WTSI). Researchers at these and other institutions are evaluating the system for use in clinical research projects focused on specific diseases or sections of the genome as well as for providing consistent pipelines for tracking and analyzing targeted panel and whole-exome data for clinical diagnostics.
Researchers in the human genetics informatics department at WTSI, for example, became interested in Arvados because they needed a system that would enable them to manage data as well as computational and storage resources, Joshua Randall, senior scientific manager in Human Genetics Informatics at WTSI, told GenomeWeb. His team handles the informatics needs of WTSI's human genetics faculty, including NGS data processing and variant calling, as well as maintaining various analysis packages and shared resources such as scratch disks and storage.
"We were thinking about the problems that we have in dealing with these datasets and the challenges coming in the future as we experience more of a crunch in terms of sequence data production rate versus the cost of storage or ... the relative costs of those two things changing over time," he explained. His team initially planned to come up with their own data processing and management system but then came across Arvados and decided to try it out.
So far, the system has worked well for the WTSI researchers, and this early iteration offers several of the capabilities that were on the team's wish list with some room for growth and future development including stronger support for multiuser environments. Among other benefits, "the fact that it supports the Docker containerization is good because it means that we can reliably store the software that's being used in a way that we actually capture the environment much better than we ever have before," Randall said. The containerized approach also makes it possible to begin to make tradeoffs between storage and compute, he added. If, for example, it takes a day's worth of compute time to run a particular pipeline, it might make more sense to simply rerun the pipeline each time those outputs are needed rather than pay for storage space to store the results long term.
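The storage-versus-compute tradeoff comes down to simple arithmetic: keeping results costs storage for as long as they are retained, while discarding them costs a rerun each time they are needed. The Python sketch below works through one case. Every number in it is a made-up assumption chosen for illustration (output size, prices, retention period, number of reuses), not a figure from WTSI or Curoverse.

```python
def compare_costs(output_tb, storage_price_per_tb_month, months_kept,
                  compute_hours, compute_price_per_hour, expected_reuses):
    """Return (recompute_is_cheaper, storage_cost, recompute_cost)."""
    storage_cost = output_tb * storage_price_per_tb_month * months_kept
    recompute_cost = compute_hours * compute_price_per_hour * expected_reuses
    return recompute_cost < storage_cost, storage_cost, recompute_cost


# Hypothetical scenario: a 24-hour pipeline producing 2 TB of output,
# kept for three years and needed again three times in that period.
recompute, storage_cost, recompute_cost = compare_costs(
    output_tb=2,
    storage_price_per_tb_month=25,   # assumed price, not a quoted rate
    months_kept=36,
    compute_hours=24,
    compute_price_per_hour=1.5,      # assumed price, not a quoted rate
    expected_reuses=3,
)
print(recompute, storage_cost, recompute_cost)
```

The comparison is only meaningful if a rerun reproduces the same outputs, which is why capturing the software environment in a container matters for this kind of decision.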
The WTSI researchers have been running Arvados on their own cluster but will soon purchase the Intel hardware that's being offered along with the appliance, Randall said.
Although there are commercial iterations of Arvados, the code remains open source, and researchers have the option to download and implement the infrastructure themselves. Curoverse is preparing to release version 1.0 of the infrastructure this summer, before the launch of its commercial offerings.