NEW YORK (GenomeWeb) – Bioinformatics companies Spiral Genetics and Curoverse have signed agreements with Microsoft to make their respective bioinformatics platforms available on the company's Azure cloud.
Specifically, Spiral Genetics has deployed BioGraph, its proprietary method of compressing and querying large quantities of next-generation sequencing data, on the Azure cloud. Curoverse, meanwhile, is making Arvados, its open source platform for managing, processing, and sharing genomic and biomedical data, available on the Microsoft cloud. Curoverse is also collaborating with Microsoft to develop shared tools for benchmarking genomics analysis pipelines, which it hopes to release in April.
This is the first time that Spiral's BioGraph solution will be available on the cloud, keeping a promise the company made when it launched the method late last year. Spiral's CEO Adina Mangubat said at the time that the company would offer both cloud and local installations of BioGraph.
Spiral developed BioGraph in collaboration with Baylor College of Medicine. The solution features a merged graph storage structure and a specialized index that helps users manage and query NGS data quickly and accurately. Users can search individual genomes and groups of genomes, and they can compare multiple genomes to identify differences between them, such as structural variants.
Moving BioGraph to the cloud was crucial for a number of reasons, Mangubat told GenomeWeb this week. Many centers generate large quantities of data and store them internally, mostly for security reasons. But they also need to be able to share those datasets with collaborators working in multiple centers.
BioGraph is designed to reduce the data footprint, and coupling it with cloud infrastructure helps address the data movement challenge, she said. Furthermore, Microsoft is an ideal cloud partner because the company "has really focused strongly on all the challenges around data security and HIPAA compliance issues and all of those kinds of things that are really required for data of this kind of sensitivity," she said. "Azure is willing to take on liability for data in a way that we just haven't seen from other cloud providers."
Spiral plans to launch a beta version of BioGraph on the cloud in April that will provide an early iteration of the solution for users to try out at a discounted rate and provide feedback. The beta release will feature all of the algorithms and functionality that will be available in the final solution but will lack the "fit and finish" of the final release, Mangubat said.
Spiral will also release a number of general-purpose queries along with the beta. These will allow researchers to, for example, search for known variants associated with given disorders, such as gene fusions or structural variants, by sequence or by location across hundreds of individuals at once, Mangubat said. Other sample queries that will be available at the beta release would let researchers search large cohorts for putative de novo variants associated with particular childhood disorders, or search for sequences that have a particular range of allele frequencies in both cases and controls, she said. The company will also provide an application programming interface that lets users formulate their own research queries.
Spiral plans to demonstrate its cloud offering at a presentation during the Advances in Genome Biology and Technology conference in Orlando, Florida, this week. Specifically, the company will use it to search reads from more than 100 genomes, including 17 platinum genomes, at a rate of over 100,000 queries per second.
For its part, Curoverse decided to launch its software on Azure because of "clear demand" from customers, said CEO Adam Berrey. Some clients already had good institutional relationships with Microsoft and access to products and capabilities from the company that they wanted to be able to leverage for their genomics projects, he explained to GenomeWeb. Azure also offers a number of unique features that add to its appeal. For example, some users wanted more control over where their data was located. "Azure does a good job of keeping data in the country or region where you want the data kept," he said. "That was important to folks."
Adding to Azure's appeal is the fact that customers can rapidly deploy complex clusters on the infrastructure for projects. "A cluster computing environment like Arvados takes many servers to run the different capabilities so you have to spin up multiple virtual machines that run the different aspects of the Arvados system," according to Berrey. "Azure has some good capabilities for doing complex deployments like that."
Arvados' launch on Azure also fits with Curoverse's business strategy, Berrey added. Besides Azure, Arvados is available on commercial clouds such as Amazon Web Services and the Google cloud platform as well as open cloud platforms like OpenStack.
"We don't have a SaaS model so we are not like DNAnexus or Seven Bridges where we run a central service [that] people load their data into," he said. "If you have an institutional relationship with Microsoft Azure and you've established pricing for that, we'll work with you to install Arvados into your Azure account and then you'll be able to load all your data into it and use all the features and functionality of Arvados."
In addition to retaining control over the data, customers also get "100 percent transparency" on pricing, Berrey added. "Curoverse sells support subscriptions so you pay us and we will maintain the software, provide technical and end user support and run the cluster for you." However, unlike existing SaaS vendors who roll in cloud costs along with software fees, "our rates are independent of what you pay for cloud storage and compute," he noted. There may be variations in pricing, but those changes are driven by the complexity and scale of the projects in question and by the number of people who need support, not by the cloud infrastructure itself.
Besides making its software available on Azure, Curoverse is collaborating with Microsoft to develop new genomic pipeline benchmarking tools that will also be available on Azure. These planned tools will be designed to address two main issues affecting genomic analysis pipelines, namely the performance of the tools and the accuracy and quality of their results, according to Berrey. "We are working with Microsoft Research on a set of tools that will be open source and [available to] anyone ... to do more effective benchmarking and be able to more effectively compare pipelines with each other and see if the results they are getting ... are consistent," he said.
The partners plan to release their benchmarking tools at the end of April. They will implement them in the Common Workflow Language, which provides standardized specifications for describing tools and workflows in a way that makes it possible to run them on different compute platforms and environments that support the standard, including Arvados.
GenomeWeb reached out to Microsoft for comment about these partnerships and its broader plans for the genomics space, where it competes with the two main providers, Amazon and Google Genomics.
Microsoft declined to comment for this article but did say in an email to GenomeWeb that it intends to continue supporting genomics research. Last year, the Genomics Institute at the University of California, Santa Cruz said that it was collaborating with Microsoft's research division to use Azure to analyze data from a number of ongoing genomics projects aimed at effectively diagnosing and treating cancer and other diseases.
Microsoft scientist David Heckerman said in a statement that the Azure cloud is the "ideal platform for genomic analysis because it provides reliable and scalable compute resources with a selection of machine configurations" and "enterprise-grade data security."
Spiral's BioGraph fits well with Azure because it lets users store genomes in a format with a small footprint, which allows collaborators and authorized users to explore datasets without having to move very large data files physically, Heckerman said. In addition, "as a service layer on top of Azure, Arvados accelerates scientific discovery and clinical diagnostics using genomic and health data."