NEW YORK (GenomeWeb) – The Broad Institute of MIT and Harvard today announced a five-year $25 million collaboration with Intel Corporation that they say will scale researchers' ability to analyze large quantities of genomic data from diverse and disparate sources.
As part of the collaboration, the partners have established the Intel-Broad Center for Genomic Data Engineering. Researchers and software engineers there will build, optimize, and share new tools and infrastructure that will help scientists integrate and process genomic data. They will also provide hardware recommendations for running genomic workloads using on-premise infrastructure as well as public and hybrid clouds. With the infrastructure, researchers will be able to run more data-intensive studies and generate results more quickly, as well as access datasets that may have been unavailable to them before.
"We plan to build out solutions that can work across different infrastructures to facilitate efficient processing of these growing data sets, and then make these tools openly available for researchers worldwide," Eric Banks, director of the data sciences and data engineering group at the Broad, said in a statement.
The current agreement builds on an existing collaboration between the partners that began nearly three years ago and has evolved and deepened over the years, according to the Broad Institute's Chief Data Officer Anthony Philippakis. "Intel brings this incredible amount of expertise in [terms of] hardware as well as scalable engineering [while] Broad brings a lot of expertise in genomics and genomic data science," he told GenomeWeb this week. "Over the last three years, our teams have continued to work together to develop a number of solutions to enable genomic data processing at scale."
For example, earlier this year, the Broad announced partnerships with several cloud-computing vendors to implement the Genome Analysis Toolkit (GATK) on their respective infrastructure. The institute said at the time that it was collaborating with Intel and other firms to build the next iteration of the GATK, version 4, including optimizing its performance and developing tools that would simplify the task of executing GATK on different clouds.
Their efforts included extending the capabilities of Cromwell, a workflow execution engine developed by the Broad for launching genomic pipelines on private and public clouds in a portable and reproducible manner. Intel also worked with the Broad to develop an improved method for storing and processing variant data called GenomicsDB. Previously, Intel released an optimized version of the GATK as one of the tools on its Optimized Code website through which it offers access to several popular open-source bioinformatics analysis tools that were optimized to run on Intel's Xeon processors.
Under the terms of the new collaboration, the partners will continue to develop GATK and the other aforementioned tools but they will also expand their scope to offer additional resources to the community. This award allows the partners to "expand what we were doing and opens up the opportunity for a number of new areas that we weren't even thinking about before," Philippakis said.
This includes providing reference architectures for a single-node, small data center, and public cloud infrastructure to help the community figure out what types of infrastructure are best suited to genomic workflows, he told GenomeWeb. These resources will offer recommendations for running genomic workloads on on-premise infrastructure as well as public and hybrid clouds. "That's a new area that we haven't done previously but something that we really feel can impact the scale of genomics and genomic data science," he said.
The partners will also work on developing faster implementations of the GATK, Cromwell, and GenomicsDB that are optimized for industry-standard Intel-based platforms. As part of these efforts, the partners will explore the use of advanced chipsets including field-programmable gate arrays, Philippakis said. However, even though there will be versions of the tools optimized for Intel hardware, that does not mean that they will only work with the company's systems, he noted. "At the Broad, we've always followed a very strong rule of non-exclusivity and I know that our partners at Intel fully embrace that," Philippakis said.
Kay Eron, general manager of health and life sciences for Intel's data center group, expressed similar sentiments. "We believe that [our] goal is best accomplished through this kind of non-exclusive licensing which allows many companies to use the innovation," she told GenomeWeb. "In some cases we recognize that there could be some exclusivity but we are really focused on ... using the non-exclusive licensing approach."
Lastly, the partners plan to develop tools that will allow healthcare providers, pharmaceutical companies, and academic research organizations to share and use complex, distributed, and often siloed datasets for research and discovery, drug discovery, clinical trial recruitment, and other use cases. As part of these efforts, they will evaluate the use of tools and techniques such as secure multi-party computation, Philippakis said. The partners plan to make many of the tools that they develop and improve available open-source. It is possible that some tools could have restricted access but it is too early to say whether or not that will be the case, Eron said.
In 2015, Intel partnered with researchers at the Oregon Health & Science University's Knight Cancer Institute to launch the Collaborative Cancer Cloud (CCC), a platform-as-a-service infrastructure that is designed to enable hospitals and research institutions to share private genomic, imaging, and clinical datasets without compromising the privacy of the contributing patients. In April, researchers at Dana-Farber Cancer Institute and the Ontario Institute for Cancer Research signed on to participate in pilot projects aimed at testing the CCC's efficacy.
Moving forward, Intel expects to continue to engage with the genomics community through its existing and future partnerships. "The long-term goal for us is to expand upon these tools to enable joint analytics with other large genomic research centers in a federated and secure model regardless of the location of the data," Eron told GenomeWeb. "This is a really exciting and huge opportunity for Intel so definitely we are focused on this."
For the Broad, this is one of several major collaborations announced in recent days. Earlier this week, the institute said it has partnered with IBM to launch a $50 million research initiative that will use IBM Watson's cognitive computing capabilities to study cancer drug resistance. As part of this study, researchers will analyze data from over 10,000 samples, one of the largest to date according to the partners, and they plan to make anonymized data from the studies available to the scientific community for research use.
In addition, the Broad in collaboration with the American Heart Association launched My Research Legacy, a secure website through which people can share health data with researchers looking for new ways to treat heart diseases and stroke. Broad and AHA also announced a pilot study that will focus on 2,000 people who survived a heart attack, stroke, atrial fibrillation, aortic dissection, or systolic heart failure when they were between the ages of 21 and 50. Through the website, these patients will be able to provide de-identified lifestyle, health, and genetic data, the partners said.
"We live in a time and place where a lot of great data scientists and data engineers are not just in academia but also in industry," Philippakis said. "It's been a great opportunity working with groups like Intel to bring new ideas and techniques into our community and this is something that we are very excited about moving forward."