NEW YORK (GenomeWeb) – The National Institutes of Health's National Human Genome Research Institute has awarded separate grants to two teams led by researchers at the Broad Institute and Johns Hopkins University to jointly build the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space (AnVIL), a resource for computing across large genomic and related datasets generated by NHGRI-funded centers and projects.
The Broad's team, led by Anthony Philippakis, the institute's chief data officer, was awarded $5 million, while the Johns Hopkins team led by James Taylor, an associate professor of biology and computer science, received $2.4 million. The researchers will use the funds to develop a cloud-based environment that will house unrestricted and controlled-access data and metadata from NHGRI's projects and provide compute pipelines and workflows for exploring and making sense of the data.
Full details on AnVIL are provided in a funding opportunity announcement that was issued in July 2017. Like other systems currently being developed across the NIH, AnVIL will co-locate data with storage and computing infrastructure, freeing researchers from the cumbersome task of moving large quantities of data for their use. It will include web interfaces and tools for investigators with extensive coding experience as well as those with limited coding experience. Researchers will also have the option to upload their own data and run their own software packages on the platform once it is built.
The Broad is using its FireCloud infrastructure for the project, Philippakis said in an interview. The list of collaborators on the grant includes the University of Chicago, Washington University in St. Louis, University of California, Santa Cruz, and Vanderbilt University. Built on the Google cloud, the Broad's FireCloud provides an open platform for analyzing genomic data. It includes tools and analysis pipelines stored as Docker images and offers access to Cromwell, the Broad's workflow execution engine for launching genomic pipelines on cloud platforms. Other components of AnVIL will include the Gen3 infrastructure developed by researchers at the University of Chicago to provide tools for submitting, tracking, and searching for different data types, as well as the Broad's Genome Analysis Toolkit.
The project also offers an opportunity to test-drive novel approaches for accessing controlled datasets, according to Philippakis. One of these is a software package called the Data Use Oversight System, which was developed by UCSC researchers to automate and facilitate restricted data use. "Many datasets will have restrictions like it can only be used for diabetes research, [so] we've done a lot of work to develop ontologies to make these [restrictions] machine readable," he explained. "We think that this could help to greatly streamline the process by which researchers access data. That's one of the things that we are trying to pilot in the AnVIL."
Other features of the AnVIL infrastructure will include containerized tools and workflows built by the Bioconda and BioContainers projects, according to JHU's Taylor. Researchers will be able to select containers with the tools and workflows that they want to use and then run these programs on their datasets.
In addition to members of the Johns Hopkins Data Science Lab, Taylor's team includes researchers from the Galaxy and Bioconductor projects. "A big part of the proposal and a piece that we are really going to be focusing on is the idea of a set of different entry points for users to work with the data that's in the AnVIL," he said in an interview. One of those entry points will be the Galaxy platform, but researchers could also access AnVIL through notebook environments such as Jupyter and RStudio. In addition, the developers are building application programming interfaces that will allow researchers to use their own tools with AnVIL data, he said.
Taylor's team is also leading efforts to develop training resources for AnVIL. "We want to train scientists to use AnVIL itself, to run their workflows there and use the data that's inside. But we also want to use AnVIL for teaching state-of-the-art bioinformatics and computational biology," he said. "The hope is that this can be a platform that lets people teach genome analysis in a very realistic way where they are actually using the kind of tools and the kind of data that one would be using in the real world when we are doing these kinds of genome analysis."
With those goals in mind, Taylor and his team are taking a multi-pronged approach to developing AnVIL's user training component. Specifically, they plan to develop Massive Open Online Courses (MOOCs) similar to those that members of the team have developed for the online education platform Coursera. These will be freely available to researchers who want to learn to use AnVIL as well as those who want to learn concepts from bioinformatics and computational biology. In addition, the team will offer direct training through in-person workshops as well as develop curriculum materials that third parties can use to run their own workshops. The group plans to develop curriculum materials centered on AnVIL that can be used for teaching undergraduate and graduate computational biology courses as well.
"We are going to be using a lot of lessons from [the Coursera] materials" but "also a lot of the lessons that we've learned in developing those courses on how to build MOOCs in a more scalable, more efficient way so that we can deliver more content," Taylor said. In addition, the team is putting together a framework that can take metadata about a given MOOC and generate much of the content in an automated manner. This should make it easier for third-party researchers to develop courses on their own.
NHGRI received a total of six applications in response to the initial AnVIL FOA. Initially, it planned to make a single $5 million award for the platform's development. However, after reviewing the completed submissions, "we realized that we didn't want an infrastructure that was just configured to one particular group," Ken Wiley, a program director in the NHGRI's Division of Genomic Medicine, said in an interview. Wiley is co-leading the initiative with Valentina Di Francesco, the program lead for the NHGRI's Computational Genomics and Data Science Program.
The two applications that were ultimately selected proposed resources and came with unique kinds of expertise that NHGRI felt would be valuable to include, Wiley noted. For example, researchers who have signed on to the Johns Hopkins application are involved in the development of tools like Galaxy and have expertise in Bioconductor, two resources that seemed like a good match for AnVIL, he said. The Broad-led team, on the other hand, has expertise in building out cloud-based applications and services that integrate well with existing resources.
Integration with existing resources is crucial because NHGRI expects that AnVIL will be a component of the NIH Data Commons, which aims to provide a cloud-based platform for researchers to store, share, access, and interact with data from biomedical and behavioral research. The program is part of the agency's Big Data to Knowledge initiative, which aims to make biomedical research data findable, accessible, interoperable, and reusable for researchers. It also includes the NCI Genomic Data Commons and the Cancer Genome Cloud pilots, which are intended to provide cancer research communities with repositories for sharing genomic and clinical data from various oncology studies.
In 2017, the NIH awarded $9 million to fund 12 projects as part of a pilot phase for the Data Commons initiative. One of the funded projects is a partnership between the Broad, UCSC, and University of Chicago. In this context, the partners are working together to build a platform that can handle a mix of heterogeneous data types including genomics, transcriptomics, and image data. The Broad also received a $7 million grant in 2014 to build one of the platforms for the Cancer Genome Cloud pilots.
The developers anticipate that there will be some costs to researchers associated with using the AnVIL resource. But it is not clear yet exactly what those costs will cover. According to information provided in the FOA, NHGRI expects that researchers will at least be responsible for costs related to computing, storing, and downloading data.
"Definitely there will be a cost for the compute, but we don't know if there will be a cost for the tools yet," Wiley said. "This is going to be an evolving process … but we hope that we can keep the costs at minimum."
The exact cost structure and billing model are being crafted by members of the two funded projects in collaboration with NHGRI and an external advisory committee appointed by the agency, according to the FOA.
Furthermore, the developers are still working out exactly which datasets will be included in AnVIL. Under the terms of the FOA, AnVIL's Data Steering Committee along with NHGRI researchers and the principal investigators on each project will prioritize which datasets to include in the resource. However, because AnVIL is being built with an eye towards supporting NHGRI's initiatives, the initial focus will be on datasets from programs that NHGRI is currently funding, Wiley said.
Candidate datasets include those from the NHGRI Centers for Common Disease Genomics and the Centers for Mendelian Genomics programs. The project roadmap, including which datasets will be available and when, is still being put together, but Philippakis said that the project is prioritizing access to the CCDG dataset.