NEW YORK (GenomeWeb) – The University of California, Santa Cruz has received $11 million in grant funding from the National Institutes of Health to support the establishment of a so-called Center for Big Data in Translational Genomics, a multi-institutional collaboration based at UCSC, and to develop tools that support and simplify genomic data sharing and use.
The primary goal of the center, according to David Haussler, a professor of biomolecular engineering and director of the UCSC Genomics Institute and one of the principal investigators on the NIH grant, will be to help the biomedical community share large genomic datasets and analyze them to better understand human health and disease. To enable these activities, he and his collaborators will "rework" existing informatics systems and methods of representing and handling genomic data to better support the increasing quantities of genomic data pouring out of past, present, and future large-scale research efforts. The list of collaborators includes researchers at UC San Francisco, UC Berkeley, Wellcome Trust Sanger Institute, Sage Bionetworks, Oregon Health and Science University, California Institute of Technology, the Ontario Institute for Cancer Research, King's College London, and McGill University.
Together, the partners will standardize protocols and tools for handling and sharing genomic data and then test and hone the fruits of these efforts in four research projects. These projects include the UK10K project, led by Richard Durbin of the Wellcome Trust Sanger Institute; the International Cancer Genome Consortium's (ICGC) pan-cancer whole-genome analysis project, an effort to study 2,000 whole-genome tumor/normal pairs that is co-led by Joshua Stuart at UCSC, Lincoln Stein at the Ontario Institute for Cancer Research, and others; the I-SPY 2 adaptive breast cancer trial, co-led by Laura van 't Veer at UCSF; and the Beat Acute Myeloid Leukemia therapy project, led by Brian Druker at Oregon Health and Science University.
Specifically, researchers affiliated with the UCSC center will develop "common application programming interfaces (APIs) for big genomics data in biomedicine that can be deployed in a broad range of commercial clouds, such as those provided by Amazon, Google, and Microsoft, as well as within private clouds," the abstract associated with the UCSC grant states. This will result in "a rich infrastructure for genomics software developers." They also intend to build a benchmarking platform for comparing methods used to analyze genomic data, hoping to "establish the best-of-breed methods, and force collective improvement across big data genomics," the researchers wrote. Moreover, the team plans to develop analysis tools on top of the APIs for tasks such as read mapping, variant analysis, transcript analysis, pathway analysis, and data visualization, the abstract states.
Rather than invent entirely new infrastructure for the BD2K project, Haussler and his team will build on resources that are already being created by the Global Alliance for Genomics and Health (GA4GH), he told BioInform in a conversation last week. Haussler co-chairs the GA4GH's data working group with the Sanger Institute's Richard Durbin, a senior group leader and acting head of computational genomics at the institute. The data working group is the arm of the international alliance focused on establishing and enhancing open standards and formats for storing and representing data, as well as an API that connects analysis tools and data.
In August, GA4GH's data working group released an updated version of its so-called Genomics API — version 0.5 — for testing. Among other things, the update features cleaner models, an easy-to-use data description schema, and a web-enabled interface. The Genomics API is one of the first products to be developed and distributed by the GA4GH. In an interview with BioInform at the time of the release, Haussler said that the group planned to launch a full version of the resource at a later date that would offer a more comprehensive list of features, including expanded descriptions of different kinds of variants and more standardized methods of representing them, as well as more standardized methods of recording and exchanging genetic metadata.
Some of the funds awarded under BD2K will go towards expanding those ongoing efforts and applying the tools they produce in projects such as the ICGC's pan-cancer genome analysis project, Haussler told BioInform last week. Besides Haussler, van 't Veer, director of applied genomics at UCSF's Comprehensive Cancer Center, and David Patterson, professor of computer science at UC Berkeley, are co-principal investigators on the BD2K grant.
One of the planned projects under the BD2K grant involves creating a more scalable implementation of the Reads API — one of the modules of the GA4GH's Genomics API — that can be used on up to a million genomes scattered across multiple databases, institutions, and countries, Haussler said. Such a resource would enable researchers to run queries across multiple databases without requiring that the data be located in a centralized repository, and it would also enable them to aggregate and share data for research and clinical use, he said. The API will re-represent reads in BAM files — currently the standard for storing and sharing sequences — in an abstract format that will make it simpler for researchers to query and extract information of interest from multiple BAM files efficiently, he said.
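The idea of querying reads across repositories without centralizing the underlying BAM files can be illustrated with a small sketch. The class and field names below (`Read`, `ReadServer`, `search_reads`, `federated_search`) are purely illustrative stand-ins, not the actual GA4GH schema, which is defined as JSON-over-HTTP endpoints; the sketch only shows the federation pattern — fan a range query out to every site and merge the overlapping reads.

```python
# Illustrative sketch of a federated read query in the spirit of the
# GA4GH Reads API. Each "server" stands in for one institution's read
# store (e.g. a BAM repository); a single query fans out to all of them.

from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Read:
    read_id: str
    reference_name: str   # e.g. "chr6"
    start: int            # 0-based alignment start
    sequence: str


class ReadServer:
    """One site's read store, queried in place rather than copied."""

    def __init__(self, reads: List[Read]):
        self._reads = reads

    def search_reads(self, reference_name: str, start: int, end: int) -> Iterator[Read]:
        # Yield reads overlapping the half-open interval [start, end).
        for r in self._reads:
            if (r.reference_name == reference_name
                    and r.start < end
                    and r.start + len(r.sequence) > start):
                yield r


def federated_search(servers: List[ReadServer], reference_name: str,
                     start: int, end: int) -> List[Read]:
    """Query every server and merge the hits, sorted by position."""
    hits: List[Read] = []
    for server in servers:
        hits.extend(server.search_reads(reference_name, start, end))
    return sorted(hits, key=lambda r: r.start)


site_a = ReadServer([Read("a1", "chr6", 100, "ACGTACGT"),
                     Read("a2", "chr7", 100, "ACGTACGT")])
site_b = ReadServer([Read("b1", "chr6", 104, "TTTTACGT")])

# Reads from both sites are returned, with no central copy of the data.
overlapping = federated_search([site_a, site_b], "chr6", 102, 110)
print([r.read_id for r in overlapping])  # ['a1', 'b1']
```

The point of the abstraction is that the caller sees one uniform query interface regardless of how many institutions hold the underlying files.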
A second project will emphasize developing standards for capturing genetic variants, with an eye towards addressing issues like inconsistent mapping methods and varying variant nomenclature as well as providing more effective ways of representing all the haplotypes that occur in the human genome, Haussler said. That's important because, for example, more than half of genome-wide association studies report that a region in the major histocompatibility complex is relevant for the disease they study, he said. But there are hundreds of alternative haplotypes in the MHC, making it difficult to establish a representative reference for this region in the human reference genome.
"We need to cope with the fact that the reference structure that we work with to capture standard coordinates of human variation is no longer linear; it's a graph, and that has enormous implications throughout the ecosystem of genetic analysis tools," he said. "We are hoping through the Global Alliance to steer [the community] through this transition and to set standards in that context." The planned variant module of the Genomics API leverages and expands on the well-established VCF file format but will be more flexible and precise in the way it captures variants.
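A minimal sketch can make the linear-versus-graph distinction concrete. In the toy model below — which is purely illustrative and not the GA4GH data model — sequence segments are nodes, edges join segments that can follow one another, and each haplotype is one path through the graph; a locus like the MHC with alternative haplotypes becomes a branch rather than a single stretch of the linear reference. All segment names and sequences are invented for the example.

```python
# Toy model of a graph-structured reference: sequence segments are nodes,
# directed edges join adjacent segments, and every haplotype is one path
# through the graph. Illustrative only -- not an actual reference schema.

from typing import Dict, List

# Node id -> bases. "ref_mhc" and "alt_mhc" are two alternative
# haplotypes spanning the same locus between shared flanking segments.
segments: Dict[str, str] = {
    "left":    "ACGT",
    "ref_mhc": "GGGG",   # haplotype carried by the linear reference
    "alt_mhc": "GTTG",   # an alternative haplotype of the same locus
    "right":   "TTCA",
}

# Directed edges: which segments may follow which.
edges: Dict[str, List[str]] = {
    "left":    ["ref_mhc", "alt_mhc"],
    "ref_mhc": ["right"],
    "alt_mhc": ["right"],
    "right":   [],
}


def all_haplotypes(start: str, end: str) -> List[str]:
    """Enumerate the sequence of every path from start to end."""
    if start == end:
        return [segments[end]]
    return [segments[start] + rest
            for nxt in edges[start]
            for rest in all_haplotypes(nxt, end)]


haplotypes = all_haplotypes("left", "right")
print(haplotypes)  # ['ACGTGGGGTTCA', 'ACGTGTTGTTCA']
```

Both haplotypes share coordinates on the flanking segments but diverge in between — which is why, as Haussler notes, moving from a linear reference to a graph ripples through every tool that assumes a single coordinate line.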
Lastly, the team will also work on standards for representing functional annotation of genomes and variants, he said. They'll build on work done with organizations such as the National Center for Biotechnology Information, the European Bioinformatics Institute, and others to make available standard reference genomes and gene nomenclature as well as set standard coordinates for genes, he said.
Furthermore, as genomics-based medicine inches closer to routine reality, the biomedical community needs to address the social infrastructure surrounding the sharing of genomic data, according to Haussler. "We need to develop the legal, ethical, and social organization of shared consent so that we can share and learn from DNA sequences without threatening the privacy of individuals," he said. That's one of the goals of the GA4GH, and it will be a focus of the UCSC-led BD2K effort. For this part of the project, Haussler and his BD2K colleagues will work alongside the clinical working arm of the GA4GH, whose stated goal is to enable compatible, readily accessible, and scalable approaches for sharing clinical data and linking genomic data. Essentially, "we are trying to go bottom-up, starting with raw DNA reads all the way through to patients in the clinic," he told BioInform.
UCSC's funding is one portion of a larger allocation made by the NIH to support its Big Data to Knowledge (BD2K) initiative, an effort to develop new strategies for analyzing and using rapidly increasing quantities of biomedical data. In total, the agency said last week that it's making an initial investment of nearly $32 million to establish 11 centers — including the one at UCSC — that will focus on specific data science challenges and develop methods, software, tools, and other resources to address them.
This early investment will also establish the BD2K-LINCS Perturbation Data Coordination and Integration center, which will be responsible for coordinating data and projects associated with the NIH's Library of Integrated Network-based Cellular Signatures (LINCS) program, an effort to characterize how various cells, tissues, and networks respond to disruption by drugs and other factors. Additionally, some funds will be used to create the BD2K Data Discovery Index Coordination Consortium (DDICC), a group that will work on community-based development of a biomedical data discovery index for the discovery, access, and citation of biomedical research datasets, as well as to support the education and training of data science researchers.
The data and informatics working group of the NIH's Advisory Committee to the Director first suggested the BD2K initiative in December 2012. At the time, Lawrence Tabak, NIH's principal deputy director and co-chair of the working group, explained to BioInform that the agency had decided to revisit its infrastructure for managing and sharing large biomedical datasets in order to accommodate the explosion of available research data as well as technological advances in areas such as genomics, imaging, and electronic health records. BD2K was one of several recommendations that the data working group proposed in its report to the Advisory Committee to the NIH Director.
The agency officially launched the program — which is supported with funds from all 27 NIH institutes and centers and the NIH Common Fund — a year later. It expects to invest a total of nearly $656 million in BD2K through 2020, pending available funds.
Although three of the initial pilot projects for the UCSC-led group are cancer-centric, studies focused on other kinds of genetic diseases could benefit from the tools as well, according to Haussler. "If you can build general informatics infrastructure for genomics in cancer, [which has] thousands of potential driver mutations and more than 1,000 targeted treatment compounds in the current drug development pipelines, then this general infrastructure will be adaptable to other disease areas without needing to be scaled up," he said in a statement.