To help the biomedical community build on the information provided by projects like The Cancer Genome Atlas, the University of California, Santa Cruz is developing a database called the Biomedical Evidence Graph, or BMEG, that will capture and connect cancer genome analyses and interpretation results.
The five-year, $3.5 million effort, which is funded by the National Cancer Institute, will also run community-wide contests, similar to the Critical Assessment of Protein Structure Prediction (CASP) and other experiments, in order to build a pipeline composed of the most accurate algorithms for calling mutations, detecting fusion genes, and performing other kinds of analyses.
"The idea is to build a shared knowledgebase and create a playground where lots of researchers can interact, test their algorithms, and compare results," Joshua Stuart, an associate professor of biomolecular engineering, said in a statement.
It's also an attempt to bridge the gap between the petabytes of raw genomic information housed in centralized repositories such as UCSC's Cancer Genomic Hub — a $10 million petabyte-scale data repository that holds genomic and clinical data from several NCI-funded cancer genome research programs (BI 5/4/2012) — and higher levels of data interpretation by collecting evidence such as somatic mutations, structural variations, and pathway level information.
These repositories provide ready access to raw sequence, Stuart noted during an interview with BioInform, but the data still has to pass through several levels of analyses, such as variant calling and pathway analysis, before it can be used to make clinically useful predictions like which drugs would be most effective against particular tumors.
"This proposal is trying to sit between the outcomes and the raw sequence data" by capturing "all the information from lots of groups, all of our best work" in a single resource so that "we can all make use and leverage and bootstrap off each other," he said.
The project will also attempt to standardize tools, first for basic cancer genome analysis, such as mutation calling, and then work its way up to tools for predicting outcomes like drug response and patient survival. Setting standards will ensure consistency in the data that are collected and shared through BMEG and prevent errors from being propagated up the analysis chain.
Currently, many researchers use internally built tools and algorithms that often produce varying results. For example, Stuart said that there are more than a dozen algorithms used by different institutions to call cancer mutations, all of which give different answers. "That was a big shocker to me when the TCGA and International Cancer Genome consortiums started comparing the algorithms used by different institutions," he said. "Only in the last year have we sorted that out, and … created a unified effort to identify mutations for TCGA."
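The unified effort Stuart describes typically reconciles discordant callers by voting. A minimal sketch of that idea, with illustrative caller names and variant coordinates (not TCGA's actual pipeline), might look like this:

```python
# Minimal sketch (caller names and variants illustrative): majority-vote
# consensus across several mutation callers, each reporting a set of
# candidate somatic variants as (chrom, pos, ref, alt) tuples.
from collections import Counter

def consensus_calls(caller_results, min_callers=2):
    """Keep variants reported by at least `min_callers` distinct callers."""
    counts = Counter(v for calls in caller_results.values() for v in set(calls))
    return {v for v, n in counts.items() if n >= min_callers}

caller_results = {
    "caller_a": [("chr17", 7578406, "C", "T"), ("chr12", 25398284, "C", "A")],
    "caller_b": [("chr17", 7578406, "C", "T")],
    "caller_c": [("chr17", 7578406, "C", "T"), ("chr7", 140453136, "A", "T")],
}

# Only the chr17 variant is reported by two or more callers.
print(consensus_calls(caller_results, min_callers=2))
```

Requiring agreement from multiple callers trades some sensitivity for precision, which is one reason blinded benchmarking of the individual callers matters.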
Part of the problem, he explained, was that some of the tools were adaptations of the methods used to call hereditary mutations. But tumors have "heterogeneity to them," he said. "They have different subclones and … mixture[s] of tumor and normal" cells that make the task of calling variants much harder than "doing variant detection in germline analysis," where the samples are more homogeneous.
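The effect of that heterogeneity can be seen in a back-of-the-envelope calculation. For a heterozygous variant in a diploid region, a germline caller expects roughly half the reads to carry the mutant allele, whereas a somatic mutation's expected variant allele fraction shrinks with tumor purity and with the fraction of tumor cells in the carrying subclone. A simplified sketch (assuming a diploid locus and no copy-number change):

```python
# Illustrative sketch of why tumor/normal mixtures and subclones make
# somatic calling harder than germline calling. Assumes a diploid locus
# with no copy-number alteration.

def expected_somatic_vaf(purity, subclone_fraction=1.0):
    """Expected variant allele fraction of a heterozygous somatic SNV:
    only tumor cells belonging to the carrying subclone contribute one
    mutant allele out of two."""
    return 0.5 * purity * subclone_fraction

print(expected_somatic_vaf(1.0))        # pure, clonal tumor: ~0.5, germline-like
print(expected_somatic_vaf(0.6))        # 60% tumor content: ~0.3
print(expected_somatic_vaf(0.6, 0.4))   # subclone in 40% of tumor cells: ~0.12
```

At low purities or small subclone fractions the signal sinks toward sequencing-error rates, which is why germline-oriented callers, tuned for fractions near 0.5, miss many somatic events.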
To identify the best algorithms, the UCSC team is partnering with Sage Bionetworks — which is also helping to organize the Dialogue for Reverse Engineering Assessment and Methods challenge (BI 4/26/2013) — to run a series of blinded competitions that will compare algorithms for tasks like calling mutations, detecting fusion genes, and quantifying mRNA levels. The plan, Stuart said, is to run the first challenge, which will focus on mutation calling, within a year.
The winning algorithms will be incorporated into existing pipeline and workflow tools such as the Broad's Firehose — the infrastructure used to coordinate the tools used to analyze TCGA data — and will be made available to BMEG's users, he said.
Initially, the researchers will focus on analyzing TCGA data, but in the near future they will expand their scope to include other government-funded research projects, such as the Therapeutically Applicable Research to Generate Effective Treatments, or TARGET, program. They are also looking into incorporating data from the International Cancer Genome Consortium, Stuart said.
By providing "patient-level genome information," BMEG will complement parallel efforts like MedBook, an online platform designed to link patients, biopsy samples, doctors, and researchers in a social network framework, Stuart said. He and other colleagues developed MedBook under a separate $10 million, three-year prostate cancer initiative funded by Stand Up to Cancer, the Prostate Cancer Foundation, and the American Association for Cancer Research (BI 10/12/2012).
Like MedBook, BMEG uses the graph database structure adopted by social media networks like Facebook to represent and store the information it collects. "These graph databases scale really well" and they are also better for "connecting lots of evidence together" than traditional relational databases, Stuart said.
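The property-graph shape Stuart describes can be sketched in a few lines: analysis results become nodes, and labeled edges connect a sample to its variants and a variant to its gene. This is an illustrative toy, not BMEG's actual schema; the node IDs and edge labels are invented for the example:

```python
# Minimal sketch (schema and identifiers illustrative): evidence stored as
# a property graph of nodes and labeled edges, the structure graph
# databases use to connect samples, variants, genes, and pathways.

class EvidenceGraph:
    def __init__(self):
        self.nodes = {}    # node_id -> property dict
        self.edges = []    # (source_id, edge_label, target_id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def neighbors(self, node_id, label=None):
        """Follow outgoing edges, optionally restricted to one edge label."""
        return [dst for src, lbl, dst in self.edges
                if src == node_id and (label is None or lbl == label)]

g = EvidenceGraph()
g.add_node("sample:TCGA-01", type="biosample")
g.add_node("variant:TP53_R175H", type="somatic_mutation")
g.add_node("gene:TP53", type="gene")
g.add_edge("sample:TCGA-01", "has_variant", "variant:TP53_R175H")
g.add_edge("variant:TP53_R175H", "in_gene", "gene:TP53")

# Traverse from a sample to its called variants.
print(g.neighbors("sample:TCGA-01", "has_variant"))
```

Queries like "which samples carry mutations in this gene" then become short edge traversals rather than the multi-table joins a relational schema would need, which is the scaling advantage Stuart alludes to.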