NEW YORK (GenomeWeb) – The National Cancer Institute's Frederick National Laboratory for Cancer Research (FNLCR) and FedCentric Technologies are currently engaged in a pilot project to develop and test a graph-based analytics platform that will enable cancer researchers to query large genomic datasets and gauge the impact of genetic variants on various cancer types and subtypes.
The partners announced their collaboration earlier this month, but preliminary work on the pilot actually began in June, representatives from both FedCentric and FNLCR told GenomeWeb this week. Specifically, the pilot aimed to test the feasibility and limitations of using the technology to study patterns of variation in cancer across multiple samples, said Uma Mudunuri, manager of the core infrastructure and systems biology group in FNLCR's Advanced Biomedical Computing Center. It also sought to flesh out the nitty-gritty of running queries, including the best ways of ingesting data and parallelizing analyses to maximize speed and obtain optimal results, added Pascal Girard, chief technology evangelist for FedCentric.
The planned platform leverages FedCentric's high-performance data analytics (HPDA) technology and graph-based analytical tools to support research queries and return results in near real time. FedCentric's graph-based analytical tools are used for various applications across industries, including security and fraud prevention. In the context of cancer research, its tools let users do things like query specific portions of the genome for variants of interest, as well as aggregate and store data from multiple sources and individuals.
The graph-based approach fills in gaps left by traditional relational databases and similar systems, according to the partners. Relational databases can handle large dataset sizes and return rapid results to queries, but they are not suited for some kinds of research queries, Mudunuri explained.
For instance, next-generation sequencing technologies have made it possible to sequence thousands of tumor samples under the auspices of large multi-institutional projects, of which the Cancer Genome Atlas and the International Cancer Genome Consortium are prime examples, and to generate thousands of variants per sample, each with its own associated phenotypes. "To be able to fully grasp what that information is telling us, we realized that relational [databases were] going to be in no way sufficient," she said.
Besides supporting fast queries, FedCentric's technology was attractive because of its flexibility and capacity for growth with increasing dataset sizes, Jack Collins, the director of FNLCR's Advanced Biomedical Computing Center, told GenomeWeb. With increasing quantities of genomic and other kinds of data being generated and used at the center, a graph-based analytics approach seemed appropriate because of the ease with which new datasets could be integrated with existing data and with which the system could adapt to new kinds of data besides genomic data, he said. Collins also noted that this effort complements existing efforts within the community to improve on available informatics infrastructure for cancer research, including the ongoing NCI-funded cancer genome cloud pilots. Teams led by the Broad Institute, Seven Bridges Genomics, and the Institute for Systems Biology were contracted last year to develop platforms for the pilots.
Over the summer, FNLCR and FedCentric ran two development phases as part of the pilot. In the first phase, they focused on exploring the best ways to represent variant information and test-driving various graph-model options; determining how many nodes and edges were required; ensuring that the model scaled as dataset sizes grew; and checking that the performance of the graph-based approach was at least comparable with that of existing architectures such as relational databases, Mudunuri told GenomeWeb.
Phase II of the pilot, which is wrapping up right now, is focused on actually using the models to analyze patterns within individuals and across cohorts, trying to categorize cohorts based on identified patterns, and testing how quickly such analysis queries could be run, she said. Researchers are also combining results produced by the graphs with results from statistical analysis tools such as R to see if they can improve variant pattern detection.
The pilot focused on cancer data in general, but moving forward the researchers will begin to use the platform to look for patterns that are particular to specific cancer subtypes, Mudunuri said. Part of those efforts will involve looking into what insights can be gained from exploring different combinations of cancer information as well as asking different kinds of questions of the data, she said.
The partners are now seeking additional funding to support their efforts. FedCentric provided funding for several students from Georgetown University who worked as interns on the project over the summer, and also offered the use of its laboratory facilities and supercomputing system. Meantime, FNLCR researchers invested time and research expertise in the project.
Both parties are now actively exploring funding opportunities within the NCI and from external sources. FedCentric is competing in the Virginia Velocity business plan contest and, if it wins, will invest the funds in the FNLCR project, Girard told GenomeWeb. This is the company's first foray into the life sciences, and now that it has dipped its toe in, FedCentric plans to pursue additional targets in the cancer research space, he said. In fact, several commercial companies have already approached FedCentric to learn more about the technology underlying the FNLCR platform.
"Our focus is to eventually be able to productize and ... deliver a platform that can help the bioinformatics and cancer communities," he said. In the context of cancer, "there's plenty of incentive to do this kind of work." He also highlighted the collaborative nature of the project, which included input from government, academia, and industry, noting that it jives with ongoing efforts by agencies such as the National Science Foundation to boost collaboration between these three groups.
The pilot is expected to last about a year, after which the partners plan to release the first set of features and capabilities to the broader bioinformatics community.