NEW YORK (GenomeWeb) – After completing a proof-of-concept project with the National Cancer Institute's Frederick National Laboratory for Cancer Research (FNLCR), FedCentric Technologies now wants to use its graph-based analytics technology in real-world use cases starting with pediatric cancers and Zika virus studies.
The company is also mulling commercial offerings that will provide the life science market with hardware and graph-based analytics tools for querying and analyzing genomic data, Michael Atkins, director of FedCentric's bioscience division, told GenomeWeb.
FedCentric representatives told GenomeWeb that the company is already collaborating with FNLCR on some pediatric cancer data analyses and is seeking approval to work on other proprietary datasets including one from an unnamed institution. The company also hopes to be able to get some data from the NCI, Atkins said. He could not disclose specific details about the pediatric projects and the subtypes that the company wants to work with because some of its applications and approvals are still pending and it has signed non-disclosure agreements for other datasets.
The company also plans to pursue partnerships with agencies like the US Centers for Disease Control and the National Institute of Allergy and Infectious Diseases, as well as with academic institutions, on Zika virus studies, Pascal Girard, FedCentric's chief technology evangelist, said. Meanwhile, the company has started some internal studies on the virus, such as exploring evolutionary relationships between Zika and other viruses, Atkins said. FedCentric researchers are also comparing gene expression across viruses to try to understand how gene expression in Zika might link to microcephaly and other conditions.
FedCentric started working with FNLCR last June on a pilot project to test the feasibility and limitations of using its high-performance data analytics (HPDA) technology and graph-based analytics to study variation patterns in large quantities of cancer data. At this year's Bio-IT World Conference, FedCentric presented a poster describing the pilot and the graph architecture that the partners developed; it was judged best poster at the meeting.
For the study, the researchers built a graph model and populated it with gene variant and population data gleaned from several open source public databases including SNPs from the 1000 Genomes project, phenotype information from ClinVar, and amino acid changes from UniProt. The final structure had 180 million nodes and nearly 12 billion edges, according to FedCentric's poster. It included variants and annotations mapped to reference genomic locations as well as all chromosomes and genomic locations.
Phase I of the project focused on running simple queries such as locating information on a single variant or finding all variants associated with specific clinical phenotypes. The researchers also assessed performance speeds and ingestion times for the graph structure. According to the poster, FedCentric's graph model had similar speed and performance to a relational database for single variant searches — query times were in milliseconds in both cases.
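To make the two Phase I query types concrete, the following is a minimal Python sketch of a property graph holding variants, phenotypes, and samples as nodes, with typed edges between them. It is illustrative only and does not represent FedCentric's implementation: the `VariantGraph` class, the relation names `associated_with` and `observed_in`, and every identifier such as `var:1` or `pheno:A` are hypothetical placeholders, not real records from 1000 Genomes, ClinVar, or UniProt.

```python
from collections import defaultdict


class VariantGraph:
    """Tiny in-memory property graph: attributed nodes, typed undirected edges."""

    def __init__(self):
        self.nodes = {}                 # node id -> attribute dict
        self.edges = defaultdict(set)   # node id -> {(relation, neighbor id)}

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, source, relation, target):
        # Store the edge in both directions so it can be walked from either end.
        self.edges[source].add((relation, target))
        self.edges[target].add((relation, source))

    def neighbors(self, node_id, relation):
        return sorted(t for r, t in self.edges.get(node_id, ()) if r == relation)


def lookup_variant(graph, variant_id):
    """Single-variant query: its attributes plus directly linked annotations."""
    return {
        "attributes": graph.nodes[variant_id],
        "phenotypes": graph.neighbors(variant_id, "associated_with"),
        "samples": graph.neighbors(variant_id, "observed_in"),
    }


def variants_for_phenotype(graph, phenotype_id):
    """All variants linked to one clinical phenotype (a one-hop traversal)."""
    return graph.neighbors(phenotype_id, "associated_with")


# Hypothetical toy data; positions and IDs are made up for illustration.
graph = VariantGraph()
for vid, pos in [("var:1", 100), ("var:2", 250), ("var:3", 900)]:
    graph.add_node(vid, kind="variant", chromosome="chr1", position=pos)
graph.add_node("pheno:A", kind="phenotype")
graph.add_node("pheno:B", kind="phenotype")
graph.add_node("sample:S1", kind="sample")
graph.add_edge("var:1", "associated_with", "pheno:A")
graph.add_edge("var:3", "associated_with", "pheno:A")
graph.add_edge("var:2", "associated_with", "pheno:B")
graph.add_edge("var:1", "observed_in", "sample:S1")

print(lookup_variant(graph, "var:1"))
print(variants_for_phenotype(graph, "pheno:A"))
```

Both queries touch only a node and its immediate neighbors, which is consistent with the poster's finding that a graph and a relational database perform similarly on lookups of this kind.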
Phase II centered on more complex queries, such as finding closely related individuals by comparing patient profiles or finding population clusters by comparing annotation profiles. With help from an external team of mathematicians and data scientists from the Massachusetts Institute of Technology and Carnegie Mellon University, the researchers used spectral clustering and k-means analysis to look at patterns in the combined datasets and to categorize cohorts, Atkins said. He said that they were able to run these kinds of queries in under two minutes. There are no comparable performance times because FNLCR had not been able to run these sorts of searches with its relational databases.
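As a rough illustration of how spectral clustering followed by k-means can group individuals by profile similarity, here is a small pure-Python sketch. It is not the method the FNLCR team used, whose details the article does not give. It assumes Jaccard similarity between variant sets as the edge weight, approximates the Fiedler vector of the graph Laplacian by power iteration, and splits samples into two groups with one-dimensional k-means. The sample names and variant IDs are invented.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of variant IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fiedler_vector(weights, iterations=500):
    """Approximate the Laplacian eigenvector of the second-smallest eigenvalue.

    Runs power iteration on (c*I - L) while projecting out the constant
    vector, which is the eigenvector of the smallest eigenvalue (zero).
    """
    n = len(weights)
    degrees = [sum(row) for row in weights]
    c = 2.0 * max(degrees)              # upper bound on L's largest eigenvalue
    v = [float(i) for i in range(n)]    # deterministic, non-constant start
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]       # remove the constant component
        lv = [degrees[i] * v[i] - sum(weights[i][j] * v[j] for j in range(n))
              for i in range(n)]        # L @ v, with L = D - W
        v = [c * v[i] - lv[i] for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v


def two_means_1d(values, iterations=20):
    """k-means with k=2 on scalars, seeded at the minimum and maximum."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(x - lo) <= abs(x - hi) else 1 for x in values]
        group0 = [x for x, lab in zip(values, labels) if lab == 0]
        group1 = [x for x, lab in zip(values, labels) if lab == 1]
        if group0:
            lo = sum(group0) / len(group0)
        if group1:
            hi = sum(group1) / len(group1)
    return labels


# Hypothetical profiles: sample -> set of variant IDs (all made up).
profiles = {
    "A": {"v1", "v2", "v3", "v4"},
    "B": {"v1", "v2", "v3", "v5"},
    "C": {"v1", "v2", "v4", "v5"},
    "D": {"v1", "v6", "v7", "v8"},
    "E": {"v6", "v7", "v8", "v9"},
    "F": {"v6", "v7", "v9", "v10"},
}
names = sorted(profiles)
weights = [[0.0 if a == b else jaccard(profiles[a], profiles[b]) for b in names]
           for a in names]

labels = two_means_1d(fiedler_vector(weights))
clusters = dict(zip(names, labels))
print(clusters)
```

On this toy input the samples A, B, and C fall into one group and D, E, and F into the other. A production system at the scale described would rely on sparse eigensolvers rather than a dense power iteration, and would typically use more than one eigenvector when more than two clusters are expected.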
"Everyone knew a priori what to expect because you would expect that certain gene types would cluster with [other] gene types. Whether the model and the graph technology could do [it] was what was in question," Atkins said. "To be able to do spectral clustering of something that size in two minutes is significant."
The researchers have also optimized the system in the last several months including reducing its memory utilization by two orders of magnitude. They have also improved its data ingestion speed by about the same amount, Girard added.
Having demonstrated that its graph models work as intended, FedCentric wants to provide its technology to researchers working in other disease areas to help them query data in new ways. The company chose to work on pediatric cancer and Zika first in part because of pressing needs in these areas for more effective treatments.
Pediatric cancer researchers want to minimize the stress and trauma of repeated testing and treatment that children with cancer go through, Atkins noted. Pediatric oncology is also of particular interest to the company's FNLCR partners and something that they were keen to work on next, he added. Meanwhile, the current Zika outbreak has raised urgent questions about the virus' genomics and epidemiology and these are things that "lend themselves to big data analytics, which is what we do," he said.
However, FedCentric is open to projects in other disease areas that potential clients find interesting. Atkins said that the company hopes to secure contracts in the coming months to work on various research projects with large genome centers and other agencies. It is also seeking additional funding from various government contractors and private sources to support further technology development and prepare for commercialization. It hopes to raise a total of $5 million this year to push these efforts forward, Atkins said. It is not yet seeking venture funding, in part because it wants to gather more supporting evidence for its approach, but it could be open to that option in the future, he added.
Within a year, FedCentric hopes to have products ready for the market including a combined hardware and software solution and a cloud-based software option, Atkins said. It will offer general versions of these solutions as well as custom work for specific clients. Exactly when these products will launch will depend in part on whether or not the company is able to secure the funds it needs for development, he said. Pricing is also still being determined at this point.
The company already has partnerships with hardware vendors SGI and Intel, and it hopes to have a deal in place with an unnamed sequencing vendor, Atkins said. He also said that FedCentric could potentially spin off its 15-person biosciences division as a separate subsidiary under a different name to indicate its focus on genomics. But for right now the focus is on technology development, he said.
There will be competition in terms of the analytics from companies like Seven Bridges Genomics and Spiral Genetics, both of which have graph-based software solutions. Where FedCentric differentiates itself is in its expertise in building optimized hardware solutions and in tailoring its software solutions to work with its hardware, Atkins said. The company has a $2 million supercomputing lab outfitted with two supercomputers including an SGI UV300 system with over 1,100 cores and 64 terabytes of shared memory. These systems support its work with FNLCR as well as other projects. It also has hardware systems experts on staff who previously worked for supercomputing vendors SGI and Cray.
Genomics is a new area for FedCentric but the company has built big data analytics systems for other industries such as the intelligence community. For example, in 2013, it won a $16.7 million contract from the United States Postal Service to provide a 48-rack high-density supercomputing solution and software for processing streaming mail piece data. "We can do very similar types of things for large organizations in the genomics space like the Broad or the New York Genome Center," Atkins said. "We have the engineers and architects who are capable of customizing a hardware solution that our analytics are built on top of and our graph-based analytics, in particular, are extremely fast as a result."
FedCentric will offer a cloud-based software option on Amazon Web Services for those customers who want it. Atkins said that FedCentric and Amazon are discussing the details of how FedCentric could offer its services in the cloud. Here the company will compete with both Seven Bridges, which has installed its platform on Amazon and Google clouds, and Spiral, which has installed its platform on Microsoft's Azure cloud.
However, Atkins believes that FedCentric's local alternative will have a lot of traction in the space especially with larger centers that do not see the cloud working for them because of the impracticality of moving large quantities of genomic data into the cloud.
"You literally have to ship discs to the [vendor] to put them on the cloud and you would have to do that regularly because they are updated constantly," he said. "That's not going to work for many of the large-scale applications that they are doing [and] they are going to have to go back to developing on-premise solutions." In addition, some organizations have proprietary datasets and sensitive information that they will not want placed in public clouds for fear of exposure, he noted.