NEW YORK (GenomeWeb) – This week Spiral Genetics launched a new graph-based technology, developed in collaboration with Baylor College of Medicine, that lets researchers compress and run research queries across large numbers of genomic samples quickly and efficiently.
The company's BioGraph product provides a novel method of indexing next-generation sequencing data that lets researchers run millions of queries a second, as well as a method of storing data that scales less than linearly with the number of samples being analyzed, according to the company. Spiral Genetics CEO Adina Mangubat told GenomeWeb that the company will offer both cloud-based and local installations of its BioGraph product and that it will charge a fee per sample analyzed. She said the company is not announcing an exact price point at this time but expects to release pricing details in the near future.
Spiral is currently working with collaborators at BCM's Human Genome Sequencing Center to use the query technology to explore structural variation across multiple samples as part of a pilot project. Spiral has approached other groups about pilot projects but is not disclosing who these potential partners are at this time.
William Salerno, a senior staff scientist at BCM, told GenomeWeb that the partners are setting up initial tests right now and have identified some datasets for this early stage, but they plan to apply the method to a large cohort that will have several hundred up to possibly a thousand samples. They hope to present some results from this much larger analysis at the Advances in Genome Biology and Technology Conference to be held in February next year.
According to Mangubat, BioGraph addresses an existing need for effective mechanisms for organizing and searching large quantities of genomic data quickly without compromising on the accuracy of the results. Researchers currently have access to thousands of samples and are able to run detailed studies that can span years, and they require effective tools for exploring and comparing samples, she said. Furthermore, researchers have to cope with rapidly changing technologies that could affect the results of their projects. For instance, a researcher might be 5,000 samples into a 20,000-sample project when an update to the variant caller being used for the analysis reveals variations that were not previously visible, she said. One option is to rerun the initial 5,000 samples through the updated variant caller, but that could take months of additional computation.
BioGraph is designed to make these sorts of computations feasible. Essentially, the approach includes a method of preprocessing input files and putting them into a graph format, which both captures the sequence reads in their entirety and makes them much easier to search. The standard approach for variant detection aligns input sequences to a reference, looks for mismatches, and then generates a list of variants, leaving out information about the reads from which the variant calls were made. In BioGraph, by contrast, reads are indexed in stacked graphs that make it easy for users to search for variations by following paths through the graphs. It's also very fast. For example, Mangubat said, a researcher could search for evidence of 1 million variations of all types in over 1,000 samples in about two days. Doing the same analysis would take months with other methods, she said.
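To make the idea of searching reads by following paths more concrete, here is a minimal Python sketch. It is not Spiral's implementation, whose internals the company has not described here. It assumes a simple k-mer index in which each k-mer points to the samples whose reads contain it, and it treats an allele as "supported" in a sample when every consecutive k-mer along the allele occurs in that sample's reads. The function names, the sample names, and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def build_kmer_index(samples, k):
    """Map each k-mer to the set of sample names whose reads contain it.

    `samples` is a dict of sample name -> list of read strings.
    This is an illustrative stand-in for a graph index, not BioGraph's format.
    """
    index = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                index[read[i:i + k]].add(name)
    return index


def samples_supporting(index, sequence, k):
    """Return the samples in which every consecutive k-mer of `sequence` occurs.

    Consecutive k-mers overlap by k-1 bases, so this amounts to checking that
    a path spelling `sequence` exists in a sample's k-mer graph. It is an
    approximation: the k-mers may come from different reads.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return set()
    supporting = set(index.get(kmers[0], set()))
    for kmer in kmers[1:]:
        supporting &= index.get(kmer, set())
    return supporting


# Toy data: sample_a carries the reference allele, sample_b the alternate
# allele (A -> T at the fifth base), and sample_c carries both.
ref_allele = "GATTACAGGC"
alt_allele = "GATTTCAGGC"
samples = {
    "sample_a": ["GATTACAG", "TTACAGGC"],
    "sample_b": ["GATTTCAG", "TTTCAGGC"],
    "sample_c": ["GATTACAG", "TTACAGGC", "GATTTCAG", "TTTCAGGC"],
}

index = build_kmer_index(samples, k=4)
print(sorted(samples_supporting(index, ref_allele, k=4)))  # ['sample_a', 'sample_c']
print(sorted(samples_supporting(index, alt_allele, k=4)))  # ['sample_b', 'sample_c']
```

Because the index is built once from the reads themselves, a new candidate variant can be checked against every sample with a handful of lookups instead of a fresh alignment and calling run.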
The company's approach could be used to detect various kinds of variants including SNPs, insertions and deletions, and structural variation in the context of both research and clinical applications. For the BCM collaboration, the partners are focusing specifically on detecting structural variation across large numbers of samples.
"Typing structural variants allows for them to be used just like SNPs when completing large-scale association studies to understand disease ... [however,] it has been very difficult to accurately identify structural variants using callers, especially when looking across larger studies, where the effect of high false discovery rates can compound errors," Salerno said in a statement. Spiral's technology makes it possible "to quickly search genetic data and set up queries to answer specific questions, [which] is crucial to the success of large scale genetics studies that we are embarking on, especially as we continue to add more individuals."
BioGraph adds to Spiral's existing suite of solutions based on its proprietary method of detecting structural changes in short-read sequencing data. It works by comparing unmapped reads to a reference set, de novo assembling input reads that are not perfect matches to the genome, and then mapping the remaining reads that are matched to the reference. So far the company has launched two products, Anchored Assembly and Onco Assembly, based on the methodology. The company has also developed the so-called Spiral Encrypted Compression, a lossless compression method that reduces sequence read and alignment files to half their original size.
Early this year, BCM researchers published a paper in BMC Genomics that described how they used both long- and short-read sequencing technologies, assembly methods, and mapping tools to characterize structural variation in a single diploid individual — the HS1011 genome. They hoped to create a human reference diploid genome that could serve as a standard for structural variant typing projects.
"The idea was to take every possible data type and program we could think of, run as much structural variation as we can, combine them, and then evaluate using hybrid short-read and long-read technology," Salerno told GenomeWeb. As part of that study, the researchers tested various structural variant calling methods that are designed to work with Illumina sequence data including Spiral Genetics' methodology. They compared the results provided by the variant callers to hybrid assemblies including one generated using a combination of Illumina and PacBio data.
"We really liked Spiral's method because they had a very high specificity, they didn't make a lot of false positive calls, [and their] results were corroborated by our hybrid assembly method," he said. That's important, according to Salerno, because it demonstrates the feasibility of capturing much of the same structural variation using Illumina data that would be found using longer-read technologies. Practically speaking, since many labs may not be able to do 10x PacBio coverage for all their whole genomes, "we are very interested in providing structural variation detection, assuming that people only have short-read, 100-150 paired-end base pair reads," he said.
For their next steps, the Baylor researchers wanted to be able to scale up their SV analysis to tens, hundreds, and even thousands of samples. For the BMC Genomics study, they worked with a single genome, but they've also used existing tools to explore SVs in sample sets of three and four individuals, Salerno said.
But the analysis becomes much more complicated as the number of samples increases. "You have this N+1 problem where you can take 100 samples, characterize them for variation, and make this nice summary, but then ... the question is how do you add that next hundred samples or the next however many samples to what you've previously done without re-computing the whole project?" he said. "That problem is sort of solved for the small variants, but we are thinking about how to do that for larger structural variation."
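The N+1 problem Salerno describes can be illustrated with a small Python sketch. It is a simplified assumption about how a cohort summary might be kept, not a description of the BCM or Spiral pipeline. The class name, the method names, and the variant identifiers are hypothetical. The summary is updated batch by batch, and each update reports which variants were seen for the first time, since those are the sites that earlier samples were never checked for.

```python
from collections import Counter


class CohortSummary:
    """Running per-variant carrier counts for a growing cohort (illustrative only)."""

    def __init__(self):
        self.n_samples = 0
        self.carrier_counts = Counter()

    def add_samples(self, batch):
        """Fold a new batch into the summary without touching earlier samples.

        `batch` is a dict of sample name -> set of variant IDs observed in
        that sample. Returns the variant IDs first seen in this batch, which
        are the sites that earlier samples would still need to be queried for.
        """
        seen_before = set(self.carrier_counts)
        for variants in batch.values():
            self.n_samples += 1
            self.carrier_counts.update(variants)
        return set(self.carrier_counts) - seen_before

    def carrier_frequency(self, variant):
        """Fraction of samples added so far in which `variant` was observed."""
        if self.n_samples == 0:
            return 0.0
        return self.carrier_counts[variant] / self.n_samples


summary = CohortSummary()
summary.add_samples({
    "s1": {"del_chr1_1000"},
    "s2": {"del_chr1_1000", "ins_chr2_500"},
})
# A later batch introduces a variant that s1 and s2 were never checked for.
needs_backfill = summary.add_samples({"s3": {"ins_chr2_500", "inv_chr3_42"}})
print(needs_backfill)  # {'inv_chr3_42'}
```

In this sketch the counts for previously known variants update cheaply, but the frequency of a newly seen variant is only trustworthy once the earlier samples have been checked for it, which is where fast queries against retained read data would come in.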
That's where BioGraph comes in. Its compression technique reduces sequence information in such a way that it has a much smaller storage footprint but still contains sufficient information for evaluating structural variants, Salerno said. This makes it possible to store a large number of BAM files at relatively low cost and still retain access to the original read data. That's useful because if a variant is not called, a researcher can go back and look for evidence of it in the input reads. Furthermore, the indexing technology makes it possible to run research queries and obtain results very quickly, in microseconds in some cases, so analysis speed is not a bottleneck, he noted. Also, researchers can vary the queries that they run on the data and identify different types of variants.