Focus Biology, a bioinformatics startup whose founders originally wrote software to optimize manufacturing processes for the semiconductor industry, has developed a new approach to SNP biomarker discovery that it hopes to develop further through research collaborations — and ultimately license to a larger firm.
The two-person company, based in San Jose, Calif., recently posted a whitepaper on its website describing the method’s ability to predict SNP biomarkers using data from a genome-wide association study on late-onset Alzheimer’s disease.
The method, based on a concept called “multi-allele ensemble discovery,” differs from other classification methods that identify lists of genes, SNPs, or other molecules that collectively act as a “signature” or “fingerprint” for a given phenotype.
Rather than a list of single molecules, the biomarker set that the Focus software predicts comprises SNP allele pairs. Each allele pair is 100-percent predictive for a given phenotype, but only within a very small subset of the population. So after identifying those allele pairs that are 100-percent predictive for those subsets — a very large combinatorial problem in its own right — the method then determines which of those pairs it takes in aggregate to account for all samples in the population. The end result is a list of allele pairs that the company calls a multi-allele ensemble.
“We think of each multi-allele as a classifier, but with a limited domain,” Jim Shaw, chief technology officer at Focus, told BioInform. “What we do is then look for the aggregation of alleles that together cover a large portion of the population.”
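The two-stage idea described above can be illustrated with a minimal Python sketch. It is not Focus's proprietary algorithm, whose details are not public. It assumes genotypes coded 0, 1, or 2, enumerates pairs by brute force, and substitutes a standard greedy set-cover heuristic for the aggregation step. The function names and toy data are hypothetical, and only the case phenotype is handled.

```python
from itertools import combinations

def perfect_pairs(genotypes, labels):
    """Map each genotype pair seen only in cases to the case indices it covers.

    genotypes: one list per sample, each entry coded 0, 1, or 2 (assumed coding)
    labels: one boolean per sample, True for a case
    """
    n_snps = len(genotypes[0])
    covered = {}
    for i, j in combinations(range(n_snps), 2):
        carriers = {}
        for idx, row in enumerate(genotypes):
            carriers.setdefault((i, row[i], j, row[j]), []).append(idx)
        for key, samples in carriers.items():
            # Keep the pair only if every carrier is a case.
            if all(labels[s] for s in samples):
                covered[key] = set(samples)
    return covered

def greedy_ensemble(covered):
    """Greedy set cover: repeatedly take the pair covering the most uncovered cases."""
    remaining = set().union(*covered.values())
    ensemble = []
    while remaining:
        best = max(covered, key=lambda k: len(covered[k] & remaining))
        ensemble.append(best)
        remaining -= covered[best]
    return ensemble

# Toy data (hypothetical): three cases followed by two controls, three SNPs.
genotypes = [[0, 1, 2], [0, 1, 0], [2, 2, 1], [0, 0, 2], [2, 1, 1]]
labels = [True, True, True, False, False]

covered = perfect_pairs(genotypes, labels)
ensemble = greedy_ensemble(covered)
print(ensemble)
```

Each tuple in the result reads as (first SNP index, its genotype, second SNP index, its genotype). Greedy selection is a common approximation for set cover; it does not guarantee the smallest possible ensemble.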
Shaw said that the company began developing the method about three years ago. At the time, he and Focus co-founder Gerri Shaw were working for another firm they had founded called BergenShaw, which had developed process-optimization software to improve quality control for laboratory-information systems.
BergenShaw’s roots were in process optimization for the semiconductor, flat-panel display, and hard disk drive industries, but it turned to the life-science market following the dot-com bust. It soon found a following among organizations with large-scale sequencing pipelines, including Incyte, Celera Genomics, Celera Diagnostics, and the Joint Genome Institute.
Shaw said that the SNP biomarker-discovery method grew out of a side project BergenShaw had undertaken with Celera Diagnostics that involved analyzing genotyping data.
“The thing that struck us most was that people were still working from what I call the ‘my favorite gene’ perspective – ‘I’ve been working on this gene for the last 20 years, and I know I’m going to find some important use for it,’” he recounted.
“And that struck us as being like a process engineer saying, ‘I’m really responsible for this particular type of instrument in this process and I’m going to find all the areas that instrument may be contributing to in the process.’ Well, they may be able to do that, but we thought a better perspective might be the perspective we applied in process optimization: Let the process itself tell us what is going on,” said Shaw.
“What we had found in the past when looking at large factory operations, like gene sequencing or disk-drive manufacturing … is that, often, what appeared to be a problem at the end of the manufacturing line … wasn’t caused by a single event or a single thing. It was caused by several things marginally changing,” he said. As a result, they decided to look for combinations of factors that might be correlated with phenotype.
According to Shaw, however, the challenge of looking for SNP combinations is that each SNP can have three possible genotypes: homozygous for the major allele, homozygous for the minor allele, or heterozygous.
Pairing up each of these alleles and determining the predictive value of the pairs is therefore “a very large combinatorial problem,” Shaw said. “If you were to take this data set and try to run it with a very large number of computers and look for all the combinations of all the particular SNP alleles, it would take years to complete.”
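The scale of the search space can be estimated with a few lines of Python. The sketch below only counts genotype combinations; it says nothing about how Focus prunes the search. The panel size of 500,000 SNPs is an assumed, illustrative figure that does not come from the company, and the function name is hypothetical.

```python
from math import comb

def genotype_combinations(n_snps, order=2):
    """Count genotype combinations across all SNP groups of the given order.

    Each SNP has 3 genotypes, so a group of `order` SNPs has 3**order
    genotype combinations, and there are C(n_snps, order) such groups.
    """
    return comb(n_snps, order) * 3 ** order

# 500,000 SNPs is an assumed, illustrative panel size.
for order in (2, 3, 4):
    print(order, genotype_combinations(500_000, order))
```

Under that assumption, pairs alone yield roughly 1.1 trillion genotype combinations to score, and the count grows steeply for triples and beyond.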
In order to reduce that time, Focus has developed a proprietary algorithm that is able to identify “features” in the data set that are “useful in the data-mining process,” he said.
Shaw said that Focus can typically analyze a GWAS data set in four to six weeks and that he has used the method to analyze a number of GWAS data sets, but these projects have all been under non-disclosure agreements with collaborators. The whitepaper, therefore, represents the first time the company has been able to publish its results.
“We were looking for the opportunity to work with a data set that we could publish on so that we could demonstrate the power of our software,” Shaw said. “We had good results but unfortunately we had no way of actually demonstrating that to the public.”
Last year, researchers from the Translational Genomics Research Institute published an Alzheimer’s disease study in the journal Neuron based on three genome-wide SNP-association data sets and subsequently released the data sets through the TGen website.
Focus used one of the three TGen data sets to identify a multi-allele ensemble that would predict which ApoE-epsilon-4 carriers developed late-onset Alzheimer’s disease, and used the other two data sets to test its efficacy. The sensitivity of the resulting multi-allele ensemble was “at or near” 100 percent in all three data sets, while the specificity was 100 percent in all three, according to the company.
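For readers unfamiliar with the two metrics, sensitivity is the fraction of true cases a classifier flags, and specificity is the fraction of true controls it correctly leaves unflagged. A minimal Python sketch follows; the function name and the toy predictions are hypothetical and are not drawn from the Focus whitepaper. It assumes both cases and controls are present.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for boolean predictions and labels.

    Assumes at least one case and one control, otherwise a division by
    zero would occur.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)              # cases flagged
    fn = sum((not p) and a for p, a in pairs)        # cases missed
    tn = sum((not p) and (not a) for p, a in pairs)  # controls left unflagged
    fp = sum(p and (not a) for p, a in pairs)        # controls flagged
    return tp / (tp + fn), tn / (tn + fp)

# Toy example (hypothetical): four cases, four controls, one case missed.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False]
sens, spec = sensitivity_specificity(predicted, actual)
print(sens, spec)
```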
That work is described in the recently released whitepaper, and Shaw said the company is preparing a manuscript with more details about its methodology that it plans to submit to a peer-reviewed journal in about two months.
Wendell Jones, senior director of statistics and bioinformatics at genomic services firm Expression Analysis, noted that it will be key for Focus to provide more information to prove that it performed its training, testing, and validation properly.
“Without this information, I cannot be conclusive in an opinion,” Jones told BioInform via e-mail. “What I can say is that their claim is extraordinary, which implies that their methods must pass extraordinary scrutiny.”
Jones added that the use of ensemble methods “is not new,” but the way they are using it appears to be “somewhat novel.”
The main challenge for the approach, he said, is the dimensionality problem. “There are many SNPs and combinations of SNPs that will have these types of characteristics, but they will do so by chance and won’t hold up under independent and prospective validation.” The risk, he said, is that the method might “memorize the training data and cannot generalize to new instances (new patients).”
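Jones's concern can be illustrated with a small simulation. The sketch below generates purely random genotypes, assigns case and control labels that have no relation to them, and counts how many genotype pairs nonetheless appear only in cases. The sample size, SNP count, uniform genotype frequencies, and function name are all assumptions chosen for illustration; the sketch does not evaluate the Focus method itself.

```python
import random
from itertools import combinations

def count_chance_pairs(n_samples=40, n_snps=30, seed=0):
    """Count genotype pairs carried only by 'cases' in purely random data."""
    rng = random.Random(seed)
    genotypes = [[rng.randrange(3) for _ in range(n_snps)]
                 for _ in range(n_samples)]
    # Labels carry no genetic signal: the first half are arbitrarily cases.
    labels = [idx < n_samples // 2 for idx in range(n_samples)]
    count = 0
    for i, j in combinations(range(n_snps), 2):
        carriers = {}
        for row, is_case in zip(genotypes, labels):
            carriers.setdefault((row[i], row[j]), []).append(is_case)
        # A pair looks "100-percent predictive" if every carrier is a case.
        count += sum(all(flags) for flags in carriers.values())
    return count

print(count_chance_pairs())
```

Any pair counted here is "perfectly predictive" within the simulated sample by chance alone, which is why testing on independent data sets matters for a claim of this kind.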
Shaw said that Focus is currently looking to work with other research groups interested in sharing their genome-wide association study data sets so that the company can further validate and refine the method. In addition, Focus plans to analyze publicly available GWAS data from the Genetic Association Information Network.
“Our intention was to get this paper out and to get some interest in what we’re doing,” but ultimately “our real intention is to license our technology to an established company,” he said.
“We don’t want to build a fully integrated company. What we want to do is continue to refine this technology and a couple of other follow-on technologies that apply to genomics, proteomics, metabolomics, and so on, that are in the same vein.”
Shaw said that so far, Focus has not discussed the technology with potential licensees because it was waiting to have something to show them. “When you talk to someone about a new technology, everybody has the opportunity to say, ‘I’m not sure it’s going to work.’ So what you really need is a demonstration of it really working,” he said.
“If the dog eats the dog biscuits, then perhaps they’re worth selling, and that’s the instance here,” he said. “We wanted the technology to be demonstrated, and that’s why we did this study and why we wrote this paper and it’s why we’re going to do a few other studies.”