Isis Innovation, Oxford University’s technology-transfer company, said this week that it is looking to commercialize a software package called QuantiSNP that analyzes gene copy number variation and SNPs.
If successfully commercialized, the software, developed by a team of researchers at Oxford, the Wellcome Trust Centre for Human Genetics, the UK’s Medical Research Council, and the University of Toronto, would compete against tools from several bioinformatics companies, including BioDiscovery and Golden Helix, which both released copy number-analysis software packages over the last year.
Both firms have recently stepped up their efforts in this area. In January, Golden Helix formed a partnership with six academic research groups in an effort to improve the Copy Number Analysis Module for its SNP & Variation Suite [BioInform 01-18-08]. And last week, BioDiscovery said that it had signed a deal with Roche NimbleGen to integrate that company’s microarrays with BioDiscovery’s Nexus Copy Number software [BioInform 06-06-08].
Chris Holmes, a statistician at Oxford University who co-developed QuantiSNP, noted that the jury is still out on how these packages may compare in terms of performance because there has not yet been a head-to-head comparison of the tools. “Hopefully some of the groups will report independent validation studies,” he said in an interview with BioInform.
QuantiSNP uses an objective Bayes hidden Markov model to detect copy-number variations in data from Illumina’s BeadArray SNP-genotyping array platform. The Oxford researchers are currently working to extend QuantiSNP to other array platforms.
In a paper published in the March 7, 2007, issue of Nucleic Acids Research, the researchers wrote that compared to Illumina’s BeadStudio LOH+ software, QuantiSNP was better at identifying “both novel and validated” copy number variants and was able to “significantly improve the accuracy of segmental aneuploidy identification,” or detecting chromosomal copy number changes that can be involved in developmental defects and cancer.
In the paper, the scientists showed that QuantiSNP could produce “accurate copy number detection and high-resolution breakpoint identification.” The software inferred copy number variation for a total of 18 samples that had previously been characterized: 15 samples with differing genetic alterations and three control samples. It mapped 12 of 15 breakpoints, whereas BeadStudio mapped six.
With the Markov structure, “if I am trying to call the copy number variation at a particular locus, [then] knowledge of what is happening to my neighbors gives me information about my copy number,” Holmes said. The rationale is that copy number variations tend to span a few SNPs, lending “strength to joint inference,” he explained.
The QuantiSNP algorithm involves three components: the hidden Markov model, allele frequencies, and Bayesian statistics. “It means we have a probability score assigned to each possible copy number variant type,” which is an important characteristic for association studies, particularly in the case of complex diseases, Holmes said.
In an association study, researchers might run their genotyping algorithm, take “the best guess” and then plug that into a pattern-recognition tool, he said. “That’s fine if you can make very, very accurate calls, but if your calling is uncertain, which is particularly true for copy number variation, then … you’ve got extra uncertainty, which you haven’t accommodated.”
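The difference between a “best guess” and a probability score can be sketched in a few lines of Python. This is only an illustration of the general idea, not QuantiSNP’s code: the sample names, the posterior probabilities, and the helper functions `best_guess` and `expected_copy_number` are all hypothetical.

```python
# Hypothetical posterior probabilities over copy numbers 0..4 for three samples.
# The numbers are made up for illustration and do not come from QuantiSNP.
COPY_NUMBERS = [0, 1, 2, 3, 4]

posteriors = {
    "sample_a": [0.00, 0.05, 0.90, 0.05, 0.00],  # confident two-copy call
    "sample_b": [0.00, 0.45, 0.55, 0.00, 0.00],  # uncertain: one or two copies
    "sample_c": [0.00, 0.00, 0.10, 0.80, 0.10],  # probable duplication
}


def best_guess(probs):
    """Hard call: the single most probable copy number."""
    return max(COPY_NUMBERS, key=lambda cn: probs[cn])


def expected_copy_number(probs):
    """Soft call: the probability-weighted mean copy number."""
    return sum(cn * p for cn, p in zip(COPY_NUMBERS, probs))


for name, probs in posteriors.items():
    hard = best_guess(probs)
    soft = expected_copy_number(probs)
    # The probability of the hard call shows how much uncertainty it hides.
    print(f"{name}: best guess={hard} (p={probs[hard]:.2f}), expected={soft:.2f}")
```

For the hypothetical `sample_b`, the hard call reports two copies and silently drops a 45 percent chance of a deletion, while the expected value carries that uncertainty into whatever downstream test is run.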
Making the Call
Holmes said that QuantiSNP addresses this problem by adding statistical rigor to the analysis process. Compared with other calling algorithms, “the use of [a] probability model with the known Markov structure gives it the increased precision,” he said.
Every calling algorithm must generate a list of calls ranked according to degree of confidence. “We believe QuantiSNP is the most accurate way of giving you that list,” said Holmes.
The next issue is selecting a cutoff value on that list. “If you put the cut right at the bottom of the list, you are going to get a lot of false positives,” he said, but “if you put that cutoff right at the very top … then of course you have a low false-positive” rate.
“How you threshold that … is a tradeoff between false positives and false negatives,” he said.
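The tradeoff Holmes describes can be made concrete with a toy ranked list. The confidence scores and the true/false labels below are invented for illustration, and `errors_at_cutoff` is a hypothetical helper, not part of any of the packages discussed here.

```python
# Hypothetical ranked calls: (confidence score, whether the call is a real variant).
calls = [
    (0.99, True), (0.95, True), (0.90, False), (0.80, True),
    (0.60, False), (0.55, True), (0.30, False), (0.10, False),
]


def errors_at_cutoff(ranked_calls, cutoff):
    """Count errors when every call scoring at or above `cutoff` is accepted.

    Returns (false_positives, false_negatives).
    """
    false_pos = sum(1 for score, real in ranked_calls if score >= cutoff and not real)
    false_neg = sum(1 for score, real in ranked_calls if score < cutoff and real)
    return false_pos, false_neg


# Sweep the cutoff from the bottom of the list to the top.
for cutoff in (0.05, 0.50, 0.70, 0.97):
    fp, fn = errors_at_cutoff(calls, cutoff)
    print(f"cutoff={cutoff:.2f}: false positives={fp}, false negatives={fn}")
```

In this made-up list, a cutoff at the bottom accepts every call and lets in all four false positives, while a cutoff near the top admits none but misses three real variants.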
As Holmes explained, researchers have been looking at copy number variation signals, “like a time series.” On one axis is the genomic location and on the other signal intensity. “And you are trying to segment that into deletions where the signal drops and duplications where the signal rises, but actually you have a third axis, which is the allele frequency.”
Holmes said that QuantiSNP takes into account not only the probe intensity, measured as the intensity of a single SNP normalized against a median of reference samples, but also the allele frequency. “This is something people haven’t considered previously,” he said.
Where a string of homozygotes is detectable, he explained, that is a “strong indication” that there is a deletion. In his assessment, using every piece of useful information “increases your precision and decreases your false positive rate.”
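A rough sketch of these two sources of evidence follows. It assumes the normalized intensity is expressed as a log2 ratio against the reference median and that allele frequency is a B-allele fraction between 0 and 1. The tolerance value, the toy data, and the function names are hypothetical, and QuantiSNP’s actual model combines these signals probabilistically rather than with fixed rules.

```python
import math
from statistics import median

# Assumed tolerance: a B-allele frequency this close to 0 or 1 counts as homozygous.
HOMOZYGOUS_TOLERANCE = 0.15


def log_r_ratio(sample_intensity, reference_intensities):
    """log2 of a SNP's intensity over the median intensity of the reference samples."""
    return math.log2(sample_intensity / median(reference_intensities))


def is_homozygous(b_allele_freq, tolerance=HOMOZYGOUS_TOLERANCE):
    """True when the B-allele frequency sits near 0 (AA) or near 1 (BB)."""
    return b_allele_freq <= tolerance or b_allele_freq >= 1.0 - tolerance


def longest_homozygous_run(b_allele_freqs):
    """Length of the longest stretch of consecutive homozygous markers."""
    longest = current = 0
    for baf in b_allele_freqs:
        current = current + 1 if is_homozygous(baf) else 0
        longest = max(longest, current)
    return longest


# Toy data for six neighboring markers; the middle four mimic a one-copy deletion.
references = [0.9, 1.0, 1.1]
sample_intensities = [1.0, 0.5, 0.55, 0.45, 0.5, 1.05]
b_allele_freqs = [0.50, 0.02, 0.98, 0.01, 0.99, 0.48]

log_ratios = [log_r_ratio(x, references) for x in sample_intensities]
print([round(r, 2) for r in log_ratios])
print("longest homozygous run:", longest_homozygous_run(b_allele_freqs))
```

Neither signal is conclusive alone, since runs of homozygotes also occur in normal two-copy regions. The point of the sketch is that a drop in intensity that coincides with a homozygous stretch is stronger evidence than either observation by itself.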
QuantiSNP also addresses another hurdle in analyzing copy number variation — the signal-to-noise ratio, which is “much more challenging” than it is in genotyping, Holmes said.
Copy number variation is “a noisy problem, but the key thing is that the noise is independent from marker to marker, [but] the signal isn’t,” he said. “The signal kind of persists across the markers,” so that a consistently low signal indicates a deletion and a consistently high signal indicates a duplication.
“The algorithm is designed to pick up stretches of this increased or decreased signal and the hidden Markov model is an extremely computationally efficient way of doing that,” he said.
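How a hidden Markov model borrows strength from neighboring markers can be shown with a generic three-state model and the standard forward-backward algorithm. This is a textbook sketch under assumed parameters, not the objective Bayes model used in QuantiSNP: the state means, the noise level, the probability of staying in a state, and the toy signal are all hypothetical.

```python
import math

STATES = ["deletion", "normal", "duplication"]
MEANS = {"deletion": -1.0, "normal": 0.0, "duplication": 0.58}  # assumed log2-ratio means
SIGMA = 0.4  # assumed per-marker noise (standard deviation)
STAY = 0.9   # assumed probability that the next marker keeps the same state


def emission(state, x):
    """Gaussian density of observing log ratio x in the given state."""
    z = (x - MEANS[state]) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2 * math.pi))


def transition(prev, cur):
    """Probability of moving from state prev to state cur between adjacent markers."""
    return STAY if prev == cur else (1 - STAY) / (len(STATES) - 1)


def normalize(weights):
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def posterior_states(signal):
    """Forward-backward: per-marker posterior probability of each state."""
    if not signal:
        return []
    # Forward pass, rescaled at every marker to avoid numerical underflow.
    forward = [normalize({s: emission(s, signal[0]) / len(STATES) for s in STATES})]
    for x in signal[1:]:
        prev = forward[-1]
        forward.append(normalize({
            s: emission(s, x) * sum(prev[r] * transition(r, s) for r in STATES)
            for s in STATES
        }))
    # Backward pass, also rescaled at every marker.
    backward = [{s: 1.0 for s in STATES}]
    for x in reversed(signal[1:]):
        nxt = backward[0]
        backward.insert(0, normalize({
            r: sum(transition(r, s) * emission(s, x) * nxt[s] for s in STATES)
            for r in STATES
        }))
    # Combine both passes into a posterior distribution at each marker.
    return [normalize({s: f[s] * b[s] for s in STATES}) for f, b in zip(forward, backward)]


# Toy signal: a four-marker deletion whose third marker (-0.3) looks nearly normal.
signal = [0.1, -0.2, 0.05, -0.9, -1.1, -0.3, -1.0, 0.0, 0.1]
posteriors = posterior_states(signal)
state_calls = [max(p, key=p.get) for p in posteriors]
print(state_calls)
```

Taken in isolation, the marker at -0.3 is closer to the normal mean than to the deletion mean, but its neighbors pull the posterior toward a deletion. The two passes over the data each take time linear in the number of markers, which is the source of the efficiency Holmes mentions.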
Hidden Markov models, though fast, have been criticized for being inaccurate, but Holmes called QuantiSNP “an extremely efficient inference.”
“You get a big reduction in false positives because of that, [and] it allows you to filter out the noise effectively,” he said.
“We are extremely confident about the precision of the method; many people have evaluated it and the feedback has been very positive,” he said. While he declined to disclose any users of the software, he said that “a number of large companies” are currently evaluating it.
“QuantiSNP is not only useful for copy number variants, but also for the detection of aneuploidies,” Ioannis Ragoussis, head of genomics at the Wellcome Trust Centre for Human Genetics at the University of Oxford, told BioInform via e-mail. “This can lead to an immediate diagnosis relevant for congenital defects or genetic diseases such as Duchenne muscular dystrophy,” he said.
QuantiSNP can be linked to other software tools that are currently under development within the Oxford group, including tools that process data from different sample collections and others that perform formal statistical analysis, said Ragoussis. “We have a tool able to do the former now, and Chris Holmes is working on tools that will allow the latter; we hope that we will have a fully integrated solution in the near future.”
The Power of Numbers
Commercial bioinformatics firms in the copy number-analysis market don’t perceive QuantiSNP as a competitive threat just yet. As BioDiscovery CEO Soheil Shams explained to BioInform in an e-mail, whereas QuantiSNP and other academic tools have tended to focus only on the calling algorithm, commercial software tools like BioDiscovery’s Nexus also support downstream analysis.
Shams said that Nexus was designed to analyze copy number calls for thousands of samples and can “quickly … identify regions of statistically significant copy number change between two populations.”
Nexus “can perform unsupervised clustering using a unique approach based on copy number profiles on all samples and then perform survival analysis on identified clusters,” he said. The software can also integrate gene-expression results with copy number data to identify genomic “hot spots,” he said.
These features “place Nexus Copy Number in a unique class by itself at this time and we currently don’t see any academic or commercial package that provides such a powerful system in a very user-friendly interface,” he said.
Nexus uses a different calling algorithm than QuantiSNP to estimate breakpoints and DNA copy number at various genomic regions. The Rank Segmentation algorithm is based on the circular binary segmentation algorithm “with a number of modifications that make it more robust to noise and greatly improve performance,” he said.
“We have not done head-to-head comparison of the algorithms together but have been told by some of our users who have done tests that Rank Segmentation has performed better for them than QuantiSNP and some of these results are being submitted by them for publication at this time,” Shams said.
Shams also pointed out that Nexus supports SNP file types from Illumina and Affymetrix as well as two-color CGH data from Agilent Technologies and Roche NimbleGen.
Shams added that BioDiscovery has established a workflow for scientists in this area. “We have designed Nexus Copy Number in such a way that we can directly link to Bioconductor and run any calling algorithm that the user wants or import data after copy number estimates are performed in an external package,” he said.
“We believe that although the calling algorithm is very important, it is just the initial step in the copy number-analysis process.”
Golden Helix CEO Christophe Lambert echoed Shams, noting that his company’s software also goes beyond QuantiSNP’s focus on calling algorithms to cover the entire “soup-to-nuts workflow” of processing data and removing batch effects for association studies analyzing thousands of samples.
“Historically much copy number analysis has been focused on a few samples,” he said. “It’s a different ball game when you go up to thousands of patients and doing whole-genome arrays where you have 1 [million] to 2 million markers per patient.”
Lambert said that while circular binary segmenting has been the “gold standard” for segmentation, it’s “a fairly compute-intensive algorithm that involves some approximation.”
Golden Helix’s segmentation algorithm instead uses dynamic programming to “exhaustively look through all possible segmenting of the data to find the statistically optimal one,” Lambert said. “We find that we can do better than binary segmenting by a little bit, but you are pretty much hitting the limits of signal-to-noise ratio of the data.”
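The idea of exhaustively searching all segmentations with dynamic programming can be sketched as follows. This is a generic least-squares formulation offered as an assumption, not Golden Helix’s algorithm: it fixes the number of segments in advance and scores a segmentation by its within-segment squared error, whereas a production tool would also need a statistical rule for choosing the number of segments. The function name and toy data are hypothetical.

```python
def optimal_segments(values, num_segments):
    """Split `values` into `num_segments` contiguous pieces with minimal total
    within-segment squared error. Returns a list of (start, end) index pairs,
    with `end` exclusive."""
    n = len(values)
    if not 1 <= num_segments <= n:
        raise ValueError("num_segments must be between 1 and len(values)")

    # Prefix sums let the squared error of any slice be computed in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def cost(i, j):
        """Sum of squared deviations from the mean for values[i:j]."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting values[:j] into exactly k segments.
    best = [[inf] * (n + 1) for _ in range(num_segments + 1)]
    back = [[0] * (n + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # i is where the last segment starts
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i

    # Walk the back-pointers from the end to recover the segment boundaries.
    bounds, j = [], n
    for k in range(num_segments, 0, -1):
        i = back[k][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]


# Toy log ratios: normal, then a three-marker deletion, then normal again.
log_ratios = [0.0, 0.1, -0.1, -1.0, -0.9, -1.1, 0.0, 0.1]
segments = optimal_segments(log_ratios, 3)
print(segments)
```

Because every possible start point of the last segment is tried for every prefix, the result is the true optimum under this cost rather than an approximation, at a price that grows with the square of the number of markers.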
“Circular binary segmenting has outperformed hidden Markov model approaches of the past,” he said. “Whether QuantiSNP, which is a hidden Markov approach, has improved over those past attempts, we would like to see published results,” Lambert said.
He noted that it is very difficult to assess the performance of different packages because there is currently no “gold standard” data set that can be used as a benchmark for comparison.
“The large datasets that are used, such as the HapMap data, have not been exhaustively cataloged in terms of [which] precise copy number variations are in that data,” he said.
“To say, ‘We find regions that are published’ — anybody can do that with even the worst algorithms, so the real challenge in comparing algorithms is having a gold standard,” he said.
“I cannot say that QuantiSNP is better or worse than other algorithms,” he added. “We would love to see some real benchmarks of the algorithms.”
However, like BioDiscovery’s Shams, he noted that his company’s software is likely to offer a number of advantages over QuantiSNP because it was designed to handle preprocessing for many thousands of samples and supports both the Illumina and Affymetrix platforms.
He stressed the company’s focus on accounting for batch effects, noting that if researchers “have not corrected for batches, it is a garbage-in, garbage-out scenario.”
Lambert said that the market for copy number-analysis software is likely to grow because “many academic institutions” doing whole-genome studies have only just “started considering” that the same genotyping data can be reprocessed to look at copy number variation.
“You get two experiments for the price of one,” he said. “Everyone who has done a whole-genome study is a candidate for a whole-genome copy number variation study.”
For scientists weighing various tools and vendors, he said, an important question is whether the provider has “done a lot of whole-genome analysis studies themselves.”
Lambert explained that Golden Helix has more than a dozen collaborators, and has analyzed data from 20 whole-genome copy number-variation studies. “What we are seeing is giving us much insight into how to analyze data from the get-go,” he said. “We have had to eat our own dog food.”
Lambert is also co-chair of the Copy Number Variation Data Analysis Team within the Genome-Wide Association Working Group, part of the Microarray Quality Control (MAQC) Consortium. The working group has a number of goals, including identifying the sources of variation in results, finding associations between copy number variations and phenotype, and building and validating predictive models for those associations, he explained.
“Finding very large regions of variation — that’s not too hard to do,” he said, noting that challenges remain in finding the relationship between copy number variation and disease. “There are very few published results to date and a lot of them are qualitative in nature versus [having involved] rigorous statistics,” he said.
Final results from any copy number analysis are heavily influenced by the analysis software, but also by issues around experimental design, randomization on the plates, and sample preparation, Lambert said.
So far, he said, the MAQC team is finding that researchers might have to “go back to some of the fundamentals” around data generation as well as data analysis.