Three papers published in this month’s issue of Nature Genetics discuss the ability of current array platforms to meet the needs of researchers performing genome-wide association studies to identify SNPs and copy number variants of interest.
While all three papers, authored by researchers at the Broad Institute in Cambridge, Mass., and the University of Washington in Seattle, were intended to provide additional background on the high-density research tools sold by Affymetrix and Illumina, they also shed light on needs not met by the most recent generation of arrays. As such, they could serve as guideposts for future array designs.
In the first paper, “Integrated detection and population-genetic analysis of SNPs and copy number variation,” researchers at the Broad discuss the ability of Affymetrix’s SNP 6.0 Array to measure 906,600 SNPs and copy numbers at 1.8 million genomic locations [McCarroll, et al. Nature Genetics. 2008 Oct;40(10):1166-74].
Using the SNP 6.0, which Affy has been selling as a catalog product since mid-2007, lead author Steven McCarroll and others characterized 270 HapMap samples and developed a map of human copy number variation informed by integer genotypes for 1,320 copy number polymorphisms, or CNPs, that segregate at an allele frequency of greater than 1 percent.
The Broad team also found that around 80 percent of observed copy number differences between pairs of individuals were due to common CNPs with an allele frequency of greater than 5 percent, and that more than 99 percent of differences derived from inheritance rather than new mutations.
Additionally, most common, diallelic CNPs were in strong linkage disequilibrium with SNPs, and most low-frequency CNVs segregated on specific SNP haplotypes, according to the paper.
McCarroll told BioArray News this week that the Broad’s characterization of copy number variation could translate to better research tools.
“You could imagine a situation where CNVs are overwhelmingly rare mutations or you could imagine a situation where most CNVs, like most SNPs, are common in the human population. Our paper says that it is the latter,” he said. “That means you can use a catalog of known CNVs to design assays specific to the overall population going forward, rather than trying to find rare CNVs anew in an individual’s genome.”
Affymetrix has recently discussed its intention to launch a new generation of genotyping products sometime next year based on an internal scan of 1,300 individuals (see BAN 9/30/2008). While McCarroll declined to comment on whether he has had access to new tools from the array maker, he said that it was a “no brainer” that the platforms for investigating SNPs and CNVs would continue to evolve and improve.
“The current arrays are a huge improvement over what was available before and we should take it for granted that things will improve as we get better information to improve array design,” he said.
The second Nature Genetics paper details a new way of interpreting information from both Affymetrix’s and Illumina’s chips. Entitled “Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs,” the paper describes Birdsuite, a four-stage software analysis tool developed at the Broad for deriving integrated and mutually consistent copy number and SNP genotypes [Korn J, et al. Nature Genetics. 2008 Oct;40(10):1253-60].
According to the paper, Birdsuite sequentially assigns a copy number across regions of common copy number polymorphisms, calls SNP genotypes, identifies rare CNVs via a hidden Markov model, and generates an integrated sequence and copy number genotype at every locus.
“Birdsuite is also the first algorithm that takes a central idea of SNP analysis — that an empirical catalog of polymorphisms can be used to disentangle the problem of ab initio discovery from that of highly accurate measurement — and applies it to copy number analysis,” the authors state.
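To make the rare-CNV stage more concrete, here is a minimal sketch of how a hidden Markov model can turn noisy per-probe intensity ratios into copy number segments. It is not Birdsuite's code or its parameterization: the three states, the expected log2 ratios, the noise level, and the transition probability are all assumed values chosen for illustration.

```python
import math

# Hypothetical copy-number states and toy parameters (assumptions, not Birdsuite's values).
STATES = [1, 2, 3]                            # one-copy deletion, diploid, one-copy gain
EXPECTED_LOG2 = {1: -1.0, 2: 0.0, 3: 0.58}    # roughly log2(copies / 2)
SIGMA = 0.25                                  # assumed probe noise (standard deviation)
STAY = 0.98                                   # assumed chance of keeping the same state


def log_emission(value, state):
    """Log density of a Gaussian centred on the state's expected log2 ratio."""
    z = (value - EXPECTED_LOG2[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Log probability of moving between states at adjacent probes."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(log2_ratios):
    """Return the most likely copy-number state at each probe."""
    if not log2_ratios:
        return []
    # Assumed prior: most of the genome is diploid.
    score = {s: math.log(0.98 if s == 2 else 0.01) + log_emission(log2_ratios[0], s)
             for s in STATES}
    back = []
    for value in log2_ratios[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(value, cur))
            pointers[cur] = best_prev
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


ratios = [0.02, -0.05, 0.01, -0.98, -1.05, -0.95, -1.02, 0.03, -0.01, 0.04]
print(viterbi(ratios))  # four consecutive low probes are called as a one-copy deletion
```

Because switching states is penalized, a single noisy probe is left as diploid, while a run of consistently low probes is reported as one deletion segment.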
To date, CNV calls have been based on the results of genome-wide discovery algorithms, which can lead to false negatives and positives that might be tolerated in the creation of initial CNV catalogs but that create tremendous problems in association studies that rely on accurate genotyping across large cohorts, the authors argue.
According to the authors, the use of Birdsuite combined with higher-density hybrid arrays and maps of genome variation at lower frequencies and in more diverse samples should enable a “next generation of genome-wide association studies that provide unbiased, phenotype-driven genome screens for a deeper and more detailed examination of the role of DNA variation in human disease.”
Lead author Joshua Korn told BioArray News this week that Birdsuite was developed out of the “need to study SNPs and copy number in the same chip at the same time and resolve some of the questions of the copy number world.”
“Our studies of common diseases revolve around common variants,” Korn said. “To just find these variants and have a map of where they are located is important; to find rare ones and know the well-known ones so they don’t overwhelm your sample. That was one reason for the creation of Birdsuite, this differentiation between common and rare.”
The Broad made Birdsuite available online last month. Though the software was originally designed to work with the SNP 6.0 Array, Korn said he has been working with Illumina to make it useful for researchers using its Human 1M BeadChip.
“The algorithms themselves are general and not specific to any platform,” Korn said. “Illumina is now making a BeadStudio extension, and we are hoping it will be working in a few weeks.”
McCarroll, also an author on the paper describing Birdsuite, said that the way that CNV analysis is framed by the tool, in terms of common CNPs versus rare CNVs, “will define our approach to CNV analysis going forward.”
Room for Improvement
While the McCarroll and Korn papers provide information and tools on how best to use existing arrays for SNP and CNV analysis, the third Nature Genetics paper, authored by researchers in Evan Eichler’s lab at UW, details some of the shortcomings of these platforms in detecting rare CNVs.
Specifically, the paper, entitled “Systematic assessment of copy number variant detection via genome-wide SNP genotyping,” found that commonly used SNP platforms have “limited or no probe coverage for a large fraction of CNVs.”
Despite this, the authors inferred 368 CNVs from nine samples using Illumina SNP genotyping data and experimentally validated over two-thirds of these [Cooper G, et al. Nature Genetics. 2008 Oct;40(10):1199-203].
The paper also details a method called SNP-Conditional Mixture Modeling, or SCIMM, to genotype deletions using as few as two SNP probes. “SNP arrays can be used to infer the presence of many individually rare CNVs with reasonable specificity given a considerable probe count, and can furthermore be used to robustly genotype common deletions using as few as two probes,” the authors write.
“However, when considering balanced events, novel insertion sequences not represented in the reference assembly, and the bias against segmental duplications in array designs contrasted with the enrichment for CNVs both within and flanking duplicated sequences, we conclude that a large fraction of genomic variation cannot be captured by existing genome-wide SNP platforms,” the authors write.
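The idea of genotyping a common deletion from only two probes can be illustrated with a small mixture model. The sketch below is not SCIMM itself, whose details are not given here. It assumes each sample's two probe intensities have been normalized so that two copies give roughly 1.0 and zero copies roughly 0.0, averages them, and fits a three-component Gaussian mixture by expectation-maximization. The function names, starting values, and thresholds are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def responsibilities(xs, weights, means, sds):
    """E-step: posterior probability of each copy-number class for each sample."""
    resp = []
    for x in xs:
        dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        total = sum(dens)  # assumes x lies near the normalized 0-1 range
        resp.append([d / total for d in dens])
    return resp


def genotype_deletion(probe_pairs, iterations=50):
    """Return an assumed copy number (0, 1 or 2) per sample from two probes each."""
    xs = [(a + b) / 2 for a, b in probe_pairs]
    # Assumed starting points: intensity scales roughly linearly with copy number.
    means = [0.0, 0.5, 1.0]
    sds = [0.15, 0.15, 0.15]
    weights = [1 / 3, 1 / 3, 1 / 3]
    for _ in range(iterations):
        resp = responsibilities(xs, weights, means, sds)
        # M-step: re-estimate each class from its softly assigned samples.
        for k in range(3):
            rk = sum(r[k] for r in resp)
            if rk < 0.5:
                continue  # class is effectively empty; keep its assumed parameters
            weights[k] = rk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / rk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / rk
            sds[k] = max(math.sqrt(var), 0.05)  # floor avoids collapsing clusters
    resp = responsibilities(xs, weights, means, sds)
    return [max(range(3), key=lambda k: r[k]) for r in resp]


samples = [(0.05, 0.02), (0.03, 0.08), (0.52, 0.47), (0.49, 0.55), (0.45, 0.50),
           (1.02, 0.97), (0.95, 1.04), (1.01, 1.00), (0.98, 1.03)]
print(genotype_deletion(samples))
```

The approach depends on the deletion being common enough that the classes form distinct clusters across the cohort, which fits the authors' distinction between robustly genotyping common deletions and merely inferring rare CNVs.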
According to the authors, “significant improvements to array designs, perhaps in the form of a targeted CNV genotyping platform, may ultimately be necessary.” Lead author Greg Cooper told BioArray News last week that there are “a lot of variants that are represented by an insufficient number of probes [on current platforms], which makes it hard to genotype them.” He said that future arrays would benefit from greater probe coverage as well as better probe design.
“The trick is to design a lot of probes and do a lot of experiments to get better probe response from both the chemistry and computational level,” he said. He also ruled out the idea that one platform, such as the SNP-CNV chips available from Affymetrix or Illumina or the second-generation sequencers sold by Illumina, Applied Biosystems, and others, would provide the necessary muscle for SNP and CNV analyses.
“To be truly comprehensive, it’s going to require a combination of array-based and sequencing-based platforms,” he said. “There is no single solution available at the moment.”