The diverse world of structural genomic variation research — which includes investigations into copy number variation and mapping myriad inserted, deleted, inverted, and translocated genes — is undoubtedly providing investigators with an exciting and promising source of data on human diversity and disease susceptibility. But if a Nature paper published by the 1,000 Genomes Project's Structural Variant group in February is any indication, eureka moments in this field may be a bit further off than researchers originally hoped. The report — which represents the culmination of roughly two years' work involving more than 50 investigators from across the world — describes the group's construction of a CNV map based on whole-genome sequencing data from 185 human genomes. It encompasses roughly 22,000 deletions and 6,000 insertions and tandem duplications. Using a genotyping approach that examined several partial- and whole-gene deletions, the researchers reported a depletion of gene disruptions among high-frequency deletions as well as differences in the size spectra of structural variants.
While the team produced a robust resource for future sequencing-based association studies, Charles Lee, the group's co-chair and director of Harvard Medical School's Molecular Genetic Research Unit, says the take-home message is that considerable barriers must still be overcome before the field can move forward. "We found that we needed new algorithms to identify structural variants and we ended up creating 19 different computer programs. No one program was sufficient — we had to combine multiple programs to maximize the amount of structural variation we are picking up," Lee says. "But at the end of the day ... even at high coverage, we are picking up probably about 82 percent of known deletions, about 15 to 18 percent of known duplications, and essentially no inversions or translocations that we can verify at this stage — so we have a long way to go. If that's where we're at with over 50 investigators, 19 algorithms, and two years of work, we have a long ways to go."
Stephen Chanock, chief of translational genomics at the National Cancer Institute, says that while the generation of resources like the 1,000 Genomes Project is important to better explore genomic structural variation, the need for analytical accuracy will be the pinch that wakes the dreamers up to face reality. "The excitement of having more and more tools always [brings] us back to the very important question of having to validate or replicate, and I worry that that's getting lost as everyone gets so excited about the next really cool tool. Those are all in silico observations; you still have to go back and make sure that variant is stable and matches what you think you've seen when you actually sequence a genotype," Chanock says. "CNVs, I think, are very interesting for rare or less-common diseases, although the common disease, common variant hypothesis for CNVs has been not quite as exciting as everyone had hoped. It didn't have the drama that everyone thought was there, unlike [in] the common SNP world. ... Ultimately, the technologies are making it easier and we may be going after uncommon and rare variants if whole-genome sequencing kicks in, or at least much denser chips become available."
While there are many tools available to identify structural variants, the question of determining which reported variants are actually valid remains a large challenge that bioinformatics tools alone cannot deal with. "I'm not saying any one study is bad, but there is an under-appreciation for the amount of false-positives in the structural variation data that we're generating as a scientific community from next-generation sequencing data," Harvard's Lee says. "My advice to people who are analyzing next-generation sequence data in structural variants — especially for whole genome analyses — is to use as many technologies to complement their analysis as possible. For example, if you're whole-genome sequencing a given individual, maybe use different insert-sized libraries complemented with arrayCGH data. And, by all means, perform a significant amount of validation so you can minimize the amount of false-positive data."
The limited throughput Lee and his colleagues face when using multi-color probes to examine the structure of repeated genes with the fiber FISH technique is just one area in need of improvement. "It's just not high-throughput enough, so if someone could come up with a high-throughput method, that would be an excellent way to genotype some of the more copy number variable regions," he says. "I think also the arrays themselves are continually being improved in terms of what probes are being placed on there to genotype specific CNVs, but there needs to be more effort put into the technology for accurate genotyping of CNVs." For now, Lee says, the only workaround is putting in hours of labor to get the job done.
The most significant advance in CNV genotyping over the last few years has been high-resolution array comparative genomic hybridization. This technique enabled the very first studies that mapped structural variation genome-wide in 2003 and 2004. Since then, advancements in high-throughput paired-end mapping, read depth of coverage analysis, split read analysis, and assembly have all seriously ramped up research efforts. "We consider massive paired-end mapping a key technique to identify structural variation and genomic rearrangements," says Jan Korbel, group leader at the European Molecular Biology Laboratory. Korbel and his colleagues at Yale University and 454 Life Sciences developed an approach for massively parallel paired-end sequencing that is helping the team to identify germ-line structural rearrangements in connection with the 1,000 Genomes Project and the International Cancer Genome Consortium. "The key advantage of paired-end mapping [is that] it allows a fairly deep and quick and cheap sequencing [of] structural aberrations in the genome by recognizing ends of long fragments and mapping them," Korbel says.
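The logic behind paired-end mapping, as Korbel describes it, can be illustrated with a minimal Python sketch. This is not the pipeline his group built; the `ReadPair` record, the `classify_pair` function, the three-standard-deviation cutoff, and the library sizes are all hypothetical choices made for illustration. It assumes a library in which the two ends of a fragment normally map to opposite strands, a known distance apart.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped positions of the two ends of one sequenced fragment (hypothetical record)."""
    chrom: str
    left_pos: int
    right_pos: int
    left_strand: str   # "+" or "-"
    right_strand: str  # "+" or "-"


def classify_pair(pair: ReadPair, mean_insert: float, sd_insert: float,
                  n_sd: float = 3.0) -> str:
    """Label a read pair by comparing its mapped span with the library's insert size.

    Assumes both ends map to the same chromosome and that a normal pair
    maps to opposite strands.
    """
    span = pair.right_pos - pair.left_pos
    if pair.left_strand == pair.right_strand:
        # Unexpected relative orientation hints at an inversion breakpoint.
        return "inversion"
    if span > mean_insert + n_sd * sd_insert:
        # Ends map farther apart on the reference than the fragment is long:
        # the sample is probably missing sequence between them.
        return "deletion"
    if span < mean_insert - n_sd * sd_insert:
        # Ends map closer together than expected: the sample probably
        # carries extra sequence between them.
        return "insertion"
    return "concordant"


# Illustrative library: 3,000 bp fragments with a 300 bp standard deviation.
example = ReadPair("chr1", 1_000, 9_000, "+", "-")
print(classify_pair(example, mean_insert=3_000, sd_insert=300))
```

In practice a single discordant pair proves little; callers generally require several independent pairs supporting the same event, which is one reason false positives remain a concern.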
Some newer genotyping tools show particular promise, he adds. These include the SUN genotyping method, developed by Evan Eichler's group at the University of Washington, which identifies "singly unique nucleotide" positions to genotype the copy and content of specific paralogs within gene families that are highly duplicated, and the analytical software framework Genome-STRiP, developed by Harvard University's Steve McCarroll for characterizing genome structural polymorphisms using multiple types of next-generation sequencing data including read depth, read pairs, and split reads.
Korbel's own group has designed a novel computational method to analyze the depth of coverage of high-throughput DNA sequencing reads, called CopySeq. This tool can infer locus copy number genotypes by integrating paired-end and breakpoint junction analyses based on CNV-analysis approaches such as arrayCGH and FISH. In November, Korbel demonstrated CopySeq in a PLoS Computational Biology paper in which the team used it to genotype 500 chromosome 1 CNV regions in 150 genomes sequenced at low coverage and to analyze gene regions enriched for segmental duplications by comprehensively inferring copy number genotypes in the CNV-enriched olfactory receptor human gene and pseudogene loci. Using CopySeq, they found that for several olfactory receptor loci, the reference genome appears to represent a minor-frequency variant, a finding that could inform future functional studies.
As far as discovery methods are concerned, Korbel says he is waiting for a technique that can identify unique CNVs, irrespective of their sizes, as well as those in segmental duplications. "There are still regions in the genome that are very poorly understood and are hard to compare between individuals and with current technologies. We are unable to correctly resolve for these regions. ... Some of them are relevant for medicine, so that's a huge challenge," he says. "The data is good and so much is being generated by newer techniques, but we're still not fully exploring all the benefits of this data yet because we're still developing suitable methodologies that combine all types of signature signals in the data. We're obviously trying to improve this, but there's still a challenge there."
Recently, a team of researchers from Yale and Stanford University developed a method for genotyping and CNV discovery from read-depth analysis of personal genome sequencing. In February, they published a paper in Genome Research describing a method called CNVnator, which is based on a combination of the established mean-shifting approach with multiple-bandwidth partitioning and GC correction. The team used 1,000 Genomes Project validation data sets to calibrate CNVnator so it could be applied to CNV discovery, population-based genotyping, and the characterization of de novo and multi-allelic events. The team also reported its identification of six de novo CNVs in two family trios.
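The read-depth idea behind tools of this kind can be sketched in a few lines of Python. The sketch below is a simplified illustration, not CNVnator's implementation: it leaves out the mean-shift partitioning entirely, and it uses medians to rescale depth within each GC stratum, which is an assumption made here for robustness. The function names and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_correct(depths, gc_percents):
    """Rescale each bin's read depth so bins with different GC content are comparable.

    Each depth is multiplied by (genome-wide median depth) / (median depth of
    bins sharing the same GC percentage).
    """
    global_median = median(depths)
    by_gc = defaultdict(list)
    for depth, gc in zip(depths, gc_percents):
        by_gc[gc].append(depth)
    gc_median = {gc: median(values) for gc, values in by_gc.items()}
    corrected = []
    for depth, gc in zip(depths, gc_percents):
        stratum = gc_median[gc]
        # Guard against strata with no coverage at all.
        corrected.append(depth * global_median / stratum if stratum > 0 else 0.0)
    return corrected


def copy_numbers(corrected, ploidy=2):
    """Convert corrected depths to integer copy numbers, assuming most bins are at `ploidy`."""
    baseline = median(corrected)
    return [round(ploidy * depth / baseline) for depth in corrected]


# Toy data: GC-poor bins sequence at about 30x and GC-rich bins at about 20x,
# with one bin at half depth and one at one-and-a-half times depth.
depths = [30, 30, 30, 15, 30, 20, 20, 20, 30, 20]
gc = [40, 40, 40, 40, 40, 60, 60, 60, 60, 60]
print(copy_numbers(gc_correct(depths, gc)))
```

Without the GC step, the raw depth of 30 in a GC-rich bin would look like an ordinary GC-poor bin; after correction it stands out as a likely gain. Real data are far noisier, which is why calibration against validation sets matters.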
"The technology has sort of changed incrementally over the last decade, but the large data sets that we accumulated really made all the difference and allowed groups to start definitively identifying genetic factors that contribute to autism and schizophrenia," says Jonathan Sebat, an assistant professor at the University of California, San Diego.
"In 2011, the biggest game-changer is the short read sequence data, and shortly on its heels, the long-read, third-generation sequence data. The methods for detection of variants and the spectrum of potential disease alleles that you can find now is enormous, so that's a complete game-changer there."
In February, Sebat and his colleagues published a paper in Nature describing a large, two-stage genome-wide scan of rare CNVs that associated copy number gains at chromosome 7q36.3 with schizophrenia. Their findings implicate altered vasoactive intestinal peptide signaling, through the receptor gene VIPR2, in the pathogenesis of schizophrenia and indicate the VPAC2 receptor as a potential target for future antipsychotic drug development. "What's new and interesting about that is that the structural variants that we're finding contrast [with] the large microdeletion syndromes that we knew about from the early CNV studies. We're now homing in on the smaller CNVs, not the big, non-allelic homologous recombination-mediated deletions that we used to see," Sebat says. "We're now seeing structural variants that are mediated by other types of mutational mechanisms. The breakpoints are not the same in different patients; they're overlapping, but very different risk alleles. When we get our disease association, we end up finding many different rare mutations in the same gene, often with the same functional impact." He adds that down the road, new CNV findings will be used not only to pinpoint specific genes but also to identify neurobiological processes in diseases.
The University of Washington's Joshua Akey and his colleagues are refining approaches to explore patterns of genomic variation using exome sequencing, as it allows them to use data from thousands of individuals rather than from the mere handful they could afford using whole-genome sequencing. "It's really striking to be able to look at a data set of 2,000 individuals because you have such deep insight into patterns of variation and you get a real appreciation for the structure of rare variation that you can't get when you only have 20 or 40 individuals," Akey says. "One of the most interesting things that we'll be able to do with thousands of individuals is make very detailed inferences into recent human history. You can't do that unless you have thousands of individuals. For the first time, we can see these dramatic expansions in human population sizes that have occurred in the [past] couple thousand years."
Akey is involved in several structural variation research projects, including one study that looks at the genetic basis of adverse drug responses across dog breeds. He is working with his colleague Evan Eichler and Washington State University's Katrina Mealey to characterize the distribution of segmental duplications and CNVs across 20 dog breeds with arrayCGH, as it functions at a higher resolution than chromosome-based comparative genomic hybridization.
Later this year, Akey and his colleagues plan to publish what he describes as one of the largest and most comprehensive studies of patterns of human genetic variation, using high-quality data from roughly 2,000 exomes. Although the rise of exome sequencing has undoubtedly caused excitement and heightened expectations within the structural variation research community, he cautions that the real insights are only going to come from taking a step back and determining how to interpret and compare sequences from that many individuals. "There's a critical need for further methodological development to be able to fully extract all of the information in these complex data sets and the challenge is that there are so many challenges," he says. "Let's assume that the genotypes we have are accurate: what do we do with that data in terms of making inferences about human history and about disease susceptibility? What's the best way to test for association between rare variants and disease? What's the best way to look for natural selection? There are challenges from the very beginning of the process to the very end of the process. A lot of theoretical work needs to be developed to fully exploit the information that CNVs have."
While the literature contains a growing number of studies that demonstrate associations of common simple CNVs with specific disease susceptibilities, forming a substantial collection of common CNVs, the issue of resolution still hinders researchers who aim to study rare CNVs. "I think we have a very nice catalog of common copy number variants and we have methodologies to pick up the rare CNVs, although not as high resolution as I'd like to see, but it's the cost effective way of doing it," Harvard's Lee says. "We have 18 to 20 of these very clear associations — these are deletions that increase your susceptibility with more common disease — and I think there are more to come. But the issue we have right now is that we don't have a catalog for the rare variants and the smaller ones. Once we start to develop those catalogs, we can start to improve on our arrays, or whatever method we use to detect CNVs in the disease association studies, to see if any of those rare, smaller CNVs are associated with other diseases."
NCI's Chanock, who is also a physician, cautions that the community should be realistic with regard to the potential for all this structural variation data to facilitate improvements in the clinic. "We've started to make very important steps, and when we look at the age of CNVs and a good part of the sequencing that's going on, the discovery element is spectacular, almost unprecedented," he says. "The plausibility and the meaning of this discovery is complex: each one of these regions requires its own study and it's still a work in progress to reach the level of confidence and validity that's needed to incorporate that into our clinical workflow. We have to be careful with all the ballyhooing about 'The genomic age is going to turn everything into Star Trek medicine,' because I find this dangerously naïve."