To hear Charles Lee, a Harvard Medical School cytogeneticist, tell it, he and his colleagues came upon the notion of copy-number variants more or less by accident.
Not that researchers hadn't previously noticed these genetic elements — roughly defined as DNA segments occurring in different numbers across different genomes. In the early 1990s, for instance, Baylor College of Medicine professor James Lupski explored the role of CNVs in Charcot-Marie-Tooth neuropathy, identifying duplications linked to the most common form of the disorder.
Early CNV work, though, generally presumed that such variants were rare, deleterious events, usually linked to disease. This understanding, Lee says, held until 2004, when a pair of papers — one by Lee and his collaborators in Nature Genetics and another in Science by Cold Spring Harbor Laboratory researcher Michael Wigler and colleagues — appeared a week apart, both demonstrating that CNVs were widespread throughout the human genome and likely a significant source of natural genetic variation.
"Prior to 2004, [copy number] gains and losses were thought to be very rare and associated with highly penetrant genomic disorders," Lee says. "It wasn't until 2004 [that] our group and the group of Mike Wigler published papers back to back showing that there are a lot of gains and losses in the human genome in healthy individuals."
Not, however, that this was what they'd set out to do. Rather, Lee recalls, he and his co-authors were simply doing control experiments comparing the genomes of healthy individuals to one another using array-comparative genomic hybridization.
"Everything we had been taught up to that point about healthy individuals was that they would not exhibit any gains and losses when you compared their genomes to one another, because these [gains and losses] are associated with disease states," he says.
Surprisingly, though, the researchers found widespread gains and losses across the samples they were testing.
Similar results had been previously reported, but dismissed as artifacts of the arrays being used. And so, Lee says, he and his colleagues set out to determine the cause of those artifacts.
"Was it specific sequences that were being spotted on the arrays that were causing it? Were there motifs in certain sequences that were causing it? The first thing we did when we saw these gains and losses was try to validate them to understand more about the underlying sequences," he says. "And that's when it turned out that we had quite a bit of evidence that in fact these were not artifacts, but true gains and losses in each person's genome."
Today, CNVs are considered one of the major forms of genetic variation, implicated in a variety of diseases, including autism, Crohn's disease, and schizophrenia, and viewed as a potential key mechanism of adaptive evolution.
Stephen Scherer, a researcher at Toronto's Hospital for Sick Children and co-author with Lee on the 2004 Nature Genetics article, notes that a 2010 Genome Biology paper published by his group found that roughly 10 times more nucleotides are affected by CNVs than by SNPs.
In the study, Scherer's graduate student Andy Pang examined "Craig Venter's genome, which has been sequenced and run on every different microarray known," Scherer says. "[Pang] comes up with the most complete annotation of copy number structural variation in Venter's genome, and it turns out that there's roughly 40 megabases of copy-number variation compared to 3.2 million nucleotides [of] SNP variation."
"Really it was just an issue of having technologies to find this copy-number variation," he adds.
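The "roughly 10 times" figure can be checked against the two numbers Scherer quotes for the Venter genome. The short calculation below is only a back-of-the-envelope sketch; the variable names are illustrative, and both inputs are the rounded values from the quote rather than figures taken from the paper itself.

```python
# Rounded figures quoted by Scherer for the Venter genome annotation.
cnv_bases = 40_000_000  # "roughly 40 megabases" of copy-number variation
snp_bases = 3_200_000   # "3.2 million nucleotides" of SNP variation

# Ratio of nucleotides affected by CNVs to nucleotides affected by SNPs.
ratio = cnv_bases / snp_bases
print(f"CNV-affected bases per SNP-affected base: {ratio:.1f}")
```

With these rounded inputs the ratio works out to 12.5, in line with the order-of-magnitude difference described above.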
Arrays' staying power
Improvements in microarray resolution have led to the increase in CNV identifications, Scherer says. He notes that, while next-generation sequencing has taken off in many areas of genomic research, CNV work is still dominated by arrays.
"This is a really interesting issue," Scherer says, "because next-generation sequencing has the promise to detect these things in totality because you're getting whole genome sequence coverage. But the reality is that it is very, very hard to annotate within those whole genome sequences where the CNVs are because of the nature of the short reads that you get with next-generation sequencing."
"So in our experience — and this is really the experience of the literature, too — you can find small indels up to about 50 nucleotides quite readily using next-generation sequencing," he adds. "But above and beyond that, it's very difficult. So there needs to be a lot more development of new tools and algorithms to mine [CNVs]."
Primarily, this is an informatics problem, Scherer says. "The data is there [in the next-generation sequences], but it's making sense of it" that is the challenge, he adds.
Harvard's Lee also notes that "a lot of these gains and losses lie in complex portions of the genome, and, up to this point, a lot of the next-generation sequencing data is short read — 100, 120 base pair reads which then need to be put together. And that's not easy to do in complex regions of the genome."
Scherer offers an example: "Say you have a 400 or 500 nucleotide deletion," he says. "It's hard [using next-generation sequencing] to string those together to get statistical probability to prove that, in fact, what you have is a deletion versus just sequence coverage issues."
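Scherer's example can be illustrated with a toy read-depth calculation. The sketch below is not a method described by any of the researchers quoted here. It assumes, for simplicity, that the number of reads starting in a window follows a Poisson distribution, that a deletion on one of the two chromosome copies halves the expected count, and that reads are 100 bases long. The function names and the sequencing depths are hypothetical choices made for illustration.

```python
import math


def poisson_cdf(k, mean):
    """Return P(X <= k) for X ~ Poisson(mean), summed term by term."""
    term = math.exp(-mean)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i    # P(X = i) from P(X = i - 1)
        total += term
    return total


def expected_reads(depth, window_bp, read_bp):
    """Expected number of reads starting in a window at a given mean depth."""
    return depth * window_bp / read_bp


def chance_dropout_probability(depth, window_bp, read_bp):
    """Probability that a normal two-copy window shows half its expected
    read count or fewer purely by chance, under the Poisson assumption."""
    normal = expected_reads(depth, window_bp, read_bp)
    threshold = math.floor(normal / 2)  # count expected with one copy deleted
    return poisson_cdf(threshold, normal)


# Assumed values: a 500-nucleotide window, 100-base reads, two mean depths.
for depth in (4, 30):
    p = chance_dropout_probability(depth, window_bp=500, read_bp=100)
    print(f"{depth}x coverage: P(looks like a deletion by chance) = {p:.3g}")
```

Under these assumptions, a normal 500-nucleotide window at low coverage looks like a one-copy deletion by chance far more often than it does at high coverage, and a genome contains millions of such windows. Real coverage is also more uneven than a Poisson model suggests, which makes the distinction between a true deletion and "just sequence coverage issues" harder still.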
"It's just very early days [for next-generation sequencing-based methods]," he notes. "We went through the same thing in the microarray days, and I think it's going to resolve. It's just that we're not there yet."
Matthew Hahn, associate professor of biology and informatics at Indiana University, says that he sees next-generation sequencing gaining traction in the CNV field, but likewise observes that read lengths and the accompanying informatics challenges remain issues, particularly in human samples.
"In humans, the biggest problem is actually the quality of the genome assembly," he says. "You can only really tell if there's a copy-number variant if you have a good reference assembly, and it turns out that some of the places in the genome that are the least well assembled happen to be the most copy number variable."
He adds that one solution to this problem might be using sequencers capable of producing longer reads, such as those made by Pacific Biosciences. Such technology "certainly holds the promise of doing things much better in things like humans," he says.
In fact, Scherer adds, some groups are using PacBio's instruments for this purpose, but, he notes, "the problem there is that they are getting longer reads, but the [platform's] accuracy is not high."
These longer reads are, however, "sufficient to use as a kind of scaffolding to tie on some shorter Illumina or Life Tech sequence reads," Scherer says. "So there have been some papers published where groups are using these longer PacBio reads" in conjunction with shorter reads.
Still, he says, arrays remain the primary tool for CNV studies, with Illumina and Affymetrix the main players in the research space and the bulk of current diagnostic work being done via comparative genomic hybridization using Agilent-based platforms.
Aside from arrays and sequencing-based approaches, Lee says he also finds certain optical mapping approaches under development by firms including OpGen and BioNano Genomics and by University of Wisconsin-Madison genetics professor David Schwartz to be interesting.
"I guess because I'm a cytogeneticist and very visual, I've been keen on any sort of optical-based technology that could complement arrays and next-generation sequencing for structural variants," he says.
Optical mapping technologies, Lee notes, essentially roll the DNA out onto a microscope slide "and then hybridize to it multiple different colored probes so you can visualize deletions and gains and duplications and how they are organized in the genome."
In addition to helping detect CNVs, such techniques could also aid scientists in localizing them and better understanding their structure in the local environment, he says. "For example, you could have an individual who has two [extra] copies of [a given] gene, but by [optical-based techniques] you may actually find out that there are two copies on one chromosome and zero copies on the other, and that could be functionally significant," Lee says.
A drawback of such methods, however, is their low throughput. "We can use [them] to interrogate part of a genome, but if you want to look at thousands of CNVs in a single individual, that takes a lot of effort," he adds.
In the clinic
CNVs have emerged as important clinical features, Scherer says, noting that, at the Hospital for Sick Children, clinicians run 5,000 to 6,000 clinical microarrays per year examining children with intellectual disabilities and congenital problems.
The practice has become standard across North America, he adds. "If there is a new diagnosis or a diagnosis where doctors don't really know what it is, they will run clinical microarrays as a first line of testing," he says.
Lee suggests that more than being a black-or-white identifier of a given disease, CNVs could prove clinically valuable as indicators of increased susceptibility to certain disorders.
"For example," he says, "we now know that what we call copy-number variant load is associated with increased susceptibility to [disorders including] autism, schizophrenia, attention deficit/hyperactive disorder. ... I think this is a very exciting area where we are rapidly identifying associations of specific CNVs or ... an increase in CNVs in individuals with those diseases."
CNVs might also help account for the heritability of certain complex diseases, Indiana's Hahn says, calling this, in his opinion, one of the most interesting implications of research into these variants.
"People have suggested that CNVs might be a cause of the missing heritability [problem] — the problem that we don't seem to be able to map variants responsible for some of these very complex diseases," he says. "And so the reason would be maybe that these mutations [responsible for these diseases] can occur over and over again, the same CNV can occur multiple times."
CNVs also have potentially significant implications for understanding evolutionary processes, Lee notes, citing a recent paper that he and his fellow Harvard researchers Rebecca Iskow and Omer Gokcumen published in Trends in Genetics in which they suggest that, although it is still early days in terms of understanding the role of the variants in evolution, "the scientific community should anticipate the increasingly accurate discovery and analysis of CNVs, which, in turn, will highlight new regions of the human genome affecting adaptation."
Additionally, in the paper, the authors listed more than 100 human CNVs that are potentially under positive selection. "Definitely there are examples of CNVs that appear to be under positive selection," Lee says, with more likely to come.