By Monica Heger
Using sequencing to call structural variants has proven to be more complicated than calling SNPs, but two separate studies published this week in Nature Genetics indicate that sequencing technologies can help provide new insight into human copy number variation.
In one study, led by the Wellcome Trust Sanger Institute, researchers used sequencing to identify two separate mutational trends in over 300 CNVs. In the other, researchers from the Genomic Medicine Institute in Seoul, Korea, used a mix of sequencing and microarrays to identify CNVs that appear to be specific to Asian populations.
Many researchers believe that a better understanding of CNVs will have important implications for health and personalized medicine, but so far, "we really have a limited knowledge of the forces driving mutation and evolution in the genome," said Jonathan Sebat, chief of the Beyster Center for Molecular Genomics of Neuropsychiatric Diseases at the University of California, San Diego, who was not involved with either study. "So, cataloging the mutations, and understanding the footprints [they leave], will give you clues."
In the study by the Sanger Institute, the researchers sequenced the breakpoints of 324 CNVs from three HapMap samples. In an approach similar to exon capture, they used a DNA capture array from NimbleGen to capture genomic regions known to harbor the breakpoints and the Roche 454 GS FLX with Titanium reagents for sequencing. They generated average read lengths of around 300 bases and 290,808 nonredundant mappable reads.
The approach had its limits, however. While the researchers targeted just over 1,000 different regions with potential CNV breakpoints, they successfully captured and sequenced only 324 of those breakpoints. Refinements to the mapping approach or to the capture probes could raise that number, the authors wrote. Also, they captured mainly deletions, even though they had predicted that around 20 percent of the rearrangements would be duplications.
"A modified strategy for capturing duplications — by targeting additional sequence reads within the breakpoints and using de novo assembly of all targeted reads — seems particularly appropriate, considering the enrichment of repetitive contexts at duplication breakpoints," they wrote. They also noted that their method did not capture more complex rearrangements, or breakpoints embedded in repeats larger than 300 base pairs.
Charles Lee, associate professor of pathology at Harvard Medical School and Brigham and Women's Hospital and an author on both papers, agreed that the method could be improved but said that it should prove useful until the costs of whole-genome sequencing drop further.
"I think there is still a lot of improvement that can be made in this technique of pulling down sequences and enriching them for sequencing," Lee said. "Sometimes it's nice to be able to target [a small fraction] of the genome and sequence it at a very high depth to make sure you're not missing anything."
For the 324 CNVs whose breakpoints the researchers were able to sequence, they observed two major mutational trends. Seventy percent of the deletion breakpoints had between one and 30 base pairs of microhomology, while 33 percent of the deletion breakpoints contained up to 367 base pairs of inserted sequence. There was also little overlap between the two trends: only 10 percent of the breakpoints contained both microhomology and inserted sequence.
"It's interesting that there was not a lot of overlap between the two flavors of mutation," said UCSD's Sebat. "It tells you that there are two different mutational mechanisms at work."
Sebat added that understanding these mechanisms will be important for understanding how and why structural mutations occur, which will eventually have implications for disease.
For instance, diseases like autism and schizophrenia are often characterized by de novo mutations, he said. "There are interesting factors that may influence the frequency of these types of rearrangements, and understanding the mutational mechanism will help us understand those influences."
Lee said that there are three major types of mutational mechanisms: those based on homologous recombination, those based on nonhomologous recombination, and those arising from errors in DNA replication. The observed microhomology is indicative of nonhomologous rearrangements, Lee said. "These trends are nice to see; they help us to understand and classify these different CNVs better," he added.
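To make the microhomology measurement concrete, here is a minimal Python sketch of one common way to count it for a simple deletion: the number of bases by which the deleted interval can be shifted left or right along the reference without changing the resulting sequence. The function name, the 0-based half-open coordinates, and the toy sequences are assumptions for illustration; the article does not describe how the Sanger team computed microhomology.

```python
def microhomology_length(ref, start, end):
    """Count bases of microhomology at a deletion of ref[start:end].

    Coordinates are 0-based and half-open (an assumption of this sketch).
    The result is how far the deleted interval can slide right plus how
    far it can slide left while still yielding the same deleted allele.
    """
    # Slide right: first deleted bases match the bases just after the deletion.
    right = 0
    while (end + right < len(ref)
           and start + right < end
           and ref[start + right] == ref[end + right]):
        right += 1

    # Slide left: last deleted bases match the bases just before the deletion.
    left = 0
    while (start - 1 - left >= 0
           and end - 1 - left >= start
           and ref[start - 1 - left] == ref[end - 1 - left]):
        left += 1

    # Microhomology cannot be longer than the deletion itself.
    return min(left + right, end - start)


# Toy reference: "GAC" occurs at both edges of the deleted region.
toy_ref = "CCGACTATGACCGG"
print(microhomology_length(toy_ref, 2, 8))    # deletion written as GACTAT
print(microhomology_length(toy_ref, 5, 11))   # same event written as TATGAC
print(microhomology_length("AACCGGTT", 2, 4)) # blunt junction
```

Counting in both directions means the answer does not depend on which of the equivalent coordinates a caller happens to report for the same deletion.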
Population-Specific CNVs
In the second study, researchers combined array-CGH data with whole-genome sequencing data to generate a catalog of CNVs common in the Asian population. They performed array-CGH analysis on 30 individuals of Korean, Chinese, and Japanese origin. They also performed whole-genome sequencing on a randomly chosen European HapMap individual and on an Asian individual, and used data from a previously sequenced Asian genome. The combination of whole-genome sequence data and array-CGH data allowed the researchers to determine absolute copy number values.
"We realized that array-CGH was a very cost-effective way to scan the genome for CNVs, but there is also a limitation — it provides relative copy number information, because it is with respect to a reference genome, and a lot of times, the reference genome is a European male," said Harvard's Lee. "So, we need to convert this relative CNV information into absolute CNV information" by using information from both the reference genome and an additional Asian genome.
For the two individuals that they sequenced for this study, the researchers used a paired-end sequencing strategy on Illumina's Genome Analyzer with read lengths of 36, 76, and 101 base pairs, and they sequenced the Asian genome to 32-fold coverage and the European HapMap genome to 28.3-fold coverage.
By combining array-CGH data with sequencing information, the researchers were able to reduce an initial 250,000 putative CNV segments in the 30 Asian individuals to around 21,000. Further analysis revealed that 6,000 of those reflected CNVs in the reference individual rather than in the studied individuals, who were diploid in those segments.
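The conversion from relative to absolute copy number that Lee describes can be illustrated with a short Python sketch. It assumes a simple model that the article does not spell out: sequencing depth scales linearly with copy number, and an array-CGH log2 ratio compares the test sample with the reference sample. Both function names and all numbers are hypothetical and are not taken from the paper.

```python
def copies_from_read_depth(segment_depth, genome_depth, ploidy=2):
    """Estimate the reference sample's absolute copy number in a segment.

    Assumes depth is proportional to copy number, so a segment covered at
    half the genome-wide mean depth holds one copy in a diploid genome.
    """
    return round(ploidy * segment_depth / genome_depth)


def absolute_copy_number(log2_ratio, reference_copies):
    """Convert an array-CGH log2(test / reference) ratio to absolute copies.

    The ratio is undefined when the reference carries zero copies, so this
    sketch is only meaningful for reference_copies >= 1.
    """
    return round(reference_copies * 2 ** log2_ratio)


# Hypothetical segment: the reference sample has about half the mean depth,
# which suggests a one-copy deletion in the reference itself.
reference_copies = copies_from_read_depth(segment_depth=14.2, genome_depth=28.3)

# A test sample with log2 ratio +1.0 looks like a "gain" on the array,
# yet its absolute copy number is the ordinary diploid value.
test_copies = absolute_copy_number(log2_ratio=1.0, reference_copies=reference_copies)
print(reference_copies, test_copies)
```

Under these assumptions, the example shows how an apparent gain on the array can turn out to be a deletion in the reference individual while the studied individual is diploid.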
The researchers found that "the average lengths of CNV segments with copy number losses and gains were 11.8 kilobases and 30.3 kilobases, respectively. In genic regions, copy number gains were more frequent than copy number losses." They also found an average of 670 CNVs per individual.
To find Asian-specific CNVs, the researchers first grouped the CNVs with greater than 50 percent overlap between segments, and found 5,177 CNVs with a median size of 2,667 base pairs. They then compared those CNVs to 4,978 CNVs identified by a recent study of European and African individuals, and found 3,547 putative Asian-specific CNVs.
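The grouping and comparison steps can be sketched in Python as follows. This is a simplified illustration that rests on several assumptions: the article does not say whether the 50 percent overlap was reciprocal, how groups were seeded, or which criterion was used when comparing against the other study's calls. The sketch uses reciprocal overlap, a greedy pass over sorted segments, and the same threshold for both steps. Segments are hypothetical `(chromosome, start, end)` tuples.

```python
def reciprocal_overlap(a, b):
    """Return the smaller of the fractions of a and b covered by their overlap."""
    if a[0] != b[0]:  # different chromosomes never overlap
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def group_segments(segments, threshold=0.5):
    """Greedily group sorted segments that overlap a group's first member."""
    groups = []
    for seg in sorted(segments):
        if groups and reciprocal_overlap(groups[-1][0], seg) > threshold:
            groups[-1].append(seg)
        else:
            groups.append([seg])
    return groups


def population_specific(groups, other_calls, threshold=0.5):
    """Keep groups whose first member matches no call in another call set."""
    return [
        group for group in groups
        if not any(reciprocal_overlap(group[0], other) > threshold
                   for other in other_calls)
    ]


# Toy data: two near-identical calls, one mostly distinct call, one on chr2.
calls = [("chr1", 100, 200), ("chr1", 120, 210),
         ("chr1", 190, 400), ("chr2", 100, 200)]
groups = group_segments(calls)
specific = population_specific(groups, other_calls=[("chr1", 110, 205)])
print(len(groups), len(specific))
```

A production pipeline would more likely rely on an interval tree or an established interval tool than on this linear scan, but the scan keeps the overlap criterion easy to see.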
"I was a little surprised that we found upwards of 3,000 CNVs that seemed to be Asian-specific," said Lee. "It reinforces the fact that we need to explore these types of variants in more populations."
Lee added that characterizing CNVs in different populations will have important implications for disease research and personalized medicine. For example, the team found Asian-specific CNVs in genes that have been reported to be involved in type 2 diabetes, myocardial infarction, and cancer.
"It provides investigators potential candidate regions that they may be able to target in disease cohorts. If gene A is CNV in the Asian populations and we have an Asian genome-wide association study with a disease, they can target that gene," he said.
Lee added that while the two studies were different — one gave a detailed look at specific known CNVs, while the other attempted to create a catalog of CNVs — they both demonstrated the importance of rearrangements for understanding mutations and disease.
"SNPs aren't the only genetic variants out there," he said.