By Julia Karow
Researchers at Cold Spring Harbor Laboratory have developed a new computational approach that relies on the depth of coverage of short-read sequence data to detect copy number variants, and have used it to analyze CNVs on chromosome 1 in five individuals.
The scientists, who published their method online in Genome Research last month, found that they were able to capture structural variants that other approaches, such as paired-end mapping and microarrays, missed. They plan to use their method, in combination with others, to analyze CNVs in large-scale human genome studies such as the 1000 Genomes Project and a planned large-scale schizophrenia genome study.
Jonathan Sebat, an associate professor at Cold Spring Harbor Lab, is the senior author of the paper and a member of the data analysis group for the 1000 Genomes Project. "As part of that project, one of the biggest challenges is to develop ways to detect structural variants in the genome sequence data," he told In Sequence last week.
Other groups, he said, including those of Evan Eichler at the University of Washington and Mike Snyder and Mark Gerstein at Yale University, previously developed methods to determine structural variations from paired-end sequence reads that map to unexpected locations in the genome.
Sebat's own group, like others, has been developing new methods, based on read-depth measurements, to capture classes of structural variation that might otherwise be missed. "We were looking for an alternative to paired-end read mapping because any individual method is going to be subject to its own limitations," he said.
According to Lisa Brooks, director of the genetic variation program at the National Human Genome Research Institute, which funds the 1000 Genomes Project, several groups participating in the project are analyzing structural variants and comparing methods. “Different methods are best for different sizes or types of structural variants. Paired-end methods are good for large SVs, but miss smaller ones. The method in this paper is good for finding smaller CNVs but not tiny ones,” she told In Sequence via e-mail.
The approach Sebat and his colleagues came up with — called event-wise testing, or EWT — relies on read depth from short-read sequence data. "You essentially make quantitative measurements of DNA copy number at regular intervals across the genome by counting the number of reads — you measure coverage," he said. "Then you compare different genomes to identify the regions where the regional copy number is different in one genome compared to another."
"Coverage at a specific location in the genome is essentially a measure of DNA copy number," he added, as long as the distribution of reads across the genome is relatively unbiased, which he said was the case with the data his group used in the study.
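The idea Sebat describes, counting reads at regular intervals and comparing two genomes, can be sketched roughly as follows. This is only an illustration under stated assumptions and not the EWT algorithm itself: the function names, the 100-base-pair window size, and the pseudocount are all hypothetical choices, and the statistical test that EWT applies to runs of windows is not shown.

```python
import math


def window_counts(read_starts, chrom_length, window_size=100):
    """Count reads whose start position falls in each fixed-size window.

    read_starts: 0-based mapped start positions on one chromosome.
    window_size: hypothetical interval width, not a value from the article.
    """
    n_windows = math.ceil(chrom_length / window_size)
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // window_size] += 1
    return counts


def log2_ratios(sample_counts, other_counts, pseudocount=0.5):
    """Per-window log2 depth ratio between two genomes.

    Counts are scaled by each genome's total so that different overall
    sequencing depths do not look like copy number differences. The
    pseudocount (an assumption) avoids taking the log of zero.
    """
    sample_total = sum(sample_counts)
    other_total = sum(other_counts)
    if sample_total == 0 or other_total == 0:
        raise ValueError("both genomes need at least one mapped read")
    ratios = []
    for s, o in zip(sample_counts, other_counts):
        s_norm = (s + pseudocount) / sample_total
        o_norm = (o + pseudocount) / other_total
        ratios.append(math.log2(s_norm / o_norm))
    return ratios


# Made-up counts: the third window has half the depth in genome A,
# which is what a heterozygous deletion might look like.
genome_a = [10, 10, 5, 10]
genome_b = [10, 10, 10, 10]
ratios = log2_ratios(genome_a, genome_b)
```

In this toy input the third window has a clearly negative log ratio while the others sit slightly above zero. A real caller would still need to decide which departures are statistically meaningful.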
The method accounts for the slight GC bias of Illumina's Genome Analyzer, the only platform whose data he and his colleagues have analyzed with EWT so far. "We used a pretty simple correction to adjust the actual read depth based on GC [content]," he said.
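The article does not spell out the correction, so the following is a sketch of one simple and commonly used option, offered as an assumption rather than as the group's actual procedure: rescale each window's read count by the ratio of the overall median count to the median count of windows with similar GC content. The function names and the number of GC bins are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_fraction(seq):
    """Fraction of G and C bases in a window's reference sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_correct(counts, gc_fractions, n_bins=20):
    """Rescale window read counts so every GC bin has the same median.

    counts: read count per window.
    gc_fractions: GC fraction (0 to 1) of each window, same order.
    n_bins: hypothetical number of GC bins used to group windows.
    """
    def bin_of(gc):
        # Clamp so that a GC fraction of exactly 1.0 lands in the last bin.
        return min(int(gc * n_bins), n_bins - 1)

    overall_median = median(counts)
    by_bin = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        by_bin[bin_of(gc)].append(count)
    bin_median = {b: median(values) for b, values in by_bin.items()}

    corrected = []
    for count, gc in zip(counts, gc_fractions):
        m = bin_median[bin_of(gc)]
        # Leave the count unscaled if the bin median is zero.
        corrected.append(count * overall_median / m if m > 0 else float(count))
    return corrected


# Made-up data: GC-rich windows are over-represented before correction.
raw = [10, 10, 20, 20]
gc = [0.3, 0.3, 0.6, 0.6]
adjusted = gc_correct(raw, gc)
```

After correction, the toy windows all have the same depth, so the remaining differences between windows in real data would be more likely to reflect copy number than base composition.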
In their paper, they applied their method to analyze CNVs in chromosome 1 of five individuals, sequenced previously on the Illumina GA with 30-fold paired-end shotgun reads. The data came from three CEU HapMap samples of European ancestry, sequenced by the 1000 Genomes Project as part of its pilot phase; the Yoruba HapMap NA18507 genome sequenced by Illumina and published last year; and a Chinese genome sequenced by the Beijing Genomics Institute and also published last year (see In Sequence 11/11/2008).
In total, they found between about 400 and 1,700 CNVs per individual. They validated these calls by comparing them to a set of common CNV regions larger than 500 base pairs that has been provisionally released by the Genome Structural Variation Consortium, which analyzed 40 HapMap samples using a set of 20 NimbleGen CGH arrays with a total of 42 million probes.
One observation the researchers made is that the majority of the CNVs they detected were "monomorphic" in the five samples, meaning they differed from the reference genome in the same way. "The reference genome does not necessarily represent what's present in most chromosomes in the population," Sebat explained. "Additional effort has to be applied in the field to map out the structural variant that's actually presented in the majority of chromosomes in the population."
Interestingly, Sebat and his team found that their read-depth-based approach called a largely different set of CNVs than paired-end mapping approaches did on the same data. For example, in the Yoruba genome sequenced by Illumina, "the majority of the variation was unique to each call set," he said. "That taught us that the two methods are looking at very different CNVs."
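A comparison of two call sets like the one described here comes down to asking which calls from one method overlap any call from the other. The sketch below shows that bookkeeping on made-up intervals. The any-overlap criterion and the function names are assumptions for illustration, and the paper may have used a stricter rule such as a minimum reciprocal overlap.

```python
def overlaps(a, b):
    """True if two half-open intervals (start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def split_by_overlap(calls, other_calls):
    """Split calls into those shared with and those unique against other_calls."""
    shared, unique = [], []
    for call in calls:
        if any(overlaps(call, other) for other in other_calls):
            shared.append(call)
        else:
            unique.append(call)
    return shared, unique


# Hypothetical CNV calls on one chromosome, as (start, end) coordinates.
read_depth_calls = [(100, 500), (1000, 3000), (8000, 9000)]
paired_end_calls = [(400, 450), (5000, 5200)]

shared, unique = split_by_overlap(read_depth_calls, paired_end_calls)
```

Running the same function with the arguments swapped gives the calls unique to the paired-end set, which is how one would see that most variation is "unique to each call set."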
Compared to microarrays, both sequencing-based approaches are “much more sensitive” for detecting copy number variations smaller than 1,000 base pairs, which make up the majority of CNVs, he said.
In particular, the researchers found that their own method excelled at detecting variants in complex regions of the genome that were rich in segmental duplications. Those regions are tough for paired-end approaches because the paired reads often map to multiple locations, he said. On the other hand, paired-end mapping methods were better at detecting very small deletions.
Read-depth analysis is limited when it comes to balanced rearrangements, highly repetitive sequences, the precise location of insertions, or finding novel insertions, according to the paper.
Sebat's conclusion is that a single approach is not enough to map CNVs comprehensively. "I think it's clear that you have to use a combination of methods," he said. "It's likely that you would want to use a paired-end read mapping algorithm and an algorithm that's based on read depth."
He said a beta version of his CNV-calling algorithm can be used by other researchers now, and he plans to make the final version available before the end of the year.
His group is not the only one to develop a read-depth-based approach to detect CNVs: Researchers at Johns Hopkins University published a study in 2002 in which they used short sequence tags, generated by Sanger technology at the time, to analyze copy number variants in human cancer cells.
Also, researchers at the Broad Institute published an algorithm called SegSeq last year that they used to identify CNVs in tumor DNA from short reads generated by Illumina's Genome Analyzer (see In Sequence 12/9/2008).
Earlier this year, scientists from Singapore published CNV-seq, another method that uses read depth from high-throughput sequencing data to detect copy number variations, and a few days ago, researchers from the University of Washington published a mapping algorithm, called mrFAST, that allows them to estimate absolute copy number differences between genomes (see In Sequence’s sister publication, GenomeWeb Daily News, 8/31/2009).
Though Sebat's study used only data from Illumina's GA, the group's method could also be used on data from other sequencing platforms. However, large numbers of short reads are preferable to smaller numbers of longer reads, the researchers found, somewhat to their surprise. "We had previously thought that longer reads are always better, but that's not necessarily true for structural variants," Sebat said. "For structural variants, the paired-end read mapping and the read-depth approaches are more powerful when you have more reads, not longer reads."
That does not hold for every kind of structural variant, however, and indels are one exception. "Probably for indels, longer reads are better," depending on the indel-calling algorithm used, he said.
Microarrays, he predicted, will eventually be replaced by sequencing methods as they become more reliable. "Genome sequences are a much richer dataset," Sebat said, and combining different analysis methods for structural variants, such as read-depth and paired-end read mapping as well as indel-calling algorithms "will give you much better sensitivity than any microarray."
Nevertheless, "microarrays are still the major workhorse in our large-scale genomic studies because they have reached the point where doing a microarray scan of SNPs or CNVs can be done for a few hundred dollars a genome," he said. "Knowing that large sample sizes are required for finding what you are looking for, you have to use a method that's cheap and reliable enough to be applied to tens of thousands of samples. That's not currently the case for sequencing technology — we hope that it soon will be."
Sebat and his colleagues plan to apply a combination of their own method and others to detect structural variants in large-scale sequencing projects, including the 1000 Genomes Project and a planned effort to sequence complete genomes of schizophrenia patients.
"When we are able to collect complete genome sequences on thousands of schizophrenia patients and thousands of controls, these types of approaches will be an essential part of our pipeline," he said. "Genome sequencing in psychiatric disorders is a major focus here at Cold Spring Harbor, and is something that we hope to do on a large scale."
In general, structural variants "are a major source of genetic variation, and it's clear that they play a role in human disease," he said. "Any tool you can develop that will extract more genomic information will improve your ability to find disease-causing mutations."