Research initiatives such as the 1,000 Genomes Project have started to use new high-throughput sequencing technologies to map structural variations in individual human genomes and to detect SNPs and small indels.
But some scientists caution that the new sequencing platforms may miss a significant fraction of structural variants, and that older-generation technologies such as fosmid-based sequencing and microarrays will still be needed, at least for the time being.
“We have got all those new genomes being generated by the 1,000 Genomes Project, and you are going to get all this new variation [information] coming from that, but the question is, ‘What are you missing, and what are you actually capturing?’” said Evan Eichler, an associate professor in the department of genome sciences at the University of Washington in Seattle.
Three weeks ago, he and his colleagues published a study in Nature in which they analyzed intermediate-sized insertions, deletions, and inversions in eight HapMap individuals using a fosmid-based Sanger sequencing approach.
That study, representing the first phase of the Human Genome Structural Variation Project (see In Sequence 5/15/2007), found almost 1,700 structural variants that were larger than about 5 kilobases: 747 deletions, 724 insertions, and 224 inversions. Among them were approximately 500 new insertion sequences that were not present in the human reference genome, many of which differ in copy number between individuals. Although these represent less than 0.1 percent of the euchromatin, they may be important for disease associations, according to Eichler.
Fewer than 50 percent of the 1,700 variants overlapped with the 1,300 structural variants identified by a team led by Mike Snyder at Yale University last year, Eichler noted, even though the two studies had one sample in common. The Yale study used paired-end 454 sequencing to discover variants larger than 3 kilobases in two samples, one of them a HapMap sample also studied by Eichler’s group (see In Sequence 10/2/2007).
Because the Yale researchers used smaller inserts, they captured smaller structural variations that Eichler’s team missed. But for two reasons, he said, they were unable to detect many of the larger variants that his team picked up: their sequence coverage was lower, and the short 100-base paired-end 454 reads gave them “less power to map in repetitive or complex regions of the genome.” Although his team is still confirming this, he said that the 454 sequencing approach is “very opportunistic; you get what you get.”
As a consequence, “we think a lot of this complex variation that we found will not be identified by just sequencing genomes with next-gen [sequencing] technology,” he said.
Snyder agreed that his team’s 454 approach makes it more difficult to sort out complex regions of the genome, but said that Eichler’s clone-based method is “much more expensive” and labor-intensive, he told In Sequence by e-mail last week.
Sequencing technologies weren’t the only tools that failed to detect all of the structural variants that clone-based sequencing picked up. Half of them, Eichler pointed out, were not captured by “the biggest and the best” high-density SNP arrays from Affymetrix and Illumina, either, a potentially important finding for disease-association studies.
“[If] even with all those SNPs you cannot capture 50 percent of this variation, [this] means that we need to have new technologies, new designs. We need to have more sequence information to be sure of what’s there and what’s not there,” Eichler said. “You cannot find association with disease if you don’t have your markers.”
Conversely, Eichler’s fosmid-based sequencing approach also missed variations. Besides lacking those variants that are smaller than about 5 kilobases, the method under-represents areas with perfectly identical duplicated sequences “because we would not be able to place our end sequences to a best location,” Eichler said. Array-CGH could “in principle” find those, provided the copy number is not too high.
For now, he said, researchers will need to employ a hybrid approach to capture all sequence variation, including second-generation sequencing, microarrays, and clone-based paired-end sequencing.
Others share Eichler’s view that next-generation sequencing alone is not sufficient. “I am a strong proponent that for at least the next few years, to really get an accurate and usable template sequence, you will need to couple [next-generation sequencing] sequence with some high-resolution array work to guide assembly and make sense of many regions of the genome,” said Steve Scherer, a senior scientist in the department of genetics and genomic biology at the Hospital for Sick Children and a professor of molecular and medical genetics at the University of Toronto.
“We read a lot of the hype coming from the NGS companies and do indeed share their enthusiasm for the impact their new technologies will have, but it does need to be tempered with reality,” he said.
As of today, he cautioned, there is no data to support the claim that complete genomes can be generated by the current new sequencing technologies. “I think this will come in the future, but there is a way to go.”
Some are more optimistic. “There is no question that our [paired-end 454 sequencing] approach, and related direct paired-end sequencing approaches, will be the one that prevails,” Snyder said. “It is easier, cheaper, and higher resolution [than fosmid-based sequencing]. It can be easily adapted to get paired ends for fragments of different sizes, which will help sort out [structural variations] of all sizes.
“Cloning will go the way of dinosaurs,” Snyder added. “Coupling paired-end sequencing with regular reads will allow de novo assembly of entire genomes and should get most information from a genome sequence.”
Others believe that there is still a place for clone-based Sanger sequencing. For example, researchers at the J. Craig Venter Institute have used three different sequence-based approaches to study structural variation in the Human Genome Project’s reference genome and in Venter’s HuRef genome — mapping paired-end reads from fosmid clones, similar to Eichler; paired Sanger reads; and paired 25-base reads from Applied Biosystems’ SOLiD platform.
“I think it’s safe to say that each of the approaches is necessary to obtain the most comprehensive collection of structural variants,” Ewen Kirkness, a researcher in the genomic medicine team at JCVI, told In Sequence by e-mail.
At the Biology of Genomes meeting in Cold Spring Harbor this month, he presented data from a project designed to evaluate how well short-read technologies can detect variation between human genome sequences. In that project, the researchers sequenced three different libraries from the HuRef genome with insert sizes ranging from 0.8 to 7 kilobases using ABI’s SOLiD technology and mapped the 25-base paired-end reads to the NCBI Build 36 of the human reference genome.
“While we remain in a phase of discovering novel structural variation among human genomes, I think the ‘old-gen’ technologies still have an important role to play,” Kirkness said. “When we have a better catalog of the range of variations that exist among human genomes, and the ability to generate, and map, short reads in an unbiased fashion, these new [sequencing] technologies should permit a cost-effective means of identifying most of an individual’s structural variants. But, I don’t think we are quite there yet.”
In the meantime, Eichler’s Human Genome Structural Variation Project is going to sequence another dozen HapMap samples using the fosmid-based approach. He is also collaborating with researchers at ABI who have been sequencing a Yoruban HapMap sample with paired-end SOLiD technology (see In Sequence 2/26/2008).
The 20 HapMap samples sequenced by the HGSV project are among the samples that the 1,000 Genomes Project, in which Eichler participates as a member of the analysis group, is sequencing with next-gen technologies.
A comparison of results will allow the scientists to determine “how much extra bang for the buck we are getting” by fosmid sequencing, Eichler said. “If [the 1,000 Genomes Project] can solve 90 percent of the structural variation, and they don’t need the fosmid project, then maybe it’s fine to do it that way.”