Using a new paired-end sequencing method with improved read lengths, a team of researchers from Yale University and 454 Life Sciences has mapped structural variations in two humans.
The new method, which 454 Life Sciences’ parent company Roche plans to make commercially available by mid-November, might find its widest use in genome assembly. But the project also highlights the importance of mapping structural variants in upcoming large-scale human genome sequencing projects that may involve a blend of complementary technology platforms.
The researchers, who found approximately 1,300 structural variations in total, published their results last week in Science. For 454, the article was part of a triple crown of sorts: two other articles published in Science last week featured its sequencing technology.
What distinguished this project from other approaches to cataloging human structural variants was a “significantly higher resolution than was previously possible for a genome-wide study,” according to Jan Korbel, the lead author on the article and a postdoc in Mark Gerstein’s group at Yale. “Therefore, we found many more variants than people have previously found who have looked for structural variants,” especially smaller ones, he said.
In a single individual, the researchers found the largest number of structural variants so far. Last November, a team of researchers led by the Wellcome Trust Sanger Institute published a paper in Nature in which they reported a total of almost 1,500 copy number variations in 270 individuals from the HapMap collection, using comparative genomic hybridization arrays.
In their study, the Yale and 454 researchers broke up the DNA into 3-kilobase fragments, circularized the pieces using a linker, cut the circles at random, enriched for DNA bits that contained the linker, and sequenced the DNA on both ends of the linker. Then they mapped the paired ends to the human reference genome, looking for deletions, inversions, or insertions.
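As a rough illustration of how mapped paired ends can point to a structural variant, the sketch below classifies a single pair by its span and orientation on the reference. It is not the researchers' pipeline: the `MappedPair` type, the `classify` function, and the 1-kilobase `TOLERANCE` cutoff are hypothetical, and only the roughly 3-kilobase fragment size comes from the study as described here.

```python
from dataclasses import dataclass

EXPECTED_SPAN = 3000  # approximate fragment size reported for the study (3 kb)
TOLERANCE = 1000      # hypothetical cutoff, chosen only for illustration


@dataclass
class MappedPair:
    chrom: str
    start: int                  # leftmost reference coordinate of the pair
    end: int                    # rightmost reference coordinate of the pair
    expected_orientation: bool  # False if one end maps in a flipped orientation


def classify(pair: MappedPair) -> str:
    """Label one mapped pair as concordant or as evidence for a variant."""
    if not pair.expected_orientation:
        # A flipped end suggests the sample sequence is inverted here.
        return "inversion"
    span = pair.end - pair.start
    if span > EXPECTED_SPAN + TOLERANCE:
        # Ends lie farther apart on the reference than in the sample,
        # so the sample is missing sequence: a deletion.
        return "deletion"
    if span < EXPECTED_SPAN - TOLERANCE:
        # Ends lie closer together on the reference than in the sample,
        # so the sample carries extra sequence: an insertion.
        return "insertion"
    return "concordant"


if __name__ == "__main__":
    print(classify(MappedPair("chr1", 1000, 9000, True)))
```

A single discordant pair is weak evidence on its own; a practical caller would presumably require several independent pairs supporting the same event before reporting it.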
They also validated about 500 of the variants they found by at least one method, for example by mapping the actual breakpoints using a combination of PCR and 454 sequencing.
Knowing the breakpoints of some variants, the scientists could relate them to genes more precisely. “We are really confident when we say a structural variant, let’s say, knocks out an exon, or it fuses genes,” Korbel said.
Though the new method covers more structural variants than other approaches, it is not exhaustive, Korbel admitted. Especially regarding large insertions, “it’s fair enough to say that we are not entirely comprehensive,” he said.
The scientists also did not look for small variations below a kilobase, which, in terms of their size, do not fit the currently used “working definition” of a structural variant, he said.
Improvements to their paired-end method would allow the researchers “to capture more, if not all” structural variants, according to Korbel. Among them are a sharper size distribution of the paired-end fragments, different sizes of fragments, and longer read lengths.
Also, variants in highly repetitive regions are “certainly harder to assess,” Korbel said, noting that this is true for all approaches to mapping structural variants, whether fosmid paired-end sequencing or array-based comparative genomic hybridization.
“They are able to really define the architecture of these structural variants better than we have been able to do in the past, and give us a sense of how much of these are there in a given individual,” Charles Lee, an assistant professor at Harvard Medical School, told In Sequence last week. He was an author on last year’s CNV paper in Nature.
Also, unlike CGH arrays, which only cover copy number variations, this method is able to pick up balanced arrangements, such as inversions. “That’s clearly a huge advantage of the sequencing approach,” Lee said.
But the cost and time of the Yale/454 project probably mean that the method will not be used routinely. “This is a proof-of-principle study. Given unlimited funds, and unlimited time, this is what we can get,” Lee said. “On a more practical note, to look at more individuals … you may have to still go to other platforms.”
According to a Roche spokesman, the study took between 60 and 70 runs on 454’s Genome Sequencer FLX. Assuming a list price of $8,000 per run, this means the sequencing alone cost around $500,000.
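The estimate can be checked with a short calculation. The run counts and the assumed $8,000 list price are the figures quoted above; the variable names are illustrative only.

```python
# Figures quoted in the article; the list price is itself an assumption there.
RUNS_LOW, RUNS_HIGH = 60, 70
PRICE_PER_RUN = 8_000  # US dollars

cost_low = RUNS_LOW * PRICE_PER_RUN
cost_high = RUNS_HIGH * PRICE_PER_RUN

print(f"Sequencing cost: ${cost_low:,} to ${cost_high:,}")
```

The range works out to $480,000 to $560,000, consistent with the figure of around $500,000.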
Lee said that an array CGH experiment at 1 kilobase resolution can be conducted for “under $2,000” and yield results “in a few days,” though it does not include inversion-type variations.
“It’s a tour de force-type approach using, for the first time, a paired-end approach based on next-generation sequencing to define structural variation,” said Stephen Scherer, a professor of molecular and medical genetics at the University of Toronto and an author on last year’s Nature paper on CNVs. “But clearly, it’s a very expensive approach.”
Interestingly, he said, the researchers found sequence characteristics that hinted at the molecular events behind the formation of particular structural variants.
Scherer and his colleagues are working on algorithms that incorporate structural variation data from different platforms, such as arrays, whole-genome sequencing, and paired-end sequencing.
Their approach includes Illumina’s sequencing technology, which has shorter reads than 454’s technology. “Of course if you get longer reads, and you get better sequence, it’s easier to do it accurately,” Scherer said. “[With] Solexa, you can do it. But it takes a little bit more work because of the issue of the short read length.”
The Yale and 454 researchers worked with a median read length of 109 bases on either end, “and we see that this is often just barely enough to map out structural variants in repetitive regions,” Korbel said. “It’s clear that read length matters a lot, but it’s a grey zone. Even with short reads, you get a portion of the events.”
Overall, the new method provides “a technological advance allowing for many more individuals to be analyzed systematically without the construction of libraries,” said Evan Eichler, an associate professor in the department of genome sciences at the University of Washington in Seattle. He coordinates an NHGRI-sponsored consortium to map structural variation in several dozen individuals by fosmid-based sequencing (see In Sequence 5/15/2007).
But because the 454 approach, and others based on next-gen sequencing technologies, lack clone libraries, “you can’t go back to the clones and sequence,” Eichler said. As a result, “you are going to opportunistically detect those [structural variants] that are easily detectable,” even though complex regions of the genome are known to harbor about 50 percent of all structural variation. “It’s going to be complementary to other technologies and approaches out there,” he said.
Will his consortium change its approach as a result of these findings? “Absolutely not,” Eichler said. “We are determined at least to focus on a subset of the individuals originally put forward and get these worked out at a very high degree.”
In any case, mapping structural variants will likely be an important part of any future large-scale human genome sequencing studies.
Simply because it spans more nucleotides than a SNP, “if you find a structural variant, per se it is more likely to have a phenotypic effect than a single SNP has,” Korbel noted.
“I think it’s extremely important,” said Lee. For example, structural differences will help explain disease susceptibility and shed light on evolutionary differences between humans and closely related species.
But it’s not clear which technology will be used in such large projects. “I think that no one technology is going to give us the most comprehensive map,” Lee said.
Others agree. “If one really does have the objective of identifying as much variation as possible, you probably need to use multiple platform approaches for now, until it becomes really cost-effective to generate relatively accurate and complete genome sequences,” said Scherer.