New sequencing technologies could lower the cost of a multi-million-dollar project to map structural variations in the human genome, but the new methods have yet to prove that they are up to the task, according to a researcher involved in the initiative.
The aim of the three-year project, which the National Human Genome Research Institute kicked off in early 2006, is to characterize insertions, deletions, inversions and other large-scale variations in 62 human genomes at the sequence level. Cost estimates run between $35 million and $40 million at the moment.
The project calls for a research consortium of several genome centers to build clone libraries from the DNA samples, which come from 48 female and 14 male HapMap individuals, and to end-sequence those libraries, generating about 50 gigabases of sequence data.
After mapping these paired reads from the clone inserts to the human reference genome, the consortium plans to completely sequence clones with structural variations in order to determine the specific changes that have taken place in that part of the genome.
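The mapping step described above can be illustrated with a minimal sketch. It assumes that the two end reads of a clone should land on the same chromosome, on opposite strands, and about one insert length apart; a placement that breaks one of these expectations flags the clone for full sequencing. The names `EndPlacement` and `classify_clone`, the category labels, and the 32,000 to 48,000 base-pair window (chosen around an assumed fosmid insert of roughly 40 kb) are hypothetical illustrations, not the consortium's actual pipeline or thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndPlacement:
    """Where one end read of a clone maps on the reference (hypothetical type)."""
    chrom: str
    pos: int      # coordinate on the reference genome
    strand: str   # "+" or "-"


def classify_clone(left: EndPlacement, right: EndPlacement,
                   min_span: int = 32_000, max_span: int = 48_000) -> str:
    """Label a clone by how its two end reads sit on the reference.

    The span window is an assumed tolerance around a ~40 kb insert,
    not a value taken from the project.
    """
    if left.chrom != right.chrom:
        return "interchromosomal"
    if left.strand == right.strand:
        # Ends of an unrearranged insert map to opposite strands.
        return "inversion"
    span = abs(right.pos - left.pos)
    if span > max_span:
        # Reference holds more sequence than the clone: deletion in the sample.
        return "deletion"
    if span < min_span:
        # Clone holds more sequence than the reference: insertion in the sample.
        return "insertion"
    return "concordant"


if __name__ == "__main__":
    a = EndPlacement("chr1", 100_000, "+")
    b = EndPlacement("chr1", 200_000, "-")
    print(classify_clone(a, b))
```

A sketch like this only works when both ends can be placed uniquely on the reference; an end read that maps to several locations gives no usable span or orientation.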
Similar studies, such as the Copy Number Variation Project, which published its results last year, have sought to catalog structural variation in the human genome. However, these studies, which used microarrays, “don’t offer the actual differences at the base-pair level,” Evan Eichler, an associate professor in the department of genome sciences at the University of Washington in Seattle, told In Sequence last week. He led the working group that advised NHGRI and coordinates the large-scale structural variation project. The working group published a detailed description of the project in last week’s Nature.
“We need to have the sequence, the precise breakpoints, so we can develop assays specifically for that allele,” Eichler said. “Until we do it right and get the actual structural variation worked out at the base pair level, we are kind of blind — we don’t really know what we are genotyping,” he said.
Eichler said that copy-number chips currently marketed by companies like Illumina and NimbleGen do not yet cover all the variations. “They don’t even know where many of these regions are, and they don’t have algorithms to accurately call them,” he said.
The results of the project would allow researchers to design genotyping assays that could use PCR-sequencing or chip-based techniques to screen thousands of individuals in order to find out how frequently certain variations occur, and whether they are associated with specific diseases.
The first stage of the project — analyzing the genomes of 10 individuals — is nearing completion. Agencourt Bioscience generated fosmid clone libraries with about 10 million clones per individual and end-sequenced them using conventional Sanger sequencing. Genome centers at Washington University and the University of Washington are now sequencing clones that differ from the reference genome.
At the moment, organizers estimate the project will cost between $35 million and $40 million, with the lion’s share coming from generating the clone libraries and end-sequencing the clone inserts, according to Eichler.
Paired-end sequencing of the clones with next-generation sequencing technologies could significantly lower this cost, and the consortium has already tested 454’s technology in collaboration with researchers from Lawrence Berkeley National Laboratory.
Using a ditag sequencing strategy, they cut out the inserts from the clone libraries, circularized them, and cut them again to create 18-base-pair ditags, which they linked and sequenced using 454’s technology.
“The sad news for us [was that] we can only place about half the end sequences [on the reference genome],” Eichler said. The other half of the reads were too short to be placed unambiguously on the genome, making them essentially useless for the project, he said.
Eichler has also been in talks with Illumina to use the company’s new sequencing platform to sequence a control sample where structural variations are already known.
“The problem is, the new technologies generate a lot of sequence, but it’s short and it’s of lower quality [than Sanger sequencing],” he said. “And about half of the structural variation maps to very complex regions of the genome that are highly repetitive and contain a lot of duplicated sequences. If you cannot map your end sequences on the human genome, you cannot determine whether there is a structural difference.”
Paired-end sequencing with 454’s or Illumina’s platforms could still be useful because of their low cost, he said, but “we suspect that we will miss half the structural variation if we use short reads from these technologies.” The researchers hope to finish testing the new technologies this summer, he added.
“What I would like to see happen is actually the opposite of where the stampede is going. I would like to see longer reads of higher quality,” Eichler said.