Next-generation sequencing technologies have reached the market, but how close are they to ushering in the era of personal genomics?
Researchers at 454 Life Sciences and Baylor College of Medicine have now subjected the Genome Sequencer FLX to the ultimate test: re-sequencing an entire human genome.
At the Advances in Genome Biology and Technology meeting in Marco Island, Fla., last month, two scientists presented initial results from the ongoing study in separate talks.
Up until recently, no one had re-sequenced a human genome by any method, explained David Wheeler in an interview with In Sequence last week. He is a researcher in Baylor’s Human Genome Sequencing Center and an associate professor in the department of molecular and human genetics at Baylor, and gave one of the talks at the conference.
“The question would be, what new information can we get from a single person?” he said.
As it turns out, quite a lot. For example, the initial findings show that the 454 technology generated sequence data not contained in the public assembly or in Celera’s version of the human genome. It also yielded not only a SNP map but also a map of insertions and deletions.
So far, the researchers have sequenced the human genome, which happens to be that of Jim Watson, to 3X coverage, generating approximately 40 million reads, or 10 gigabases’ worth of sequence data, with an average read length of 250 base pairs.
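The coverage figure follows from simple arithmetic, sketched below. The read count and read length come from the article; the haploid genome size of roughly 3 billion bases is an assumption added for illustration, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the reported sequencing depth.
# READS and READ_LEN are the article's figures; GENOME_SIZE is an
# assumed round number for the haploid human genome, not from the article.

def total_bases(num_reads, mean_read_length):
    """Total sequenced bases = number of reads x average read length."""
    return num_reads * mean_read_length


def fold_coverage(bases, genome_size):
    """Average depth: how many times each genome position is read."""
    return bases / genome_size


READS = 40_000_000           # approximately 40 million reads
READ_LEN = 250               # average read length in base pairs
GENOME_SIZE = 3_000_000_000  # assumption: about 3 billion bases

bases = total_bases(READS, READ_LEN)
coverage = fold_coverage(bases, GENOME_SIZE)
print(f"{bases / 1e9:.1f} gigabases, about {coverage:.1f}X coverage")
```

Under that assumed genome size, 10 gigabases works out to a little over 3X, in line with the figure the researchers quote.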
454, which produced the data, cranked out most of it within a three-week period in January and handed it over to the Baylor researchers for analysis.
The scientists plan to extend the coverage to 6X, or 3X for each haplotype, and present a complete analysis at the Biology of Genomes meeting at Cold Spring Harbor Laboratory in May.
The researchers plan to release the data publicly but will not lay open medically important bits of Watson’s genome, Wheeler said. “It’s going to be confined to general genome characteristics,” such as structural variation, or data that address the completeness of the human genome or population genetics, he said.
Analyzing a Nobel laureate’s genome has been both a blessing and a curse, it seems. “The minute we start talking about an identifiable person, we have a lot of ethical issues to deal with,” Wheeler said.
454 obtained Watson’s sample in 2005, and “the ethical issues have been evolving very rapidly over the last three years, as the implications of medical sequencing and the banking of that data have become more clear,” he explained.
Analyze This: Watson’s Genome
At last month’s meeting, Wheeler and Michael Egholm, 454’s vice president of molecular biology, presented a first analysis of the data the team has gathered to date.
For starters, the GS FLX seems to have tackled all regions of the genome equally well: the reads were evenly distributed throughout the genome, save for some areas with higher- or lower-than-expected coverage. Those probably resulted from mapping errors in repeat sequences, Wheeler said.
Most of the reads had full-length matches in the genome, and two thirds of all reads mapped in unique locations. To increase this number, the researchers are currently re-analyzing those reads that hit multiple locations with more sensitive alignment methods.
“Overall, the coverage is quite uniform. It appears that we have a nicely unbiased dataset,” Wheeler said. In Sanger sequencing, he pointed out, there are always some unclonable regions.
What’s more, 454’s technology appears to have delivered sequence that is still missing from the finished human genome: About 1.3 million reads did not find a match in the public assembly. However, about two thirds of them matched a database of known human repeat sequences, indicating that they are of human origin, probably from regions of heterochromatin.
When the researchers compared the 1.3 million unmatched reads against Celera’s assembly, about 20 percent of them matched, mostly in telomeric and centromeric regions.
The remaining 80 percent could be very low-quality reads, contamination with DNA from other species, or new sequences not included in either the public or the Celera assemblies, according to Wheeler.
To see if they would match any DNA, Wheeler compared several hundred of these reads against the non-redundant database of GenBank and found they all matched human DNA, “so I am confident in ruling out contamination and poor quality.”
Moreover, some of them hit human fosmids from the genome centers “that are in finished quality but just not yet placed on the human assembly,” Wheeler said. Further analysis showed that these probably belong to telomeres.
Some of the reads, though, “don’t hit anything even in GenBank,” Wheeler said, and could be unsequenced parts of the human genome, which by most estimates is about 97 percent complete.
“We are going to try to assemble those to see if we can put contigs together [and] find possible little islands of unique sequence that probably map into heterochromatin,” Wheeler said.
Eventually, the data will enable the researchers to do a quality control of the published reference genome, Wheeler said. “I am sure we will find events that turn out to be errors in the public genome, so it may become even more accurate.”
The researchers also looked for single-base variations by analyzing only reads that neither had full-length matches nor contained insertions or deletions. In those reads, they found 1.9 million single-base variations, of which 1.3 million matched entries in dbSNP. The remaining 600,000 are likely to be a combination of new polymorphisms that have not yet entered dbSNP and sequencing or read-placement errors, Wheeler said.
Fifty of the known SNPs matched a database listing phenotypes of human polymorphisms, Egholm said in his talk. When informed of these results, Watson quipped, “Oh, only 50 things are wrong with me,” Egholm reported.
Wheeler’s team analyzed reads with insertions or deletions separately. These could either result from sequencing errors or from real structural variations in the genome.
454’s sequencer is known for making mistakes in homopolymer runs, mostly by under- or overestimating the number of bases in such runs by one or two. “As expected, indels in the 1- to 2-base range account for the vast majority of all the indels we found,” Wheeler said. “Most of those are probably errors due to homopolymer runs.”
But the researchers found that another 68,000 reads had indels larger than 2 bases, which Wheeler believes are “very accurate.” For example, they found a peak of 310-base indels, as they would expect from so-called Alu elements. “We do see that peak. Therefore, I suspect that the indels above 3 bases are probably all real,” Wheeler said.
That, he said, is an important result, not only because it means the data provide a comprehensive list of insertions and deletions in Watson’s genome, but also because no other method is able to pick up structural variations in this size range, between 10 and 2,000 bases.
“This range is below the threshold of detection by current methods looking for structural variation, which use paired-end fosmid libraries,” he said. “So this is going to give us a view of variation that no other technique can give.”
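The size-based triage Wheeler describes can be sketched roughly as follows. This is an illustration only, not the Baylor team's pipeline: the 2-base cutoff is taken from the article, while the function names and the example indel lengths are invented for the sketch.

```python
# Rough sketch of sorting indels by size, assuming (per the article) that
# 1- to 2-base indels are mostly homopolymer miscounts and longer ones
# are more likely real. The example data below is made up.
from collections import Counter

HOMOPOLYMER_MAX = 2  # indels up to this length are treated as suspect


def classify_indel(length):
    """Label an indel by length in bases."""
    if length < 1:
        raise ValueError("indel length must be at least 1 base")
    if length <= HOMOPOLYMER_MAX:
        return "likely_homopolymer_error"
    return "candidate_real"


def summarize(lengths):
    """Count how many indels fall into each category."""
    return Counter(classify_indel(n) for n in lengths)


# Hypothetical indel lengths; 310 mimics the Alu-sized peak in the article.
example_lengths = [1, 1, 2, 1, 5, 310, 12, 2, 310]
summary = summarize(example_lengths)
print(summary["likely_homopolymer_error"], "suspect,",
      summary["candidate_real"], "candidate real")
```

A real analysis would also need to check whether each short indel actually falls inside a homopolymer run, which this sketch does not attempt.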
The reason the researchers did not use paired reads for this analysis is simply that they were not available. “There would be a lot of added value to paired ends,” Wheeler said. However, “at the time, they were not producing long reads in paired ends. This was the most accurate read technology they had.”
454 is currently evaluating a new method to create long paired reads (see In Sequence 03/05/2007).
Overall, Wheeler said he was impressed by the accuracy of the 454 technology, and by the fact that its comparatively long 250-base-pair reads allow researchers to analyze indels without having to generate read pairs.
Illumina is working on its own analysis of an entire human genome, that of an African man who was part of the HapMap project, on its Genome Analyzer. At the AGBT meeting last month, David Bentley, chief scientist for Illumina’s sequencing business, said the company has so far generated 10 gigabases, or 3.4X worth of data.
So is the 454 technology ready for prime-time human re-sequencing? “I think it’s still a little too expensive to start replicating this broadly,” Wheeler cautioned.
According to 454, the run cost for 3X coverage was about $800,000, based on approximately 100 runs on the GS FLX and a reagent list price of $8,000 per run.
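The cost estimate can be reproduced with a few lines of arithmetic. The run count, list price, and 10-gigabase total are the article's figures; the per-gigabase and per-run yields are derived here for illustration and were not quoted by 454.

```python
# Reagent cost for the 3X data set, using the article's approximate figures.
# The derived per-gigabase cost and per-run yield are illustrative only.

def reagent_cost(runs, price_per_run):
    """Total reagent cost in dollars at list price."""
    return runs * price_per_run


RUNS = 100             # approximately 100 GS FLX runs
PRICE_PER_RUN = 8_000  # reagent list price per run, in dollars
GIGABASES = 10         # total sequence generated so far

cost = reagent_cost(RUNS, PRICE_PER_RUN)
cost_per_gigabase = cost / GIGABASES
megabases_per_run = GIGABASES * 1_000 / RUNS

print(f"Total: ${cost:,}")
print(f"Per gigabase: ${cost_per_gigabase:,.0f}")
print(f"Yield per run: about {megabases_per_run:.0f} megabases")
```

If cost scaled linearly with depth, which is an assumption rather than anything 454 stated, the planned 6X data set would cost roughly twice as much in reagents.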
However, “when you have people that will spend $20 million to take a ride to space, there are probably people who would want to have this done,” Wheeler said.
454 might be able to accommodate them. “We would welcome the opportunity to plan such a sequencing project with interested parties,” Mary Schramke, 454’s vice president of marketing, told In Sequence this week.