Almost a year after announcing they had sequenced the genome of Jim Watson using a 454 GS FLX platform, researchers at Roche’s sequencing arm and Baylor College of Medicine last week published an analysis of the genome.
Watson’s DNA is the first published human genome sequenced with a new sequencing technology. 454 generated the data within two months at a cost of approximately $1 million. However, the genome is not an independent assembly but an alignment of the sequence reads to the NCBI human reference genome.
Based on the results of the project, Baylor now plans to sequence additional human genomes with the 454 technology as part of the Cancer Genome Atlas pilot project.
The scientists, who published their analysis in Nature last week, used the GS FLX to sequence Watson’s genome to 7.4-fold redundancy, generating more than 100 million high-quality unpaired reads, or about 25 billion bases, in over 200 runs. The data, which 454 generated in early 2007 in two one-month spurts, was analyzed by the Human Genome Sequencing Center at Baylor College of Medicine.
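As a rough consistency check on those figures, the short Python sketch below derives the mean read length, the genome size implied by the stated redundancy, and the yield per run. It uses only the rounded numbers quoted above; the variable names are illustrative, and because the read and run counts are reported as lower bounds ("more than" and "over"), the derived values are approximate upper bounds rather than figures from the paper.

```python
# Back-of-the-envelope check using the rounded figures quoted in the article.
# These are approximations, not values taken from the published analysis.
total_bases = 25e9       # "about 25 billion bases"
read_count = 100e6       # "more than 100 million" reads (a lower bound)
fold_redundancy = 7.4    # stated sequencing redundancy
runs = 200               # "over 200 runs" (a lower bound)

# Mean read length: total bases divided by the number of reads.
mean_read_length = total_bases / read_count

# Genome size implied by the redundancy: total bases divided by fold coverage.
implied_genome_size = total_bases / fold_redundancy

# Average yield per instrument run.
bases_per_run = total_bases / runs

print(f"Mean read length: {mean_read_length:.0f} bases")
print(f"Implied genome size: {implied_genome_size / 1e9:.2f} billion bases")
print(f"Yield per run: {bases_per_run / 1e6:.0f} million bases")
```

The mean of roughly 250 bases per read agrees with the read length cited elsewhere in this article, and the implied genome size of a little over 3 billion bases is in line with the size of the human genome.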
After mapping the reads to the NCBI human reference genome, the scientists identified 3.3 million potential SNPs, of which about 2.7 million were contained in the dbSNP database. About 10,000 of the SNPs alter the amino acid sequence of proteins. They also found about 66,000 insertions and 157,000 small deletions, as well as a few copy number variations that cause gains or losses of chromosomal segments.
The researchers also assembled reads that did not map to the reference, generating contigs that may encode approximately 50 new genes that are not in the NCBI reference genome.
Generating the data cost 454 approximately $1 million, including reagents, labor, and instrument depreciation, according to Michael Egholm, vice president of research and development at 454. He estimated that the same project would cost the company about $200,000 today, due to updates to the sequencing technology that enable researchers to generate five times as much data per run.
The team did not tally the cost of the analysis, which would likely be quicker and less expensive in future human genome-sequencing projects because the methods are now established, according to Egholm.
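Egholm's estimate follows from simple scaling, as the Python sketch below illustrates. It assumes, for illustration only, that the cost of a run stays constant while its yield grows fivefold, so that the total cost falls in proportion; the article does not say how 454 arrived at its figure. The variable names are hypothetical, and the cost per megabase is derived from the rounded numbers quoted in this article.

```python
# Illustrative cost arithmetic from the rounded figures in the article.
# Assumption (not stated by 454): per-run cost is unchanged, so the total
# cost scales inversely with the amount of data generated per run.
original_cost_usd = 1_000_000   # approximate cost of generating the data
total_bases = 25e9              # "about 25 billion bases"
throughput_gain = 5             # "five times as much data per run"

# Cost per million bases at the original throughput.
cost_per_megabase = original_cost_usd / (total_bases / 1e6)

# Projected cost of the same project at the higher throughput.
projected_cost_usd = original_cost_usd / throughput_gain

print(f"Original cost per megabase: ${cost_per_megabase:.2f}")
print(f"Projected project cost: ${projected_cost_usd:,.0f}")
```

Under that assumption the projection matches the roughly $200,000 Egholm cited. The figure excludes analysis, which the team did not tally.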
According to David Wheeler, director of bioinformatics at the Baylor Human Genome Sequencing Center, who headed the data analysis, “the most important result is that 454 sequencing works [and] is on par with Sanger sequencing in terms of discovering variations.” The sequence quality is high enough, he added, to detect variations “that you would want to find in a personal genome” with a coverage comparable to the coverage needed with Sanger data.
Watson’s genome is the second individual human genome sequence to be published, following Craig Venter’s genome, which appeared in PLoS Biology last summer (see In Sequence, 9/4/2007).
That study identified approximately 3.2 million SNPs in Venter’s genome, about 6,000 of them altering amino acids, in addition to 300,000 small heterozygous indels, 560,000 homozygous indels, 54,000 block substitutions, 90 inversions, and “numerous” segmental duplications and copy number variations. Venter, whose project relied on Sanger sequencing, said at the time that the study cost approximately $70 million.
“What’s truly remarkable is how similar [the two genomes] are,” said Egholm. “That alone, I think, is an incredible validation, that you independently get the same result.”
“In terms of the polymorphisms that we were able to measure, the two genomes are equivalent,” said Wheeler, although they do not have the same number of copy number variants.
He said the studies found approximately the same number of novel SNPs, and the two genomes shared about half the non-synonymous SNPs the studies identified.
The analysis of the Watson genome focused on SNPs, he said, because they are easier to analyze and compare. “The indels are a little trickier because there are issues with mapping indels that always create some variation in the precise localization of them,” Wheeler said.
Single-base indels in particular are still difficult to detect, he pointed out, largely because of the homopolymer run problem of the 454 technology, although that problem also exists with Sanger technology on a smaller scale.
Unlike Venter’s genome, Watson’s genome is not a de novo assembly of paired-end reads, but an alignment of unpaired reads to the human reference genome, which has some disadvantages.
“We believe that an independent assembly of [a] genome is important since it is possible to use the placement of long reads, close to 0.8 kilobases [in length], and their mates to build haplotypes and thus diploid sequence,” Sam Levy, a senior scientist at the J. Craig Venter Institute who headed the Venter genome project, told In Sequence in an e-mail message.
“Further, it also provides a good understanding of where novel sequence, not found in the reference genome, is accurately placed. It is difficult to ascertain the extent of completeness of the Watson genome, given the mapping of 454 reads to the NCBI reference,” he added.
Paired-end reads would have helped to find larger indels, as well as inversions and possibly translocations, according to Wheeler. However, he said he does not think the lack of de novo assembly of the unpaired reads “was a great downside.” An assembly of the single reads might have helped with the detection of larger insertions, larger deletions, and inversions, he said.
Levy also pointed out that he and his colleagues were able to construct a diploid sequence for about 80 percent of the Venter genome, whereas “the Watson genome does not provide a diploid structure, probably due to the difficulty in haplotype reconstruction using short reads [of] 250 base pairs, coupled with the absence of paired-end information.”
Egholm said that although his team indeed did not build a diploid sequence, outside researchers “at least have claimed to us that they have been able to draw out entire haplotypes just from following from one read to another in the [genome browser] at the Cold Spring Harbor website.”
He acknowledged that the project did not find large structural variants that would have been expected. For example, last fall, 454 researchers, in collaboration with Yale University scientists, mapped approximately 1,300 large structural variants in two human genomes using paired-end reads from the GS FLX (see In Sequence, 10/2/2007).
Future human genome sequencing studies using the 454 technology should therefore make use of paired-end reads, Egholm said. “I actually ultimately think that all resequencing will be de novo assembly.”
454 has already successfully completed de novo assemblies of the Arabidopsis and Drosophila genomes, which the company is scheduled to present at the Biology of Genomes meeting at Cold Spring Harbor Laboratory next month. “I think that will make it credible to people that we will also be able to get there with humans, but we are not there yet,” Egholm said.
He did not say whether 454 plans to assemble Watson’s genome de novo using paired-end 454 reads, but said that “it’s an obvious thing to do.”
According to Egholm, the Watson genome is the first of many individual genomes to be sequenced with new platforms. “The whole point is, you need to sequence hundreds of thousands of genomes in order to really say something meaningful and begin to link genotypes and phenotypes, which is what we ultimately want to do,” Egholm said. “But before you start sequencing a thousand genomes, you need to sequence a [single] genome … and characterize the biases that you have in your data. And that’s really what this paper is about, trying to set some standard for how to assess a new sequencing technology.”
According to Levy, JCVI is “currently evaluating different sequencing technologies, including 454, to determine their capability of reproducibly providing us with the kind of sequence data that will enable us to produce independent assemblies of diploid genomes.”
Wheeler said that Baylor plans to use the 454 technology to sequence a number of cancer genomes as part of the Cancer Genome Atlas pilot project. Details, such as the coverage and ratio of paired to unpaired reads, are still under discussion, he added. That study will probably use the so-called XLR extra-long reads that 454 has been developing.
He does not believe that short-read technologies, such as Illumina’s Genome Analyzer, should be used in that project because their reads can be mapped accurately to only a reduced fraction of the genome. “My own personal bias is that for a human genome, we would not use that [technology],” he said. However, short-read technologies might be useful for sequencing the transcriptome of the cancer samples, he added.
Illumina and Applied Biosystems have both been testing how well their short-read platforms can sequence a human genome using a HapMap sample, an anonymous African Yoruban man from Ibadan, Nigeria. Neither firm has published its data yet, however (see In Sequence, 2/26/2008).