In anticipation of large-scale human resequencing studies like the 1,000 Genomes Project, Illumina and Applied Biosystems have separately been testing how comprehensively, quickly, accurately, and inexpensively their respective next-gen sequencing platforms can sequence a human genome.
At the Advances in Genome Biology and Technology conference in Marco Island this month, researchers from both companies presented preliminary data on the analysis of the same individual, a HapMap sample of African origin. Their results will be compared with those generated by other methods in order to determine how well the two platforms fare in finding SNPs and structural variants.
The sample both companies have been sequencing comes from an anonymous Yoruban man from Ibadan, Nigeria. “This particular sample will be one of the best studied regarding all forms of variation,” Evan Eichler, an associate professor in the department of genome sciences at the University of Washington in Seattle, told In Sequence by e-mail.
Eichler leads an NHGRI-sponsored study that characterizes structural variations in this and other human genome samples by Sanger-based fosmid sequencing (see In Sequence 5/15/2007). The study has recently been accepted for publication in a scientific journal.
In terms of structural variations, results from this project will serve as a benchmark for next-gen sequencing technologies. “The litmus test will be comparing the same samples where we have generated high-quality sequence across structural variants using the fosmid-based approach against Illumina’s and ABI’s SOLiD predictions,” Eichler said.
Illumina’s and ABI’s efforts are not the only next-generation whole-genome human sequencing projects: Last May, researchers from 454 Life Sciences and Baylor College of Medicine said they sequenced the genome of Jim Watson using 454’s platform, a study that has yet to be published (see In Sequence 6/5/2007).
And last fall, scientists at the Beijing Genomics Institute said they used Illumina’s platform to generate a first draft genome from a Chinese researcher, a study they intend to publish (see In Sequence 9/25/2007).
HTP ‘For the First Time’
For six weeks, Illumina devoted eight of its sequencing machines, half of the capacity of its former Solexa site in Little Chesterford, UK, to generating approximately 77 gigabases of data from 27 paired-end runs, according to David Bentley, the company’s chief scientist.
“We really went into high-throughput production sequencing internally for the first time” with this project, he told In Sequence during the AGBT meeting.
Each of these runs, which did not make use of the hardware upgrades to the system that Illumina presented at the Marco Island meeting (see In Sequence 2/19/2008), generated 3 gigabases of data on average, with the top run yielding nearly 4 gigabases. Ninety-five percent of the runs were successful, generating high-quality data with greater than 99-percent accuracy per read, according to Bentley.
Most of the runs used 200-base inserts, and a few used 2-kilobase insert libraries. In total, the sequence, which represents approximately 20-fold coverage, covers about 92 percent of the human reference genome. The scientists aligned the reads to the reference using Eland, Illumina’s aligner, as well as Velvet, an assembler developed by researchers at the European Bioinformatics Institute.
The alignment includes repeat regions, Bentley noted. “Given that half of [the genome] is acknowledged to be repetitive, it means … we are sequencing across most of what people traditionally call repeats, [including] most of the Alu repeats and LINES, as well as across all the genes and exons,” he said. “All we are missing are recent segmental duplications, the active interspersed repeats, and of course the actual gaps in the reference,” although some Illumina reads even cover some of these gaps.
The researchers called 3.7 million potential SNPs, of which 2.7 million were already in dbSNP. In order to be called, each SNP had to be present in at least three sequence reads.
Comparing their data with genotyping data generated on Illumina’s HapMap550 chip and by the HapMap project, the scientists found that about 98 percent of the genotypes agreed. The overwhelming majority of the disagreements were undercalls, Bentley pointed out, but additional sequence coverage is likely to reduce them.
The researchers aim to achieve a false-positive SNP-calling rate of 1 per 500 kilobases, “which is what geneticists would be very happy with,” given that the polymorphism rate in humans is about 1 SNP per kilobase, Bentley said.
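To put that target in perspective, the two rates Bentley quoted can be combined in a few lines of Python. This is only a back-of-the-envelope sketch: the variable and function names are illustrative, and it assumes that every true SNP is also called, which the article does not claim.

```python
# Figures quoted in the article.
TARGET_FALSE_POSITIVES_PER_BASE = 1 / 500_000  # 1 false positive per 500 kilobases
HUMAN_SNPS_PER_BASE = 1 / 1_000                # roughly 1 SNP per kilobase


def false_call_fraction(fp_per_base: float, snp_per_base: float) -> float:
    """Share of all SNP calls that are false.

    Assumption (not from the article): every true SNP is called, so the
    total call rate is simply true SNPs plus false positives.
    """
    return fp_per_base / (fp_per_base + snp_per_base)


true_snps_per_false_call = HUMAN_SNPS_PER_BASE / TARGET_FALSE_POSITIVES_PER_BASE
fraction = false_call_fraction(TARGET_FALSE_POSITIVES_PER_BASE, HUMAN_SNPS_PER_BASE)

print(f"about {true_snps_per_false_call:.0f} true SNPs per false call")
print(f"false calls make up about {fraction:.2%} of all calls")
```

Under those assumptions, the target works out to roughly one false call for every 500 real SNPs, or about 0.2 percent of all calls.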
They have also been able to call structural variants, such as copy number variants, deletions, insertions, and inversions, though Bentley did not say how many.
In addition, they have assembled de novo reads in the datasets that did not align to the human reference genome. Some of these assembled regions, Bentley said, align to the Celera assembly of the human genome.
While the company has been analyzing data from the first 27 runs, it has generated more data that add approximately 48 gigabases, he said, bringing the total up to 115 gigabases, or about 30-fold coverage.
In addition, runs with even longer library inserts of 4 and 5 kilobases are “on the machines at the moment,” Bentley said.
Going forward, Illumina plans to sequence more human genomes, both in collaboration with external partners and internally. “We would like to do more genomes, we would like to do more in research, we would like to learn how to do it better, we would like to fold in the longer-insert paired reads and get a more complete picture,” Bentley said.
ABI’s ‘Side Project’
Applied Biosystems has not yet generated as much sequencing data as Illumina on the African HapMap sample. “This has been a bit of a side project for us” that is still ongoing, Kevin McKernan, senior director for scientific operations for high throughput discovery, told In Sequence.
For ABI, the goal of the project, which began last summer, is to determine the minimum amount of coverage needed to find the greatest number of polymorphisms, said McKernan. “It’s very pertinent to the 1,000 Genomes Project. They are trying to figure out what’s the basal coverage they need for one of these [genomes],” he said. “The two-base encoding [used in SOLiD sequencing], in fact, does a very good job at picking up SNPs at low coverage.”
So far, ABI has analyzed 2.5 paired-end runs, adding up to approximately 21 gigabases of data, or 7-fold sequence coverage. These runs used five insert libraries with different sizes, each on a separate slide: 600 bases, 800 bases, 1,200 bases, 1,700 bases, and 2,800 bases.
Making use of the two-base encoding scheme of the SOLiD sequencing system, the scientists identified 7.5 million positions with adjacent color changes in the reads, or differences from the reference that are not based on “random errors.” Of those, approximately 2 million are in dbSNP. Of the remaining 5.5 million, about 580,000 were confirmed by more than one read. “As the coverage increases, this is likely to move to 1 to 2 million [SNPs] which have multiple reads confirming their presence,” McKernan said. “At 5-fold sequence coverage, it’s easy to miss heterozygote SNPs simply due to theoretical sampling limitations.”
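The sampling limitation McKernan mentioned can be illustrated with a simple, idealized model. The Python sketch below is not ABI’s method: it assumes read depth at a site follows a Poisson distribution with a mean equal to the fold coverage, that each read samples either allele of a heterozygous site with equal probability, and that there are no sequencing or mapping errors. The function name is made up for illustration.

```python
import math


def prob_variant_seen(coverage: float, min_reads: int = 2) -> float:
    """Probability that the variant allele of a heterozygous site shows up
    in at least `min_reads` reads.

    Idealized model: variant-allele read count ~ Poisson(coverage / 2).
    """
    lam = coverage / 2.0
    # Probability of seeing fewer than `min_reads` variant reads.
    p_fewer = sum(
        math.exp(-lam) * lam**k / math.factorial(k) for k in range(min_reads)
    )
    return 1.0 - p_fewer


# Coverage levels mentioned in the article: 5-fold, 7-fold, and 12-fold.
for cov in (5, 7, 12):
    print(f"{cov}-fold coverage: {prob_variant_seen(cov):.2f}")
```

In this model, requiring confirmation by more than one read catches roughly 71 percent of heterozygous sites at 5-fold coverage, 86 percent at 7-fold, and 98 percent at 12-fold, which is consistent in spirit with ABI’s plan to add coverage.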
The researchers also detected structural variation in the genome. So far, they have found 54,000 small insertions and deletions by scanning inside reads, and 50,000 large-scale variations by looking for changes in the insert size of paired-end reads.
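The general idea behind the insert-size approach can be sketched in Python as follows. This is a hypothetical illustration, not ABI’s pipeline: the function name, the tolerance, and the example spans are invented, and only the 1,200-base library size comes from the article. The logic is that a pair whose ends map farther apart on the reference than the library insert suggests a deletion in the sample, while a pair mapping closer together suggests an insertion.

```python
def classify_pair(mapped_span: int, library_insert: int, tolerance: int) -> str:
    """Classify one read pair by the distance between its mapped ends.

    mapped_span    -- distance between the two ends on the reference, in bases
    library_insert -- nominal insert size of the library, in bases
    tolerance      -- allowed deviation before a pair is flagged (hypothetical)
    """
    if mapped_span > library_insert + tolerance:
        # Ends are farther apart on the reference than in the sample fragment.
        return "possible deletion"
    if mapped_span < library_insert - tolerance:
        # Ends are closer on the reference than in the sample fragment.
        return "possible insertion"
    return "concordant"


# Invented spans for a 1,200-base library with an assumed 300-base tolerance.
example_spans = [1180, 2900, 650]
labels = [classify_pair(span, 1200, 300) for span in example_spans]
print(labels)
```

A real analysis would presumably require several independent pairs to support each event and would estimate the tolerance from the observed insert-size distribution of each library.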
ABI plans to increase the sequence coverage to 10- to 12-fold eventually. With that much sequence data, “we are probably going to see all the heterozygotes we need,” McKernan said.
Last summer, the researchers performed a single-fragment run on the HapMap sample that generated 6 gigabases of data (see In Sequence 10/16/2007), but ABI did not include the analysis of that run in its presentation at Marco Island, which focused on paired-end data.
McKernan and his colleagues are currently comparing their data to the Sanger-based fosmid-sequencing data generated by Eichler’s consortium “to get the correlation between our structural variation and what he found,” McKernan said.
According to Eichler, the biggest challenge next-gen sequencing platforms face is to determine structural variants in duplicated or repeated DNA regions where it is difficult to place short reads, even when they are paired.
“Next-gen technologies are opportunistic and will detect thousands of events in unique sequence but fare less well in repetitive regions,” he said. “Many genes [involved in] disease and disease susceptibilities occur in these regions, so missing these is an important challenge unlikely to be met by current next-gen sequencing technologies,” he predicted.
He and his colleagues have already compared their fosmid-based data with paired-end data generated by researchers at Yale University using 454’s technology (see In Sequence 10/2/2007), which yields 250-base reads, “and there is a deficiency in that technology precisely over regions associated with duplicated genes and gene families,” Eichler said.
How Much Is It?
However, even if the new sequencing technologies are not going to cover structural variants as comprehensively as Sanger sequencing, they will lower the cost of human genome sequencing.
To be sure, the cost of the final projects will depend on how “a human genome” is defined in terms of completeness and accuracy. The cost will vary greatly depending on the coverage and whether the companies count reagent costs only or “fully loaded” costs.
Despite this, the companies provided some early hints. Illumina CEO Jay Flatley said in interviews this month that the cost of sequencing the African genome was $100,000, but he did not say how many runs this included, and whether this cost counted anything besides reagents.
Bentley declined to put a number on the cost of the project. “Clearly, this is our first genome, and we learned a lot about it, so it wouldn’t be at our target costs.” However, the $100,000 genome “remains an important goal for us in the coming months,” he said.
The list price for a paired-end run on Illumina’s current Genome Analyzer is $5,400, putting the reagent cost for 27 runs, or 20-fold coverage, at almost $150,000. However, after Illumina upgrades its system, the same amount of data could be obtained with fewer runs. Also, “as the output of the system continues to increase, we expect substantial drops in the cost of sequencing full genomes,” Bentley said.
ABI’s McKernan said that sequencing a human genome at 10- to 12-fold coverage on the SOLiD platform would currently cost approximately $40,000 to $50,000 in reagents, based on five to six paired-end runs at a price of approximately $8,000 per run.
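The reagent figures quoted for both platforms follow from multiplying run counts by per-run prices, as the short Python check below shows. It uses only the numbers reported above and covers reagents alone, not “fully loaded” costs; the variable names are illustrative.

```python
# Illumina: 27 paired-end runs at the $5,400 list price per run.
illumina_runs = 27
illumina_price_per_run = 5_400
illumina_reagents = illumina_runs * illumina_price_per_run

# ABI SOLiD: five to six paired-end runs at approximately $8,000 per run.
solid_price_per_run = 8_000
solid_low = 5 * solid_price_per_run
solid_high = 6 * solid_price_per_run

print(f"Illumina, 27 runs: ${illumina_reagents:,}")
print(f"SOLiD, 5 to 6 runs: ${solid_low:,} to ${solid_high:,}")
```

The Illumina total comes to $145,800, in line with the “almost $150,000” cited, and the SOLiD range of $40,000 to $48,000 matches McKernan’s approximate estimate.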