Illumina’s sequencing technology was used by three research teams in China, the UK, and the US that each sequenced a human genome and published their results last week.
The three studies, which appeared in Nature last week, are the first published papers describing how Illumina’s Genome Analyzer platform was used to resequence a human genome at high depth. They also increase the number of peer-reviewed individual human genomes from two to five.
The studies analyzed the genomes of a HapMap sample of African origin, a Han Chinese individual, and a leukemia patient.
The cost of sequencing was lower in all three studies than in previous human-sequencing projects. In addition, the technology improved while the three projects were underway and has continued to improve since they were completed, so subsequent studies using the same platform will cost even less to generate similar data, according to researchers.
The Nature studies are “proof of concept” that second-generation sequencing technologies “can be deployed to discover human DNA variation in an accurate and complete manner,” David Altshuler, a professor of genetics and medicine at Harvard Medical School and co-chair of the 1000 Genomes Project, said by e-mail.
Technically and analytically, the three studies are very similar: They sequenced their respective genomes to comparable depths of coverage, aligned the reads to the same NCBI public reference sequence, and examined SNPs and other variants in a similar manner, according to David Bentley, Illumina’s chief scientist (see table below for a comparison).
But the studies used paired-end reads — which Illumina introduced commercially earlier this year — to varying degrees, so “each study illustrates the fast-evolving pace of a new technology,” he said in an e-mail message last week.
One of the studies, led by researchers at Illumina, sequenced YRI HapMap sample NA18507, which was derived from a male Yoruban from Ibadan, Nigeria. The company presented preliminary results from the project at a conference earlier this year (see In Sequence 2/26/2008).
And besides the three genomes published last week, there are at least six more human genomes being sequenced on the Illumina platform, according to Bentley. Among them are two relatives of the YRI sample, which Illumina said this spring it was sequencing, as well as a woman sequenced by the Leiden Genome Technology Center in the Netherlands (see In Sequence 6/3/2008).
Bentley also disclosed that the 1000 Genomes Project has sequenced the CEPH trio, one of two HapMap trios sequenced at high depth during the project’s pilot phase, using the Illumina platform.
YRI NA18507
According to Illumina’s Nature paper, the scientists generated 135 gigabases of sequence data, or 4 billion paired 35-base reads, covering the genome at an average depth of 40X, over a period of eight weeks starting at the end of 2007. They generated only paired-end reads using a 200-base-pair short-insert library and a 2-kilobase long-insert library. The scientists identified approximately 4 million SNPs, 400,000 short indels, and about 5,700 structural variants.
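The yield and depth figures above follow from simple arithmetic: read count times read length gives total bases, and total bases divided by genome size gives average depth. The following is a minimal Python sketch of that calculation, not the method used in the paper. It assumes a haploid human genome of roughly 3.0 gigabases (a round figure not stated in the article) and treats every sequenced base as aligned, so it gives an upper bound. The function names are illustrative.

```python
# Assumed round figure for the haploid human genome; not stated in the article.
GENOME_SIZE_GB = 3.0


def raw_yield_gb(num_reads: float, read_length_bp: int) -> float:
    """Total sequenced bases, in gigabases, from read count and read length."""
    return num_reads * read_length_bp / 1e9


def raw_depth(total_gb: float, genome_size_gb: float = GENOME_SIZE_GB) -> float:
    """Upper-bound average depth: assumes every sequenced base aligns."""
    return total_gb / genome_size_gb


if __name__ == "__main__":
    # 4 billion reads of 35 bases each, as reported for the Yoruban genome.
    print(raw_yield_gb(4e9, 35))  # 140.0
    # 135 Gb of reported sequence over the assumed 3.0 Gb genome.
    print(raw_depth(135.0))       # 45.0
```

These raw estimates (about 140 gigabases and 45X) sit slightly above the reported 135 gigabases and 40X. That would be expected if the read count is rounded and some reads are filtered or fail to align, though the article does not break this down.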
The sequencing cost $250,000, a sum that includes only reagent list prices and excludes labor, overhead, and analysis costs.
Paired reads, they point out in their paper, improved the accuracy and coverage of the data and “were essential for developing our short indel caller, and for detecting larger structural variants.” The researchers also resolved some structural variants by assembling paired-end data de novo.
Applied Biosystems has been sequencing the same YRI HapMap sample using its SOLiD sequencing technology. The company has shown data from the project at conferences but has not published it yet. According to Kevin McKernan, senior director of scientific operations at ABI, he and his colleagues plan “to validate our Yoruban study” using the improved SOLiD 3 platform and submit their study for publication early next year.
“We believe that the next step is a more in-depth characterization of structural variation in the human genome,” he said in an e-mail message. “In tandem, researchers will continue to develop a comprehensive view of the human transcriptome, its variants, and regulators of gene expression.”
Using improvements to Illumina’s platform, which can now obtain 15 to 20 gigabases per run, the same amount of data as in the published study can now be generated for $50,000 in reagent costs, according to Bentley. Also, “depending on the project requirement,” a lower coverage of 20X to 30X may be sufficient, translating to reagent costs of $25,000 to $30,000.
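As a rough check on these figures, reagent cost can be scaled linearly with depth of coverage. The sketch below is an illustration under that assumption, not Illumina's costing model. It takes the article's reference point of $50,000 for the published study's 40X; the function name and the linear-scaling assumption are not from the source.

```python
def scaled_reagent_cost(target_depth: float,
                        ref_depth: float = 40.0,
                        ref_cost_usd: float = 50_000.0) -> float:
    """Reagent cost at target_depth, assuming cost scales linearly with depth.

    The defaults are the article's reference point: $50,000 for 40X.
    """
    return ref_cost_usd * target_depth / ref_depth


if __name__ == "__main__":
    for depth in (20, 30, 40):
        print(f"{depth}X: ${scaled_reagent_cost(depth):,.0f}")
    # 20X: $25,000
    # 30X: $37,500
    # 40X: $50,000
```

Under strict linear scaling, 20X matches the quoted $25,000, while 30X comes to $37,500, above the quoted upper figure of $30,000. The quoted range may reflect rounding or assumptions that the article does not spell out.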
That compares favorably to other personal genome projects. Last year, scientists published Craig Venter’s genome, which they sequenced using Sanger technology, at a reported cost of about $70 million (see In Sequence 9/4/2007).
And earlier this year, researchers at 454 Life Sciences published Jim Watson’s genome, the first human genome published that was sequenced with a second-generation technology. The cost of the project was about $1 million, including reagents, labor, and depreciation but excluding analysis costs (see In Sequence 4/22/2008).
Today, using the new Titanium chemistry, it would cost approximately $350,000 in reagents to generate 36 gigabases, or 12-fold coverage of a human genome on the 454 platform, according to Michael Egholm, 454’s vice president of research and development — enough to detect 99 percent of heterozygous variants in the genome. “I think it is important that the scientific community can now scrutinize and compare the five genomes,” he said in an e-mail message.
Han Chinese
The second study, led by researchers at the Beijing Genomics Institute at Shenzhen, sequenced the genome of an anonymous Han Chinese man with no known genetic diseases.
The institute announced the project about a year ago (see In Sequence 9/25/2007) and presented some results at a recent conference (see In Sequence 9/30/2008).
The Shenzhen team generated 3.3 billion 35-base-pair reads, or 117.7 gigabases of data — 72 gigabases from unpaired reads and 45.7 gigabases from short paired-end reads with 135-base pair and 440-base pair inserts — or 36-fold average coverage of the genome.
They discovered approximately 3 million SNPs, about 135,000 small indels, and roughly 2,700 structural variations. Like the Illumina scientists, the Shenzhen researchers assembled a small percentage of the reads de novo. Because the paired-end insert sizes were small, the study was limited in the size of insertions it could discover.
AML
The third study, by a team led by Washington University in St. Louis, sequenced two genomes from a female patient with acute myeloid leukemia: one from a sample of her skin and one from her tumor (see related article, this issue).
According to Elaine Mardis, co-director of the Genome Center at Wash U, who presented the results of the AML project at a recent conference, the data was generated between August 2007 and early this year (see In Sequence 10/14/2008).
The scientists obtained 98 gigabases, or about 33-fold coverage, of 32-base-pair reads for the tumor genome and 42 gigabases, or about 14-fold coverage, of 35-base-pair reads for the normal skin sample. They discovered about 2.65 million single nucleotide variants in the tumor, of which 2.58 million were also present in the skin genome. They also found about 700 putative small indels in coding exons. After further analysis, they reported 10 genes with acquired mutations in the coding portion of the tumor genome.
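The tumor-versus-skin comparison comes down to a subtraction: variants also seen in the skin genome are presumed inherited, and the remainder are candidates for acquired mutations. The Python sketch below works through that step using the article's rounded counts. The function names are illustrative, and the further filtering that narrowed the candidates to 10 genes is not reproduced here.

```python
def shared_fraction(tumor_variants: int, shared_with_skin: int) -> float:
    """Fraction of tumor variants that are also present in the skin genome."""
    return shared_with_skin / tumor_variants


def tumor_only_variants(tumor_variants: int, shared_with_skin: int) -> int:
    """Variants seen in the tumor but not in skin: candidate acquired changes."""
    return tumor_variants - shared_with_skin


if __name__ == "__main__":
    tumor = 2_650_000   # "about 2.65 million" in the article
    shared = 2_580_000  # "2.58 million" also present in skin
    print(f"{shared_fraction(tumor, shared):.1%}")  # 97.4%
    print(tumor_only_variants(tumor, shared))       # 70000
```

Because both inputs are rounded, the figure of roughly 70,000 tumor-only variants is an order-of-magnitude estimate rather than a count reported by the study.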
Unlike the other two studies, this one focused on disease, examining “the premier medical research application of cancer,” according to Bentley.
The study “demonstrated as a possibility” that researchers could sequence the genomes of individuals for healthcare applications rather than research studies, said Richard Gibbs, director of the Human Genome Sequencing Center at Baylor College of Medicine.
A New Era
Commenting on the Illumina and BGI studies in a review in the same issue of Nature last week, Sam Levy and Bob Strausberg from the J. Craig Venter Institute point out that for technical reasons, both studies preferentially detected deletions over insertions.
Also, like Venter’s and Watson’s genomes, they “do not accurately define copy-number variants at the nucleotide level,” and do not provide a complete haplotype assembly.
“This is just the beginning of the era of the individual genome,” they write. “Soon, association studies using complete individual genomes will become the approach of choice for understanding the complexity of human biology and disease.”
By the Numbers: Published Human Genomes Sequenced Using Illumina's GA
| | Yoruba (YRI NA18507) | Han Chinese | AML Patient |
| --- | --- | --- | --- |
| Bases | 135 Gb | 117.7 Gb (72 Gb from fragment reads, 45.7 Gb from paired-end reads) | 98 Gb (tumor); 41.8 Gb (skin) |
| Reads (billion) | 4 | 3.3 | 3 (tumor); 1.2 (skin) |
| Average depth | 40X | 36X (22.5X fragments, 13.5X paired reads) | 33X (tumor); 14X (skin) |
| Average read length | 35 bp | 35 bp | 32 bp |
| DNA libraries | 200 bp short-insert library; 2 kb long-insert library | 8 fragment libraries; 135 bp insert library; 440 bp insert library | Fragment libraries (4 tumor, 3 skin) |
| Alignment tool | MAQ, Eland | SOAP | MAQ |
| SNPs | Approx. 4 million | 3.1 million | 3.8 million (tumor); 2.9 million (skin) |
| SNPs not in dbSNP | Approx. 1 million | 420,000 | Approx. 1.4 million (tumor) |
| Indels | 400,000 | 135,300 | N/A |
| Structural variants | 5,700 | 2,700 | N/A |
| Cost at time of study | $250,000 (reagents only) | < $500,000 (total) | $700,000 (total) |
| Estimated cost today | $50,000 (reagents only) | N/A | $200,000 (total) |