Large sequencing centers in the US and the UK have begun applying their high-throughput sequencing platforms to human resequencing projects and presented initial data from several studies at last week’s Biology of Genomes meeting at Cold Spring Harbor Laboratory.
Richard Durbin, a principal investigator at the Wellcome Trust Sanger Institute, gave an update on the 1,000 Genomes Project, which was launched in January (see In Sequence 1/22/2008).
Five institutes are generating sequence data for the project using high-throughput platforms from 454 Life Sciences, Illumina, and Applied Biosystems. They are the Sanger Institute, Beijing Genomics Institute Shenzhen, the Broad Institute of MIT and Harvard, Washington University School of Medicine’s Genome Sequencing Center, and Baylor College of Medicine’s Human Genome Sequencing Center (see also Transcript in this issue).
The project is poised to generate “vast amounts of data,” according to Durbin, who co-chairs the project’s steering committee. By the end of this year, the 1,000 Genomes consortium plans to complete three pilot projects and generate a total of 2 terabases of sequence data, he said.
Since the consortium started sequencing its first samples in February, it has generated more than 300 gigabases of sequence data, he said, much of which has been submitted to short-read archives at the National Center for Biotechnology Information and the European Bioinformatics Institute.
That amount of data is already more than what is currently stored in GenBank, which as of February contained almost 86 gigabases of sequence data in its traditional divisions and almost 109 gigabases in its whole-genome shotgun division.
Under the first pilot project, consortium scientists have been sequencing three batches of 60 HapMap samples from three different populations at low coverage. Samples of Northern European origin from Utah will be covered with four-fold redundancy, while samples originating from Nigeria and East Asia (China and Japan) will each be sequenced with two-fold redundancy. Ten of these genomes are now complete, according to Durbin, who did not mention which sequencing platforms have been used in this pilot.
The second pilot project is an in-depth sequence analysis of two HapMap trios, each consisting of parents and a child — one trio originating from Northern Europe and the other one from Nigeria. These are currently being sequenced with 20-fold coverage. The European trio is mostly complete, Durbin reported, while the African samples are at an earlier stage.
Chad Nusbaum, co-director of the Broad Institute’s genome-sequencing and analysis program, said in a separate talk that researchers at his center have generated “deep coverage” of the progeny of the European trio using the Applied Biosystems SOLiD platform. Within six weeks, they generated more than 50 gigabases of sequence from 4.5 runs on two instruments, churning out 13.4 gigabases of sequence data in their best run. Using ABI’s software, they have also started calling SNPs in the data.
Durbin said that the third pilot, which aims to sequence 1,000 genes in 1,000 HapMap individuals, has recently started.
Following a data freeze in mid-April, consortium members have been conducting preliminary analyses of the data, which at the time consisted of 185 gigabases from the European trio, 20 gigabases from the African trio, and 32 gigabases from the low-coverage HapMap samples, he said.
As part of the analysis, researchers have called SNPs and short indels in the European trio. While false-positive calls were rare, almost all false-negative calls involved heterozygous SNPs, Durbin reported, suggesting that a coverage of more than 20-fold is needed in the trios to be able to see both alleles consistently.
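To illustrate why heterozygous sites are the ones most easily missed, the following is a minimal sketch of a simple sampling model, not the consortium's method. It assumes that every read at a heterozygous site comes from either allele with equal probability, that the site is covered by exactly `depth` reads, and that a caller needs at least `min_reads` reads per allele; the function name and the `min_reads` threshold are hypothetical.

```python
from math import comb


def prob_both_alleles_seen(depth: int, min_reads: int = 2) -> float:
    """Chance that each allele at a heterozygous site gets >= min_reads reads.

    Assumes exactly `depth` reads cover the site and that each read samples
    one of the two alleles with probability 0.5 (an illustrative assumption).
    """
    if depth < 2 * min_reads:
        return 0.0
    # Sum over k = number of reads from the first allele, keeping only the
    # outcomes in which both alleles reach the min_reads threshold.
    favorable = sum(
        comb(depth, k) for k in range(min_reads, depth - min_reads + 1)
    )
    return favorable / 2 ** depth


if __name__ == "__main__":
    for depth in (4, 10, 20, 30):
        print(depth, round(prob_both_alleles_seen(depth), 4))
```

Under this model the chance of seeing both alleles rises steeply with depth. In real data the depth at any one site varies around the genome-wide average, so a 20-fold average still leaves some sites with far fewer reads.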
The analysis of structural variations, captured by paired-end sequence data, is still ongoing and involves several groups, he said. They also plan to perform a phylogenetic analysis of the mitochondrial DNA, which is covered very deeply in the data, he added.
Other genome center groups at the conference presented initial data from cancer genome sequencing projects, involving both Illumina’s Genome Analyzer and ABI’s SOLiD platform. Such projects are expected to increase in scale over time as the International Cancer Genome Consortium gains momentum (see In Sequence 4/29/2008).
The ICGC, which launched last month, serves as an umbrella for various independently funded cancer genome projects worldwide and hopes to sequence 25,000 samples from 50 different cancer types.
For example, Rick Wilson, director of the Genome Center at Washington University School of Medicine, reported that his center has recently used Illumina’s Genome Analyzer to sequence the genome of an acute myeloid leukemia sample from a deceased patient.
The researchers, he said, have sequenced the genome at 30-fold coverage, generating almost 95 gigabases of data from unpaired sequence reads deriving from 98 runs. That coverage allowed them to detect more than 95 percent of SNPs from a set of known SNPs that were produced by microarray analysis, he said, and discover somatic mutations in four cancer-related genes.
In addition, they have sequenced a normal skin sample from the same patient at 12-fold coverage, generating 37.1 gigabases of data.
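As a rough check on how the reported data volumes relate to fold coverage, the sketch below divides total sequence by genome size. The 3.1-gigabase figure for the human genome is an assumption made here, not a number Wilson gave, and the calculation treats all reported sequence as usable, which may overstate the effective coverage.

```python
# Assumed approximate haploid human genome size, in gigabases (not from the talk).
HUMAN_GENOME_GB = 3.1


def fold_coverage(sequence_gb: float, genome_gb: float = HUMAN_GENOME_GB) -> float:
    """Average fold coverage: total sequence divided by genome size."""
    return sequence_gb / genome_gb


if __name__ == "__main__":
    # Data volumes reported for the AML tumor and the matched skin sample.
    for label, gigabases in (("tumor", 95.0), ("skin", 37.1)):
        print(label, round(fold_coverage(gigabases), 1))
```

With that assumed genome size, both results land close to the 30-fold and 12-fold figures cited for the tumor and skin samples.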
Analyzing the AML genome, they found almost 4 million SNPs, more than 25 percent of which are shared with both the Venter genome and the Watson genome, he said.
In the next year, the Genome Center plans to sequence five more AML genomes using second-generation sequencing technologies, and to analyze them for somatic mutations. The researchers also want to sequence the transcriptomes of these samples, according to Wilson.
Meanwhile, researchers from the Sanger Institute’s Cancer Genome Project team have been sequencing lung cancer and breast cancer samples using paired-end reads from Illumina’s Genome Analyzer and ABI’s SOLiD platform.
Recently, the group published a study in Nature Genetics in which they used low-coverage paired-end Solexa sequencing of short DNA fragments, 200 to 500 base pairs in length, to identify several hundred somatic and germline mutations, including rearrangements, insertions, deletions, and copy number variations, in two lung cancer cell lines (see In Sequence 4/29/2008).
At last week’s conference, Erin Pleasance, a researcher at the Sanger Institute, showed similar data for a breast cancer cell line, which had more complex rearrangements than the two lung cancer samples.
Andy Futreal, co-head of the Cancer Genome Project, presented another study at the conference, in which his team, in collaboration with ABI researchers, sequenced a different lung cancer cell line and its matching lymphoblastoid control sample on the SOLiD platform. Fifteen megabases of exon sequences in that cell line had previously been characterized by PCR-based Sanger sequencing, providing the researchers with a reference set of SNPs against which to compare their results.
Using a mix of sequence libraries with 600-base-pair, 2-kilobase, and 4-kilobase inserts, the researchers generated a total of approximately 59 gigabases of data, about 8-fold coverage for the tumor and 12-fold for the normal sample. The Sanger Institute produced about 40 gigabases of data while ABI generated the remaining 19 gigabases.
So far, the scientists have analyzed the data at 4-fold sequence coverage and have found that 85 percent of the known variants are covered by at least one sequence read. Comparing the data with Sanger data from the same sample, they found that the error rate after filtering was less than 0.1 percent.
According to Futreal, the scientists are aiming to push the sequence coverage to 20-fold for each of the samples, so they can start calling de novo variants in the “not-too-distant future.”