Researchers at the Beijing Genomics Institute in Shenzhen have analyzed the genome of an anonymous Chinese man, which they sequenced a year ago using Illumina’s Genome Analyzer.
According to their analysis, which is scheduled to appear in an upcoming issue of the journal Nature, short-read sequencing technologies are well suited for sequencing large eukaryotic genomes, as long as a reference genome sequence is available.
BGI first started talking about the project, at the time dubbed the “First Asian Diploid Genome Project,” a year ago (see In Sequence 9/25/2007). The study is part of the larger Yanhuang project, which aims to sequence at least 100 Chinese individuals over three years. The project, announced in January (see In Sequence 1/8/2008), aims to study genetic polymorphisms in the Chinese population.
Last week, Laurie Goodman, a contract public information officer and editor for BGI-Shenzhen, presented results from the analysis of the first genome at Cambridge Healthtech Institute’s Exploring Next-Generation Sequencing conference in Providence, RI.
Goodman spoke on behalf of Jun Wang, associate director of BGI-Shenzhen, who was unable to obtain a visa to attend the conference. According to Goodman, the full cost of the project was approximately $500,000. The study has been “accepted in principle” by the journal Nature, she said. BGI-Shenzhen announced the acceptance of the manuscript on its website earlier this month.
Goodman said it took BGI scientists approximately two months to generate the data on five Illumina Genome Analyzers, each week yielding between 4 and 8 gigabases of high-quality data.
In total, the scientists generated 3.3 billion reads on the instruments, which they mapped against the NCBI reference genome, covering approximately 99.97 percent of it.
The total coverage of the genome was 36-fold. Of that, 22.5-fold coverage came from unpaired reads and 13.5-fold coverage from paired-end reads. The data covered the autosomes to 34-fold depth, and the X and Y chromosomes to 19-fold depth. The read length varied from 25 to 44 base pairs, though the majority of reads were 35 base pairs long.
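The coverage figures above can be sanity-checked with simple arithmetic. The following is a minimal sketch, not BGI's method: the genome size (about 3.1 gigabases) and the use of 35 base pairs as the mean read length are assumptions made for illustration, and the function name is hypothetical. Because the estimate counts every read rather than only the mapped, high-quality bases, it should be expected to land near the reported 36-fold figure, not exactly on it.

```python
def fold_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Rough sequencing depth: total bases sequenced divided by genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp


# Figures reported in the article.
UNPAIRED_FOLD = 22.5   # coverage from unpaired reads
PAIRED_FOLD = 13.5     # coverage from paired-end reads
TOTAL_READS = 3.3e9    # reads generated on the instruments

# Assumptions for illustration only (not stated in the article).
ASSUMED_GENOME_SIZE_BP = 3.1e9   # approximate size of the human reference genome
ASSUMED_MEAN_READ_LEN_BP = 35    # most reads were 35 bp; used here as the mean

# The two read types should add up to the reported total coverage.
total_fold = UNPAIRED_FOLD + PAIRED_FOLD

# Back-of-the-envelope depth from read count and read length.
rough_fold = fold_coverage(TOTAL_READS, ASSUMED_MEAN_READ_LEN_BP, ASSUMED_GENOME_SIZE_BP)

print(f"Reported components sum to {total_fold:.1f}-fold")
print(f"Rough estimate from read count: {rough_fold:.1f}-fold")
```

Under these assumptions, the estimate comes out slightly above the reported total, which is consistent with some reads being shorter than 35 base pairs or failing to map.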
The researchers identified more than 3 million SNPs in the genome, of which 13.6 percent were not contained in dbSNP. In addition, they discovered 135,000 small indels one to three base pairs in size, as well as 2,682 structural variants.
A comparison with the SNPs identified in the published genome sequences of Craig Venter and Jim Watson showed that the three genomes share approximately 1.2 million SNPs.
The researchers also compared their results against SNPs discovered using the Illumina HapMap 1M BeadChip and found that they covered approximately 99.22 percent of those SNPs by sequencing. They also validated SNPs that were inconsistent between the two platforms by PCR-based Sanger sequencing and found that for more than 80 percent of the inconsistencies, the Illumina sequence data were accurate.
Unsurprisingly, an analysis of the genetic background of the Chinese donor revealed that he is 94 percent Asian.
The researchers concluded that the Illumina sequencing technology is well suited for resequencing large eukaryotic genomes, such as the human genome, as long as a reference sequence is available.
They found that the technology allows for “extremely accurate” detection of SNPs and of insertions or deletions up to 3 base pairs in size. However, a mix of long and short reads is still required to detect longer inserts.
BGI-Shenzhen is now applying the experience it gained from its first genome project to the 1000 Genomes Project, where it is responsible for sequencing Asian HapMap individuals as part of one of the pilot studies.