By Monica Heger
This story has been updated from a previous version to clarify information in the paper's supplementary material in response to outside comments.
Scientists at the Beijing Genomics Institute in Shenzhen reported this week in Nature that they sequenced and assembled a draft genome of the giant panda using only the Illumina Genome Analyzer platform and the SOAPdenovo algorithm for assembly.
The draft genome was sequenced to 73-fold coverage and was 94 percent complete. The study is one of the first in which researchers sequenced and de novo assembled a large mammalian genome using only short-read technology. Last week, the BGI team published results from de novo assemblies of two human genomes using the Illumina technology (see In Sequence 12/8/2009).
The researchers used a whole-genome shotgun sequencing strategy and constructed 37 paired-end libraries with an average read length of 52 base pairs. The N50 contig length was 40 kilobases, and the N50 scaffold length was 1.3 megabases. They estimated the size of the genome to be about 2.4 gigabases.
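For readers unfamiliar with the N50 statistic quoted above: it is the length at which contigs (or scaffolds) of that size or longer account for at least half of the assembled bases. The sketch below shows the standard calculation on a made-up list of contig lengths. The function name and the toy numbers are illustrative assumptions, not values from the panda assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half
        # of all assembled bases is the N50.
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70
```

Because the statistic is weighted by length, a handful of long contigs can raise N50 sharply even when many short contigs remain in the assembly.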
To predict the number of genes in the panda genome, they compared it to both human and dog genomes and created a reference gene set with 21,001 genes. They also identified an estimated 2,534 genes that are present in the panda genome but not present in the human, mouse, or dog genomes.
The sequence assembly was performed on a computer with 32 central processing unit cores and 512 gigabytes of random access memory, and one round of assembly took a few days. The SOAPdenovo algorithm was key to the assembly's accuracy, lead author Jun Wang told In Sequence in an e-mail. "The different insert sizes [for] paired-end sequencing is a key factor here," he said. "SOAPdenovo uses the de Bruijn graph algorithm and applies a stepwise strategy to make it feasible to assemble the panda genome."
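To give a sense of what a de Bruijn graph approach involves, the toy sketch below breaks reads into overlapping k-mers, links each k-mer's prefix to its suffix, and follows an unbranched path to rebuild a contig. It is a generic illustration under simplifying assumptions (error-free reads, no repeats, a single strand), not SOAPdenovo's implementation. The function names, the reads, and the choice of k are all hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


def walk_unique_path(graph, start):
    """Extend a contig from `start` while exactly one unvisited successor exists."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop around a cycle
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Three hypothetical overlapping reads drawn from the sequence ATGGCGTGCA.
toy_reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(toy_reads, k=4)
print(walk_unique_path(graph, "ATG"))  # ATGGCGTGCA
```

In real genomes, repeats create branches in the graph that a simple walk like this cannot resolve. That is where paired-end libraries with different insert sizes, which Wang highlights, help link contigs into scaffolds.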
To evaluate the assembly accuracy of the scaffolds, they sequenced nine BACs using Sanger technology and aligned them to the scaffolds. About 98 percent of the BAC regions were covered, and the researchers did not see any major assembly errors. Wang said that recent improvements to Illumina's platform allow for longer reads, which would have improved the assembly accuracy even more.
The panda project "shows the power of high-throughput sequencing and the use of shotgun combined with long paired-end reads," said Kjetill Jakobsen, leader of the consortium that recently sequenced the cod genome with 454 technology and a professor of biology at the Centre for Ecological and Evolutionary Synthesis at the University of Oslo.
However, William Davidson, a professor of molecular biology and biochemistry at Simon Fraser University who is involved in the project to sequence the Atlantic salmon genome, questioned a section of the paper's supplementary material, which explains that gaps containing tandem repeats that could not be resolved were filled in with Ns and that the contig N50 size grew from 1,483 bp to 39.9 kb.
BGI's Wang explained in a follow-up e-mail that the Ns in the supplementary material referred to gaps between, not within, contigs.
In the supplementary data, "we explained that ... the tandem repeats that could not be deciphered had been put as 'N' in a scaffold, and seen as a gap between contigs," he wrote.
Wang stressed that the N50 contig size for the final assembly is 39.9 kb, while the N50 scaffold size is 1.3 Mb.
The giant panda is an endangered species, with only an estimated 2,500 to 3,000 animals left, and knowing the genome could help with conservation efforts. One interesting finding was that the panda appears to lack the genes that code for the enzymes needed to break down bamboo, its primary food source. Instead, the panda's genome suggests that it should be more carnivorous than it actually is. The authors speculate that the panda's ability to live on bamboo depends on its gut microbiome rather than its own genetic makeup.
Wang said the study showed that short-read technology such as the Illumina Genome Analyzer is appropriate for generating mammalian draft genome sequences. "We think the assembly quality is comparable to traditional Sanger assembly, and can be used in annotation, comparative genomics, resequencing, and other analysis," he said.