SAN FRANCISCO – An official from the Beijing Genomics Institute said this week that BGI has developed a short-read de novo assembly algorithm that is capable of assembling a human genome.
Wang Jun, associate director of BGI, told attendees of Cambridge Healthtech Institute's Genomic Tools and Technologies Summit here that a paper on the method is currently under review, and that his team has successfully used it to assemble the panda genome, which it is sequencing with the Illumina Genome Analyzer.
The panda genome assembly "shows that human assembly is possible," he said, though he did not disclose any performance benchmarks or assembly statistics for the approach.
The algorithm, called SOAPdenovo, is part of the SOAP (Short Oligonucleotide Analysis Package) suite of tools developed at BGI, which also includes an alignment tool called SOAPaligner, a resequencing consensus sequence builder called SOAPsnp, an indel finder called SOAPindel, and a structural variation finder called SOAPsv.
BGI launched version 1.03 of SOAPdenovo this week. According to the BGI website, it was developed to handle "large plant and animal genomes," but also works well on bacterial and fungal genomes. SOAPdenovo runs on 64-bit Linux and requires 150 GB of memory to assemble a human-sized genome.
BGI is not alone in its quest to develop an effective de novo assembly algorithm that can work with human-scale genomes. Other academic groups and commercial bioinformatics firms are also developing such tools. CLC Bio, for example, claims that a new assembler it is developing can assemble a human genome with 17 hours of CPU time and 32 GB of RAM [BioInform June 5, 2009].
Wang said in his talk that de novo assembly is a key requirement for human genome sequencing because it is much better at detecting structural variants than current resequencing approaches, which align the short reads to a reference genome.
As an example, he noted that his team used the new algorithm to identify an 8-kilobase insertion in the Han Chinese genome that BGI published in Nature last fall.
For that paper, the BGI researchers aligned the reads to the National Center for Biotechnology Information's reference genome. Wang said that the researchers did not detect that insertion until they used SOAPdenovo. "We could only find it with de novo assembly, not mapping to the reference genome," he said.
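Wang's point about insertions can be illustrated with a toy sketch. It is not BGI's method: the article does not describe SOAPdenovo's internals, and the sequences, function names, read length, and k-mer size below are all invented for the example. The sketch assumes error-free, single-strand reads, uses exact substring matching as a stand-in for an aligner, and assembles by walking a small de Bruijn graph, a technique commonly used for short reads.

```python
# Toy illustration only: error-free reads, one strand, exact matching.
# All sequences and function names here are made up for the example.

LEFT = "ATGGCGTACCTGA"
INSERTION = "TTCAGGCAAT"      # present in the sample, absent from the reference
RIGHT = "CGGATCTAGTCCA"

REFERENCE = LEFT + RIGHT
SAMPLE = LEFT + INSERTION + RIGHT


def substrings(seq, length):
    """Return every substring of the given length, in order."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def unmapped_reads(reads, reference):
    """Reads with no exact match in the reference (stand-in for an aligner)."""
    return [read for read in reads if read not in reference]


def assemble(reads, k):
    """Rebuild one contig by walking a de Bruijn graph of (k-1)-mers.

    Assumes the graph is a single unbranched path, which holds for this
    toy data but not for real genomes with repeats.
    """
    successors = {}
    for read in reads:
        for kmer in substrings(read, k):
            successors.setdefault(kmer[:-1], set()).add(kmer[1:])

    has_predecessor = {n for nxt in successors.values() for n in nxt}
    starts = [n for n in successors if n not in has_predecessor]
    if len(starts) != 1:
        raise ValueError("toy assembler expects exactly one start node")

    node = starts[0]
    contig = node
    seen = {node}
    while node in successors:
        if len(successors[node]) != 1:
            raise ValueError("branch in graph; toy assembler cannot resolve it")
        node = next(iter(successors[node]))
        if node in seen:
            raise ValueError("cycle in graph; toy assembler cannot resolve it")
        seen.add(node)
        contig += node[-1]
    return contig


READS = substrings(SAMPLE, 12)   # full, error-free coverage of the sample

if __name__ == "__main__":
    missed = unmapped_reads(READS, REFERENCE)
    contig = assemble(READS, 6)
    print(len(missed), "of", len(READS), "reads do not match the reference")
    print("insertion recovered by assembly:", INSERTION in contig)
```

In this toy, every read that overlaps the inserted sequence fails to match the reference, while the assembly reconstructs the sample, insertion included, without consulting the reference at all. Real aligners tolerate mismatches and small gaps, and real genomes contain repeats that make the graph branch, so the sketch shows only the principle.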
Several other speakers at the conference also noted the shortcomings of alignment relative to de novo assembly. Stephen Kingsmore, president and CEO of the National Center for Genome Resources, said that the NCBI reference sequence has many gaps, which have a "big impact" on resequencing projects that rely on alignment. In addition, he noted that there are "many errors" in the reference genome, and that it is not well annotated.
"We're really going to need a better reference genome if we're going to go with alignment over assembly" for resequencing, he said.