Researchers at the European Bioinformatics Institute are developing an algorithm for assembling very short reads of DNA produced by next-generation sequencers, a step that could enable de novo assembly of mammalian genomes — a task that many consider to be too complex for these systems.
The method, called Velvet, is being developed by EBI senior scientist Ewan Birney and Daniel Zerbino, a second-year PhD bioinformatics candidate at the institute.
The growth of next-generation sequencing data is creating demand for new ways of assembling genomes, and the search has attracted several groups. Birney and Zerbino join other efforts underway to create new assembly algorithms for very short reads of around 30 base pairs, including projects at the Broad Institute, the British Columbia Cancer Agency’s Genome Sciences Centre, Stony Brook University, and elsewhere [BioInform 04-13-07].
Describing their approach at the Genome Informatics conference held at Cold Spring Harbor Laboratory Nov. 1-5, Birney and Zerbino explained that Velvet has two modules: Tour Bus, which removes errors from the data; and Breadcrumb, which resolves repeated regions of the genome using paired reads.
The method diverges from traditional assemblers that are based on the so-called "overlap-layout-consensus" approach, which treats each read as a separate entity. Velvet instead relies on a directed graph representation called a de Bruijn graph, which organizes the data by word lengths, or k-mers, in which each k-mer appears as a single node in the graph regardless of how many times it is observed.
This approach is better than existing assemblers at accounting for the redundancy inherent in short-read data, according to Birney and Zerbino.
“Quite simply, it has a different structure” than other assemblers, Birney told BioInform in an interview after the conference. “The nodes are k-mers, subsequences of a particular type, and the edges show that the two k-mers were present, adjacent to each other, in an observed read.”
He added that such a graph produces a “radically different read” than other assemblers and is particularly suited for repeated regions of the genome.
As an example, he said that a perfect repeat may be present multiple times in a genome. “In the de Bruijn graph, there’s only one place in the graph where that perfect repeat is. Whereas in traditional assemblers, each read representing that repeat will now have multiple edges to the other reads crossing that repeat.”
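The graph structure Birney describes can be illustrated with a minimal Python sketch. This is not Velvet's code: the function name `build_de_bruijn`, the toy genome, and the read and k-mer lengths are all assumptions chosen for illustration, and the sketch ignores reverse-complement strands and sequencing errors. It shows only that a k-mer becomes a single node however often it is observed, so a repeat occupies one place in the graph.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph whose nodes are k-mers (illustrative only)."""
    coverage = defaultdict(int)  # k-mer -> number of times it was observed
    edges = defaultdict(set)     # k-mer -> k-mers seen directly after it in a read
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            coverage[kmer] += 1
        # An edge records that two k-mers were adjacent in an observed read.
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return dict(coverage), {node: sorted(nxt) for node, nxt in edges.items()}


# Hypothetical toy genome in which the repeat "GATTACA" occurs twice.
genome = "CCGATTACATTGATTACAGG"
# Error-free 10-base reads starting at every position (an idealized assumption).
reads = [genome[i:i + 10] for i in range(len(genome) - 9)]

coverage, edges = build_de_bruijn(reads, k=5)

# The repeat's k-mers appear once as nodes; the graph branches where the
# two copies of the repeat are followed by different sequence.
print(len(coverage), "distinct k-mer nodes")
print("TTACA ->", edges["TTACA"])
```

In this toy case both copies of the repeat collapse onto the same nodes, and the node for the last k-mer of the repeat has two outgoing edges, one for each flanking sequence.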
Zerbino said the method was inspired by Euler, an assembly algorithm for Sanger data that is also based on the de Bruijn graph representation. Euler was originally published in 2001 by Pavel Pevzner of the University of California, San Diego, and Michael Waterman of the University of Southern California.
“Rather than using what [Pevzner] calls the Eulerian path to resolve the graph, we … resolved the graph using data, just rather classic data-processing aspects of the data, to improve the representation of the assembly using the de Bruijn graph,” Zerbino said. “So although both Euler and Velvet use de Bruijn graphs, they actually manipulate those graphs in very, very different ways.”
Birney said that Velvet was designed to meet the need, created by next-generation sequencing data, for new ways of thinking about genome assembly. “Traditional assemblers just can’t deal with next-generation sequencing at all,” he said. “There are just too many reads.”
In addition, he said, the traditional focus on overlapping regions poses difficulties because the overlaps “would be much longer than the current length of these reads.”
The EBI researchers are not the only ones tackling this challenge. Steve Skiena of Stony Brook University is developing an algorithm called Shorty for the de novo assembly of read lengths of between 20 and 30 base pairs, and René Warren of the BC Genome Sciences Center has developed SSAKE (Short Sequence Assembly by progressive K-mer search and 3’ read Extension) for assembling 25-mers into longer contigs.
Other short-read assemblers in development include VCAKE (Verified Consensus Assembly by K-mer Extension) from William Jeck and colleagues at the University of North Carolina, Chapel Hill, and SHARCGS (Short read Assembler based on Robust Contig extension for Genome Sequencing) from Juliane Dohm and colleagues at the Max Planck Institute for Molecular Genetics.
Zemin Ning of the Wellcome Trust Sanger Institute, who co-developed the Phusion assembler for Sanger sequencing data, is also looking into the assembly challenges of short-read data. At the Genome Informatics conference, he discussed an algorithm for assembling paired-end Solexa reads by k-mer extension.
The approach has two steps, he explained to BioInform after the conference. First, it extends reads of 30 to 40 base pairs into normal reads “like traditional ABI category reads” of approximately one kilobase, Ning said. Then it uses a Sanger assembler such as Phrap or Phusion to complete the assembly.
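The general idea of extending a short read at its 3' end can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used by Ning, SSAKE, or any other tool named here: the function name `extend_3prime`, the `min_overlap` parameter, and the toy reads are hypothetical, and the sketch assumes error-free, single-strand reads.

```python
def extend_3prime(seed, reads, min_overlap):
    """Greedily extend `seed` at its 3' end with reads overlapping its suffix."""
    contig = seed
    unused = [r for r in reads if r != seed]
    while True:
        best = None  # (overlap length, read) for the longest overlap found
        for read in unused:
            # Try the longest possible prefix of the read first.
            for overlap in range(len(read) - 1, min_overlap - 1, -1):
                if contig.endswith(read[:overlap]):
                    if best is None or overlap > best[0]:
                        best = (overlap, read)
                    break
        if best is None:
            return contig  # no remaining read overlaps the 3' end
        overlap, read = best
        contig += read[overlap:]  # append only the bases past the overlap
        unused.remove(read)


# Hypothetical toy data: 8-base reads sampled every 2 bases from a short sequence.
genome = "ACGTTGCATGGACCTAGT"
reads = [genome[i:i + 8] for i in range(0, len(genome) - 7, 2)]

contig = extend_3prime(reads[0], reads, min_overlap=5)
print(contig)
```

A greedy extension like this can be misled when a repeat is longer than the overlap being matched, which is one reason the tools described here add further checks or, in Velvet's case, use paired reads.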
Birney and Zerbino claim that one feature that sets Velvet apart from other methods is its Tour Bus error-removal function, which is designed to handle both sequencing errors and biological variations such as polymorphisms. The method “provides close to perfect error resolution of the de Bruijn graphs,” they note in the abstract for their talk.
In a poster describing Velvet on the EBI website, Birney and Zerbino claim that for contigs longer than 100 base pairs, the method had no misassemblies and showed error rates of 0.02 percent for a human BAC and 0.004 percent for Streptococcus suis.
At the conference, Paul Havlak of the National Human Genome Research Institute’s Genome Technology branch discussed the use of Velvet in a project to apply Solexa sequencing to finish several mammalian genomes with unresolved sequencing gaps.
Havlak and his colleagues selected 10 BACs from these species, pooled them, and sequenced them on an Illumina Genome Analyzer.
The best initial run of Velvet produced 3,222 contigs of at least 100 bases, with an average length of 316 bases, according to Havlak. “It’s pretty good for short reads,” he told BioInform, and for “trying to fill in gaps, extending from known sequence from Sanger reads.”
Havlak said that NHGRI began using the Illumina machine in July and that his group has not tried other short-read assemblers.
Whether he would use Velvet again is uncertain.
“I’d say the jury’s still out on whether this is a big payoff; but you don’t need to fill many gaps before it pays off,” he said.
Velvet is freely available online.
Bernadette Toner contributed to this report.