A Baylor College of Medicine team has come up with an automated approach for using long Pacific Biosciences RS reads to upgrade existing draft genomes, filling in and narrowing gaps in assembly scaffolds.
In a paper published in PLoS One last month, the researchers described the method, which iteratively weaves PacBio long reads into a consensus sequence to fill in missing sequence between contigs in a scaffold. Because the approach is reference-guided, much of the existing annotation for the draft genome can be carried forward during the upgrade process.
The team has released a related software tool, dubbed PBJelly, through the open source site SourceForge.
"[PBJelly] verbosely logs all improvements to the draft genome," the authors wrote, "which enables identification and rejection of questionable gap fills, and production of an annotation coordinates lift-over table."
For the study, investigators demonstrated the feasibility of using PBJelly to upgrade a simulated Drosophila melanogaster genome before applying it to the draft genome of another fruit fly, D. pseudoobscura. With 24-fold mapped PacBio coverage, for example, they closed or improved more than 80 percent of the gaps tackled in the D. pseudoobscura genome.
Similarly, the team showed that they could improve draft genomes for a bird, the budgerigar, and a primate, the Sooty mangabey, by plugging those draft genomes into PBJelly along with PacBio reads representing 4-fold and 6.8-fold mapped coverage of the genomes, respectively.
"Those projects were done using Illumina as the primary dataset," explained the BCM Human Genome Sequencing Center's Kim Worley, a co-author of the new study. In contrast to hybrid assembly methods that mesh PacBio reads with other read data from the earliest stages of assembly, "we assembled the Illumina data and then used the PacBio data to fill the gaps."
In its current iteration, PBJelly does not address some of the most difficult-to-finish parts of genomes, such as very large sequence gaps or centromeric regions of chromosomes.
But the team is looking at ways to keep improving PBJelly so that it can be used to fill an even broader range of scaffold gaps in draft genome assemblies.
"The method as it is today addresses gaps within scaffolds," co-author Stephen Richards, a researcher with BCM's Human Genome Sequencing Center, told In Sequence. "We hope to, in a future version, address gaps between scaffolds."
"I think it makes tremendous sense — especially if you have a pretty good Illumina-only assembly and low coverage of long reads — to really try to focus in the analysis of those long reads to patch gaps," Michael Schatz of Cold Spring Harbor Laboratory told In Sequence.
Schatz was not involved in the current study, though he contributed to the development of AMOS (IS 5/20/2008) — open source software that preceded PacBio's de novo assembler ALLORA, which is part of the PBJelly pipeline.
At the Plant and Animal Genomes meeting early this year, Schatz outlined his team's strategy for assembling genomes de novo using a hybrid method that error-corrects long PacBio reads using reads from other platforms such as Illumina or Roche 454 (IS 1/24/2012).
That approach, called pacBioToCA, spearheaded by Sergey Koren with the University of Maryland and the National Biodefense Analysis and Countermeasures Center, was further described in a Nature Biotechnology paper this summer.
Researchers from the Broad Institute are also using PacBio reads for de novo hybrid assembly. Earlier this year, the Broad's David Jaffe and his colleagues published a study in Genome Research highlighting the approach they're using to combine PacBio long reads with two types of Illumina short reads in an effort to achieve automated bacterial genome finishing (IS 8/7/2012).
Such efforts have stemmed from a renewed interest in bringing draft genomes closer to completion through assembly improvements and upgrades — something traditionally done using expensive and laborious manual finishing steps based on targeted Sanger sequencing.
"[E]ven the best draft genomes contain gaps and other imperfections due to limitations in the input data and the techniques used to build draft assemblies," the authors of the new study noted. "Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors all come together to make some regions difficult or impossible to assemble."
Nowadays, though, researchers are increasingly turning to newer, more cost-effective strategies for tackling missing bits of genome sequence and other assembly shortcomings.
On that front, the long reads available on the PacBio instrument — often thousands of base pairs apiece — have proven particularly appealing.
Though the error rate for individual PacBio reads is on the order of 15 percent, the random nature of these errors makes it possible to combine the reads into a far more accurate consensus sequence that does not show systematic biases against certain sequences in the genome.
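The effect of combining noisy reads can be seen in a small simulation. The Python sketch below is a simplified illustration, not PacBio's or PBJelly's consensus algorithm: it assumes the reads are already aligned and the same length, and it models only random substitution errors, whereas real long reads also carry insertions and deletions. The read count, sequence length, and function names are arbitrary choices for the example.

```python
import random
from collections import Counter

BASES = "ACGT"

def noisy_copy(seq, error_rate, rng):
    """Return a copy of seq with random substitution errors."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            # Replace the base with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def majority_consensus(reads):
    """Column-wise majority vote over equal-length, pre-aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
truth = "".join(rng.choice(BASES) for _ in range(500))
reads = [noisy_copy(truth, 0.15, rng) for _ in range(25)]
consensus = majority_consensus(reads)

print("single read identity:", identity(reads[0], truth))
print("consensus identity:  ", identity(consensus, truth))
```

Because the simulated errors fall at random positions and on random bases, the wrong base rarely wins the vote in any one column, so the consensus is far closer to the true sequence than any single read.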
"We saw the PacBio long reads as an opportunity to address remaining gaps in existing draft genomes," Worley said.
Rather than error correcting the PacBio reads with another type of read data, though, she and her colleagues focused on using the long reads to fill in missing bits of sequence between contigs in draft assemblies built around Illumina short read data.
"If you use a method of error-correcting the PacBio sequence with Illumina data, then you're impinging the limits of the Illumina representation onto the PacBio sequence," Worley said. "So you may not get all the benefits if you approach it that way."
Generally speaking, the PBJelly approach involves mapping long PacBio reads to an existing draft genome, paying particular attention to the reads associated with gaps in that assembly.
Well-supported reads that address a given gap are retained and used to generate a high-quality consensus sequence representing the long reads that span all or part of that gap. The consensus then either fills the gap or reduces its size.
"A gap is considered closed when its neighboring contigs are connected by constructed sequence," the authors wrote. "A gap is improved by extending the neighboring contigs into the gap, although the entire gap sequence remains unresolved."
In their PLoS One study, for instance, the researchers addressed 99 percent of the more than 6,000 gaps in a D. pseudoobscura draft genome using PBJelly and 24-fold mapped PacBio coverage, closing 69 percent of these gaps, improving 12 percent of the gaps, and bumping the contig N50 up to 224 kilobases from 53 kilobases.
With 4-fold mapped PacBio coverage of the budgerigar genome and a draft genome assembled with Assemblathon 2.0 data, meanwhile, they upgraded more than 10,000 of the gaps identified in the initial assembly, increasing the contig N50 from 134 kilobases to 233 kilobases.
For the Sooty mangabey genome, the addition of 6.8-fold mapped PacBio read data to a preliminary assembly made it possible to pare down gaps by more than 118.2 bases with PBJelly, leading to a contig N50 jump from around 35 kilobases to more than 128 kilobases.
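Contig N50, the statistic cited for all three genomes, is the contig length at which contigs of that length or longer contain at least half of the assembled bases. The Python sketch below computes it and shows, with made-up contig lengths rather than data from the study, why closing a gap raises the value: two contigs merge into one longer contig.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop once the longest contigs cover at least half of all bases.
        if running * 2 >= total:
            return length
    return 0

# Hypothetical assembly: four contigs separated by gaps.
before = [10, 10, 10, 10]
# Closing one gap joins two neighboring contigs into a single contig.
after = [20, 10, 10]
print(n50(before), n50(after))
```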
While the single-pass error rate for individual PacBio reads remained around 15 percent, the researchers saw that the consensus sequences they were able to generate by combining these reads were much more accurate.
"The consensus quality on the input assembly is more than 95 percent around a gap and the gap filling data is also more than 95 percent accurate," Worley said.
PBJelly should theoretically be useful for upgrading genomes of any size, the study's first author, Adam English, told In Sequence, though very large genomes will require more computational power.
Even so, because individual read mapping and gap assemblies occur independently from one another in PBJelly, he explained, the software can parallelize these processes by taking on small chunks of data at a time, speeding up the assembly upgrade.
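Because each gap can be handled on its own, the work divides naturally among workers. The Python sketch below illustrates that pattern with the standard library's thread pool. It does not reflect how PBJelly schedules its jobs: `assemble_gap` is a hypothetical placeholder that only counts supporting reads, and a real pipeline would more likely hand CPU-heavy tasks to separate processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def assemble_gap(task):
    """Placeholder for one independent gap assembly (hypothetical).

    task: (gap_id, reads) where reads are the long reads assigned
    to that gap. Here the 'assembly' just counts supporting reads.
    """
    gap_id, reads = task
    return gap_id, len(reads)

def assemble_all(tasks, workers=4):
    """Run every gap task independently and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No task depends on another, so they can run in any order.
        return dict(pool.map(assemble_gap, tasks))

tasks = [("gap1", ["readA", "readB"]), ("gap2", ["readC"])]
print(assemble_all(tasks))
```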
Based on their results so far, the researchers do not have a specific recommendation for the mapped PacBio read depth needed to improve genomes with PBJelly, since this seems to depend on the quality of the original draft assembly.
In particular, it's typically more difficult to fill gaps with the PacBio long read data when the sequence surrounding a given gap is of poor quality, Worley noted. "If you trim back those ragged edges, you may do a better job of filling those contigs," she said. "So the state of the incoming assembly can impact the results."
There are also hints that the quality of the genomic DNA used for generating the added PacBio long read data affects the extent to which this data can be used to successfully close or improve scaffold gaps.
Even so, results so far suggest that there is not much additional information that can be gained by generating more than 15- to 20-fold PacBio sequence, Richards explained, and there are indications that somewhere on the order of 10-fold PacBio coverage should suffice for upgrading genomes in most cases.
"I think most people would be happy with 10X," he said. But it's very dependent on assembly features, average gaps sizes, and that kind of thing."
Authors of the current study did not put a firm price tag on the approach, which varies depending on genome size, depth of PacBio coverage, the version of the PacBio chemistry used, and so on.
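How those factors interact can be sketched with simple arithmetic: the amount of sequence required is roughly genome size times fold coverage, and the number of SMRT cells is that amount divided by the yield per cell. The Python sketch below uses invented numbers, not actual PacBio specifications or figures from the study, and it ignores the difference between raw and mapped coverage.

```python
import math

def cells_needed(genome_size_bp, fold_coverage, yield_per_cell_bp):
    """Estimate how many sequencing cells a target coverage requires."""
    return math.ceil(genome_size_bp * fold_coverage / yield_per_cell_bp)

# Hypothetical 1-gigabase genome sequenced to 10-fold coverage.
genome = 1_000_000_000
print(cells_needed(genome, 10, 100_000_000))  # hypothetical yield per cell
print(cells_needed(genome, 10, 200_000_000))  # same run with doubled yield
```

Under these assumptions, doubling the yield per cell halves the number of cells needed for the same coverage.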
"We've had a fair number of improvements coming to the PacBio machine over the past year," Richards noted. "So an improvement that doubles the yield of the SMRT cell can make quite a big difference in the price of this."
He explained that the PBJelly method for genome upgrade is "incredibly cheap" compared with manually finishing a genome, but far more expensive than generating a rough draft assembly with low-coverage, high-throughput sequence data alone.
"I think if people are going to compare the cost of this method, they have to compare it to manual, PCR-based, directed Sanger finishing," Richards argued. "In that regard I think it's very cost effective. But that's not something you do for every genome."
For instance, he noted that in the immediate future, the technology may be best applied to reference genomes used by large communities, "where closing the gaps really makes a difference to a lot of people."
"There's a lot of interest in trying to improve on some of the important model organism [genomes]," said CSHL's Schatz. "If you have a good assembly to start with, it makes sense to try to apply the PacBio reads in a very focused way."
But Schatz also noted that the appropriate sequencing and assembly strategy for a given genome may depend on the biological features of the organism itself. "If it's [a genome with] higher ploidy, higher heterozygosity, higher repeats — all these complicating factors may push you in one direction or the other."
For his part, Schatz said he is trying out PBJelly for some preliminary genome upgrade analyses on a rough draft assembly of the wheat genome, known for its size and complexity.
Generally speaking, Schatz said he believes the PBJelly gap-filling method is complementary to the up-front error correction approach that he and his colleagues have developed for doing de novo hybrid assembly with PacBio reads.
"[PBJelly] is a really good idea for certain amounts of coverage and certain read lengths," he said. "But I do think it's an open research question as to what's the right strategy for genomes in general."