This article was originally published Aug. 6.
Broad Institute researchers have come up with a new method for stitching together Illumina short reads and long reads generated on the Pacific Biosciences platform that appears to be well suited for de novo assembly and finishing of bacterial genomes.
The method is the latest in a series of approaches that have emerged in recent months to take advantage of the long PacBio reads to improve the quality of assemblies. These methods promise to help researchers produce high-quality, finished genomes in much less time and at much lower cost than is currently possible.
"We wanted to take this problem of finishing from an enormously expensive and time-consuming process to something that is much, much lower cost and essentially automated," senior author David Jaffe, director of computational research and development at the Broad Institute, told In Sequence.
In a paper appearing online last month in Genome Research, Jaffe and his colleagues demonstrated that such automated finishing is feasible with their hybrid assembly approach, which currently relies on long PacBio reads and two types of Illumina short read data.
Jaffe's team developed a new assembly algorithm for its approach that it folded into the ALLPATHS-LG program previously developed at the Broad Institute. Using this method, the Broad researchers put together finished genomes for 16 bacterial samples, including samples from three species with available reference genome sequences.
Comparisons with these reference genomes indicated that the hybrid assemblies were more accurate than existing references for two of the three species, prompting the study authors to assert that "assemblies exceeding finished quality can be obtained from whole-genome shotgun data and automated computation."
Based on their own success with the approach so far, Jaffe noted that the Broad may eventually offer a bacterial genome sequencing, assembly, and finishing service centered on the hybrid assembly method.
Co-author Carsten Russ, assistant director of emerging technologies with the Broad Institute's genome sequencing and analysis program, confirmed plans to establish such a service, though the timeline and anticipated cost per genome have not yet been hammered out.
Crossing the Finish Line
The introduction of low-cost second-generation sequencing systems has spurred a spike in the availability of draft genome sequences. But perfecting and filling gaps in these sequences with PCR and Sanger sequencing has remained an expensive and painstaking endeavor, Jaffe explained, involving incremental tweaks to the original assembly and manual correction of suspicious or incomplete sequences.
"This became an iterative process, which was why it was so time-consuming," he said. "It could just go on and on and on, because some parts of the genome are extremely difficult to get right."
Consequently, he noted, only a few vertebrate-sized genomes have been finished so far, each over many years and at exorbitant cost.
The tally of finished bacterial genomes is a bit higher, owing to their much smaller size. But these genomes contain complex regions, too, and only a fraction of the bacterial genomes that have been sequenced have reached a finished form.
In an effort to streamline the finishing process and make it more affordable, Jaffe, Russ, and colleagues set out to develop an automated assembly method that could accommodate both short Illumina reads and longer, but more error-prone, PacBio reads.
In addition to generating reads as long as a couple thousand bases, "the PacBio data offer things that are not found in the Illumina data," Jaffe said. For example, the PacBio RS does not require DNA amplification prior to sequencing, which leads to more uniform coverage, including across parts of the genome where Illumina coverage or sequence quality is subpar.
The team is not the first to explore the possibility of using PacBio reads for improving the quality of de novo genome assemblies.
At the Plant and Animal Genomes meeting earlier this year, Michael Schatz of Cold Spring Harbor Laboratory described the PacBio ToCA strategy that he and his colleagues are using to create hybrid assemblies based on PacBio reads in combination with reads generated on Illumina or Roche 454 instruments (IS 1/24/2012). The team published a study employing that hybrid assembly algorithm last month in Nature Biotechnology.
In the same issue of that journal, researchers from PacBio and elsewhere reported on a "scaffolding, overlap-layout-consensus, and error-correction methods" approach for creating bacterial genome assemblies composed of PacBio, Illumina, and Roche 454 sequence data. They applied that strategy to sequence and assemble a Vibrio cholerae isolate collected during Haiti's cholera outbreak in 2010.
As the notion of combining short read data with PacBio reads becomes more common, researchers have started to consider ways of more routinely harnessing the increased accuracy of these hybrid assemblies in the context of genome finishing.
In unpublished experiments done in conjunction with collaborators at the National Biodefense Analysis and Countermeasures Center, for instance, Schatz said his group's hybrid assembly method is making it possible to generate microbial assemblies with chromosome-size contigs in an automated manner.
Researchers at the Baylor College of Medicine are also among those trying their hand at using PacBio reads to finish or upgrade genome sequences. In an e-mail message, Baylor bioinformatics programmer Adam English explained that the PacBio reads offer a cheaper and faster alternative to traditional Sanger-based finishing, particularly for smaller genomes. He did not provide details on the genomes being upgraded at Baylor or the methods being used to assemble PacBio reads with those generated on other platforms.
Three Types of Data
For its part, the Broad team is currently using three kinds of read data to achieve finished bacterial genome assemblies.
Two of these read types — the so-called fragment reads and jump reads — are generated using the Illumina platform, Jaffe explained, while the third set of reads is produced on the PacBio platform.
Together, the three types of data provide resolution encompassing a range of read sizes — from the 100 or 200 bases present in the Illumina paired-end fragment reads to the much longer PacBio reads, which can span a few hundred to as many as one or two thousand base pairs.
Jump reads, which are generated by sequencing the junction fragments of long pieces of DNA that have been circularized by ligation, provide information from sites in the genome that are a few thousand bases apart.
"Basically these data are filling in at different size ranges in terms of power," Jaffe said. "The algorithm, in some sense, kind of follows from that — trying to squeeze out the different things that the data types have to offer."
Some stages of the assembly process overlap with those used to assemble Illumina data on its own, Jaffe explained, including preliminary steps to glue these short reads together into an initial sequence graph.
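To give a rough sense of what gluing short reads into a sequence graph can look like, the sketch below builds a toy de Bruijn graph, one common construction for short-read assembly. It is an illustration only: the article does not describe how ALLPATHS-LG builds its graph, and the function name, the example reads, and the value of k are all assumptions made for this example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Hypothetical helper for illustration; not the ALLPATHS-LG implementation.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-base prefix to its (k-1)-base suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two made-up overlapping reads; real reads would be 100 or more bases long.
reads = ["ACGTA", "GTACC"]
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy case the node AC has two outgoing edges, the kind of branch point that short reads alone may not resolve and that longer reads can help span.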
PacBio reads and, to some extent, jump reads can then be used to patch gaps in that graph. In addition, the researchers worked on ways to stack the PacBio reads on top of one another while unrolling them onto the sequence graph in order to do a rough error correction on these long reads — a step that Jaffe called the "nastiest part of the algorithm."
Once something of a consensus has been achieved, the error-corrected PacBio reads are taken forward and used as part of a new assembly graph.
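The idea of stacking error-prone reads and taking a consensus can be sketched with a simple column-wise majority vote. This is a deliberately simplified illustration, not the Broad's algorithm: it assumes the reads have already been aligned to a common coordinate system and padded with "-" for gaps, and the function name and example reads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over pre-aligned, equal-length reads.

    Hypothetical helper for illustration. Ties go to whichever symbol
    appears first in the column.
    """
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        # A column where the gap symbol wins contributes no base.
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


# Made-up stack of four reads covering the same stretch of sequence.
stack = [
    "ACGTTAGC",
    "ACGATAGC",  # one substitution error
    "ACGT-AGC",  # one deletion error
    "ACGTTAGC",
]
print(majority_consensus(stack))
```

Because errors in individual reads tend to fall at different positions, each is outvoted by the other reads in the stack, which is why deeper coverage makes this kind of correction more reliable.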
"This approach, and ALLPATHS in general, is very high-quality software — very usable, very robust to different conditions," CSHL's Schatz told In Sequence. "They took a bit of a different tack than we did, but I would put it in the same family of approaches."
For the bacterial genomes assessed in the new Genome Research study, the Broad team aimed for a coverage depth of at least 50-fold for each type of read data, though Jaffe noted that the level of coverage used for such applications will likely come down as PacBio reads get longer and the quality of Illumina reads continues to improve.
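As a back-of-the-envelope illustration of what 50-fold coverage implies, mean coverage is roughly the number of reads times read length divided by genome size. The sketch below inverts that relationship for a five-megabase genome. The read lengths are assumptions chosen from the ranges mentioned in this article, not figures from the study, and the function name is hypothetical.

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Reads required for a target mean coverage: N = C * G / L, rounded up."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


GENOME_SIZE_BP = 5_000_000  # a five-megabase bacterial genome
TARGET_COVERAGE = 50        # the 50-fold target described above

# Assumed, illustrative read lengths in bases (not taken from the paper).
READ_LENGTHS_BP = {
    "Illumina fragment": 100,
    "Illumina jump": 100,
    "PacBio": 1000,
}

for library, length in READ_LENGTHS_BP.items():
    count = reads_needed(GENOME_SIZE_BP, TARGET_COVERAGE, length)
    print(f"{library}: {count:,} reads")
```

Under these assumptions, each Illumina library needs about 2.5 million reads and the PacBio library about 250,000, and the required read count falls in proportion as reads get longer.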
The researchers sequenced 16 bacterial samples representing 14 different species. For three of these — Escherichia coli, Streptococcus pneumoniae, and Rhodobacter sphaeroides — finished reference genomes were already available, allowing for direct comparisons of some of the ALLPATHS-LG-based hybrid assemblies.
Based on extensive comparison and laboratory validation steps, the study authors concluded that two of the three hybrid assemblies were more accurate than the existing reference genomes. That fueled a years-long effort to fix one of these reference genomes, that of R. sphaeroides.
The third reference genome, that of E. coli, was already "extremely good," according to Jaffe, who attributed the high quality of the reference to the bug's medical importance and to extensive efforts to perfect the sequence by researchers working on E. coli.
In supplementary data included with the study, the team estimated that the automated method would shave nearly $12,000 off the reagent and labor costs of sequencing and finishing a five-megabase bacterial genome compared to PCR- and Sanger-based finishing of a genome sequenced and assembled using Illumina short reads alone.
One potential weakness in the methodology, according to Jaffe, is that the libraries used to produce the jump reads can vary in quality. The reason for that is not clear, he said, though there is some speculation that it could be related to the jump library protocol or even to methods used to isolate and prepare DNA from bacterial cells.
For his part, Schatz speculated that jump reads may eventually be phased out as longer and longer PacBio reads become available.
"Now that the long [PacBio] reads are maturing so you can get a lot of [5,000 base] and [10,000 base] reads … the days of the jump library are a little bit uncertain," he said.
The current version of the Broad's hybrid ALLPATHS-LG assembly method is being released as open-source software, and Jaffe said its developers are "very actively trying to make the software as usable as possible."
He explained that future improvements to the software are likely to be contingent on advances made in the sequencing technology itself.
"We designed [the algorithm] based on the three-library mixture," Jaffe said. "That may be what we're doing a year from now, but there may be something else which changes the rules of the game."
"We're kind of waiting to see what the next turn in the technology brings," he noted. "At this juncture there's the possibility of either fundamental changes to sequencing technology or incremental but very important advances."
For the time being, researchers say the ability to obtain finished genomes using automated, hybrid assembly methods is limited to smaller genomes, in part due to the cost of producing PacBio reads.
"Bacterial genomes — small genomes — are kind of a sweet spot in that people would be willing to pay a couple thousand dollars, or some relatively modest amount, to get ultra-high quality sequence," Jaffe said.
Schatz said it remains to be seen whether PacBio can generate reads long enough to yield finished vertebrate-size genomes and, if so, what other types of read data would have to be included in that assembly.
"It's not clear that five or 10 or 15 thousand bases would be long enough to get a finished vertebrate genome," he said. "That's kind of an active research question: 'What's the right combination of libraries to do so?'"
But while automated genome finishing remains out of reach for large genomes at the moment, Schatz emphasized that the availability of PacBio reads can still improve the quality of de novo draft genomes compared to assemblies based on short-read data alone.
"Costs may become prohibitive for some when working on larger genomes," added Baylor's English, "but there are currently no alternative platforms providing the extremely long and unbiased reads one receives from PacBio — reads which have a very beneficial effect on an assembly's quality."