NEW YORK (GenomeWeb) – Researchers from St. Petersburg Academic University of the Russian Academy of Sciences (SPbAU) and the University of California, San Diego have published a new method of reassembling bacterial and other genomes based on assemblies from related species.
The researchers claim that the method, called the Reference-Assisted Genome Ordering Utility (Ragout), addresses some of the shortcomings of other methods that use a similar approach.
The developers described Ragout's approach and provided examples of its application to bacterial genomes, including Escherichia coli and Helicobacter pylori strains, in a recently published Bioinformatics paper.
"In simulations as well as real datasets, we believe that for common bacterial species, where many complete genome sequences from related strains have been available, the current high-throughput short-read sequencing paradigm is sufficient to obtain a single high-quality scaffold for each chromosome," the authors wrote in the paper.
Ragout, they explained, makes use of "multiple references along with the evolutionary relationship among these references … to determine the correct order of the contigs." It also uses "the assembly graph and multi-scale synteny blocks to reduce assembly gaps caused by small contigs from the input assembly."
Ragout takes as input a target assembly, a set of related genomes, and a phylogenetic tree that contains both the target assembly and the references. Full details are provided in the paper, but the gist is as follows. Once Ragout has the data, it uses Sibelia, a separate software package created by Son Pham (a research associate at the Salk Institute for Biological Studies and a co-author on the Ragout paper) and other colleagues, to break the input sequences down into synteny blocks. It then applies genome rearrangement techniques to infer the adjacencies that go missing as a result of that fragmentation step. Lastly, the system assembles the contigs into scaffolds.
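As a rough illustration of the ordering idea, and not the algorithm from the paper, the Python sketch below treats each contig as one signed synteny block, lets every reference vote for the block adjacencies it contains, and chains the best-supported, non-conflicting adjacencies into scaffolds. The simple majority vote is an assumed stand-in for Ragout's phylogeny-aware rearrangement analysis, and the function names and the signed-integer block encoding are hypothetical.

```python
from collections import Counter


def entry_end(block):
    """Extremity through which a left-to-right walk enters a signed block."""
    return (abs(block), "t" if block > 0 else "h")


def exit_end(block):
    """Extremity through which a left-to-right walk leaves a signed block."""
    return (abs(block), "h" if block > 0 else "t")


def vote_adjacencies(references, target_blocks):
    """Count how many references support each adjacency between target blocks."""
    votes = Counter()
    for ref in references:
        # Ignore reference blocks that are absent from the target assembly.
        kept = [b for b in ref if abs(b) in target_blocks]
        for left, right in zip(kept, kept[1:]):
            votes[tuple(sorted((exit_end(left), entry_end(right))))] += 1
    return votes


def build_scaffolds(references, target_blocks):
    """Greedily chain blocks using the best-supported adjacencies (toy model)."""
    votes = vote_adjacencies(references, target_blocks)
    link = {}                                # extremity -> joined extremity
    parent = {b: b for b in target_blocks}   # union-find, used to avoid cycles

    def find(b):
        while parent[b] != b:
            parent[b] = parent[parent[b]]
            b = parent[b]
        return b

    # Strongest support first; ties are broken by the adjacency itself.
    for (a, b), _count in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0])):
        if a in link or b in link or find(a[0]) == find(b[0]):
            continue                         # conflicts with a stronger choice
        link[a], link[b] = b, a
        parent[find(a[0])] = find(b[0])

    scaffolds, visited = [], set()
    for block in sorted(target_blocks):
        if block in visited:
            continue
        if (block, "t") not in link:
            entry = (block, "t")
        elif (block, "h") not in link:
            entry = (block, "h")
        else:
            continue                         # interior block, reached from a chain end
        scaffold = []
        while True:
            name, side = entry
            scaffold.append(name if side == "t" else -name)
            visited.add(name)
            leaving = (name, "h" if side == "t" else "t")
            if leaving not in link:
                break
            entry = link[leaving]
        scaffolds.append(scaffold)
    return scaffolds


# Hypothetical data: three references, target contigs 1-5 (block 9 is not in the target).
refs = [[1, 9, 2, 3, 4], [1, 2, 3, 4], [1, -3, -2, 4]]
print(build_scaffolds(refs, {1, 2, 3, 4, 5}))  # [[1, 2, 3, 4], [5]]
```

In this toy run the third reference carries an inversion of blocks 2 and 3 and is outvoted at the boundaries, while block 5 has no supporting adjacency and remains a scaffold of its own.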
"The above procedure is repeated multiple times with different synteny block scales, and the resulting scaffolds in these iterations are reconciled into a single set of scaffolds," the paper states. "Afterwards, a refinement step is performed [where] small and repetitive contigs are recovered and inserted back into the scaffolds by using the adjacency information from the assembly graph."
The reference genomes that Ragout uses for its assembly do not have to be complete ones, Pham, who developed the software along with SPbAU master's student Mikhail Kolmogorov and other colleagues, told BioInform. If completed genomes aren't available, it can use incomplete references to assist the target assembly. Also, although the paper only explores Ragout's application to bacterial genomes, Pham said that the software isn't restricted to a single type of organism. In fact, researchers in a separate study have used the software to assemble mammalian genomes, he said. Lastly, there's also no limit to how many references can be entered into the system; in this case, more is better.
Ragout offers a cheaper alternative to some of the more costly and labor-intensive solutions that have been used to improve assembly quality, such as using longer reads from Pacific Biosciences sequencers or using jumping libraries to connect small contigs into larger scaffolds, the paper states.
This multiple reference approach to genome assembly also takes care of scaffold errors caused by the presence of structural variants and genomic rearrangements, Ragout's developers claim. Errors due to structural variants were a problem for early reference-assisted assembly tools which simply lined contigs up against the reference and then "ordered them according to their positions in the reference genome." An attempt to correct this problem, dubbed "the contig ordering problem," was published in 2006. Here, contigs were ordered in such a way that "the 2-break distance (DCJ distance) between the resulting scaffold and the reference genome is minimized." But that update didn't fix the genome rearrangement problem, according to the current paper.
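For readers unfamiliar with the 2-break (DCJ) distance mentioned above, the Python sketch below computes it in the simplest setting: two genomes that each consist of a single circular chromosome and contain exactly the same synteny blocks, each once. Under those assumptions the distance equals the number of blocks minus the number of cycles in the graph formed by overlaying the adjacencies of the two genomes. The function names and the signed-integer encoding are hypothetical, and linear or multi-chromosome genomes would need additional handling that is not shown.

```python
def entry_end(block):
    """Extremity through which a left-to-right walk enters a signed block."""
    return (abs(block), "t" if block > 0 else "h")


def exit_end(block):
    """Extremity through which a left-to-right walk leaves a signed block."""
    return (abs(block), "h" if block > 0 else "t")


def adjacency_map(genome):
    """Map every block extremity to its neighbour in a circular genome."""
    partner = {}
    for i, left in enumerate(genome):
        right = genome[(i + 1) % len(genome)]  # wrap around: circular chromosome
        a, b = exit_end(left), entry_end(right)
        partner[a], partner[b] = b, a
    return partner


def dcj_distance_circular(genome_a, genome_b):
    """DCJ distance for two single circular chromosomes with identical blocks."""
    blocks_a = sorted(abs(b) for b in genome_a)
    blocks_b = sorted(abs(b) for b in genome_b)
    if blocks_a != blocks_b or len(set(blocks_a)) != len(blocks_a):
        raise ValueError("genomes must contain the same blocks, each exactly once")

    adj_a, adj_b = adjacency_map(genome_a), adjacency_map(genome_b)
    seen, cycles = set(), 0
    for start in adj_a:
        if start in seen:
            continue
        cycles += 1
        current = start
        # Alternate between an adjacency of A and an adjacency of B.
        while current not in seen:
            seen.add(current)
            across = adj_a[current]
            seen.add(across)
            current = adj_b[across]
    return len(genome_a) - cycles


print(dcj_distance_circular([1, 2, 3], [1, 2, 3]))        # 0: identical order
print(dcj_distance_circular([1, 2, 3], [1, -2, 3]))       # 1: one inversion
print(dcj_distance_circular([1, 2, 3, 4], [1, 3, 2, 4]))  # 2: two blocks swapped
```

A contig ordering that minimizes this value against one reference will tend to reproduce that reference's structure, which is why a genuine rearrangement in the target can be smoothed away when only a single reference is used.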
Another method published last year, called Reference-assisted Chromosome Assembly (RACA), reassembles genomes based on a related reference and genomes from outgroup species to resolve ambiguities. In one example from the RACA paper, which was published in Proceedings of the National Academy of Sciences, the researchers used SOAPdenovo scaffolds to assemble the Tibetan antelope genome with the cattle genome serving as the reference and the human genome as an outgroup.
RACA's approach is useful but limited, Ragout's developers write in their paper. Aside from relying on a single reference for the reassembly, it "constructs synteny blocks based on pairwise sequence alignment against only the reference genome" instead of using all of its input sequences. It's an approach that does not work in all cases, and does not address the problem of sequences that do not align to the reference, the researchers said.
Furthermore, RACA does not address questions around what scale to use when constructing synteny blocks, the researchers wrote. A large scale can lead to gaps in the assembly because smaller synteny blocks within smaller contigs are ignored, according to the paper. On the other hand, smaller synteny blocks make rearrangement analysis more difficult because these blocks are "more likely to exhibit structural variations and are also more susceptible to be incorrectly identified," the researchers wrote.
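One half of that trade-off can be shown with a few lines of Python. The sketch assumes, for simplicity, that a contig shorter than the minimum synteny block size cannot carry any block at that scale and therefore drops out of the ordering. The contig names, lengths, and scales are hypothetical, and the noisier rearrangement signal at small scales is not modeled.

```python
def contigs_without_blocks(contig_lengths, min_block_size):
    """Names of contigs too short to contain a synteny block at this scale."""
    return sorted(
        name for name, length in contig_lengths.items() if length < min_block_size
    )


# Hypothetical contig lengths in base pairs.
contigs = {"contig_1": 120_000, "contig_2": 4_200, "contig_3": 800}

for scale in (5_000, 1_000, 100):
    print(scale, contigs_without_blocks(contigs, scale))
# 5000 ['contig_2', 'contig_3']
# 1000 ['contig_3']
# 100 []
```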
As explained in the above summary of Ragout's approach to assembly, the software uses multiple scales to generate multiple scaffolds and then reconciles them into a single set of scaffolds. The Sibelia software — which is designed specifically for bacterial genome data — has a set of optimal parameters that researchers can use for synteny block construction, Pham said. Ragout has also been extended to work with synteny block generation tools that are used for mammalian genomes, for example the Cactus aligner developed by researchers at the University of California, Santa Cruz.
Future development plans also include enabling the software to use de Bruijn graphs in its rearrangement analysis. Right now, Ragout uses assembly graphs only to recover "repetitive blocks or small contigs that could not be captured in synteny analysis," the researchers wrote. "Therefore, it can make mistakes when rearrangements happened on the target branch." There are also ongoing efforts to make the system more user-friendly and easier to install, Pham said.
Ragout's developers will present the software at this year's Intelligent Systems for Molecular Biology conference, which will be held in Boston next month.