An international team of scientists has published an algorithm for assembling chromosomes from next-generation sequence reads that uses reference genomes from related species to order scaffolds from newly sequenced organisms.
In a paper published recently in Proceedings of the National Academy of Sciences, the developers explain that the Reference-Assisted Chromosome Assembly, or RACA, algorithm uses information about chromosome organization from previously sequenced species as a reference from which to build chromosomes for the newly sequenced species.
Using related species as references, RACA is able to properly assemble scaffolds from newly sequenced organisms into chromosomes even though it has no prior information or “physical map” that shows how the scaffolds are arranged on the chromosomes, the researchers explained. In addition to related species, RACA relies on the genomes of outgroup species to resolve ambiguities in the assembly.
Jian Ma, a co-author on the paper and an assistant professor of bioengineering, biophysics, and computational biology at the University of Illinois at Urbana-Champaign, explained that RACA is more balanced in its assembly approach than existing reference-based methods — such as ABACAS and OSLay — because it uses genomic information from several species to arrange the sequence reads. Using the reference for a single species, he noted, can result in biased assemblies.
Basically, RACA uses information from related species along with paired-end reads from the new species to “piece [the fragmented] scaffolds together” into chromosome-scale assemblies, he explained to BioInform.
A more detailed explanation from the PNAS paper states that RACA “uses comparative evolutionary inference together with support from paired-end reads in de novo-generated scaffolds to reconstruct chromosomal architecture with high accuracy.”
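As a rough illustration of the idea described above, the following Python sketch combines comparative evidence (whether two scaffolds are neighbors in a reference or outgroup genome) with paired-end read support, then greedily joins the best-supported scaffold pairs into chains. This is not the RACA implementation. The function names, the weighting parameter `alpha`, the threshold, and the scoring formula are all hypothetical simplifications, and scaffold orientation is ignored.

```python
from collections import defaultdict


def score_adjacencies(candidates, ref_adj, outgroup_adj, read_support, alpha=0.5):
    """Score each candidate scaffold pair between 0 and 1.

    Hypothetical scoring: an equal-weight vote from the reference and the
    outgroup, blended with normalized paired-end read support.
    """
    max_reads = max(read_support.values(), default=0) or 1
    scores = {}
    for pair in candidates:
        key = frozenset(pair)
        comparative = 0.5 * (key in ref_adj) + 0.5 * (key in outgroup_adj)
        reads = read_support.get(key, 0) / max_reads
        scores[key] = alpha * comparative + (1 - alpha) * reads
    return scores


def chain_scaffolds(scores, threshold=0.5):
    """Greedily accept high-scoring adjacencies that keep chains linear."""
    degree = defaultdict(int)  # each scaffold may have at most two neighbors
    parent = {}                # union-find forest, used to reject cycles

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted = []
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    for key, score in ranked:
        if score < threshold:
            break
        a, b = sorted(key)
        if degree[a] >= 2 or degree[b] >= 2:
            continue  # joining would create a branch
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # joining would close a cycle
        parent[root_a] = root_b
        degree[a] += 1
        degree[b] += 1
        accepted.append((a, b))
    return accepted


def pair(a, b):
    return frozenset((a, b))


# Toy data: four scaffolds, with invented adjacency and read-count evidence.
candidates = [("s1", "s2"), ("s2", "s3"), ("s1", "s3"), ("s3", "s4")]
ref_adj = {pair("s1", "s2"), pair("s2", "s3"), pair("s3", "s4")}
outgroup_adj = {pair("s1", "s2"), pair("s2", "s3")}
read_support = {pair("s1", "s2"): 10, pair("s2", "s3"): 8, pair("s1", "s3"): 2}

scores = score_adjacencies(candidates, ref_adj, outgroup_adj, read_support)
chains = chain_scaffolds(scores)
print(chains)
```

In this toy data set, the pair supported by both genomes and by many read pairs is accepted first, while the pair supported only by a few reads falls below the threshold and is left unjoined.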
The researchers note that RACA’s “framework is generic enough to accommodate other available information, such as known [evolutionary breakpoints] and partial genetic mapping data.”
Ma and colleagues say that RACA could be incorporated into assembly workflows to complement the work of existing de novo assemblers because it can use scaffolds generated by these algorithms to create complete chromosomes.
As an example, the PNAS paper describes how the researchers used SOAPdenovo sequence scaffolds to assemble data from the newly sequenced Tibetan antelope genome, using the cattle genome as a reference and the human genome as an outgroup.
The development team believes that the program could help scientists better assemble newly sequenced genomes. In particular, the paper states that RACA would be a useful tool in ongoing efforts such as the Genome 10K project, which aims to sequence the genomes of 10,000 vertebrate species, and the i5K initiative, which plans to sequence the genomes of 5,000 insects and related arthropods.
It could also shed some light on “phenotypic evolution” by helping researchers understand “how chromosomes are organized in one species relative to other species,” Harris Lewin, a professor of evolution and ecology at the University of California, Davis and an author on the PNAS paper, said in a statement.
Speaking with BioInform, Lewin, who is also vice chancellor for research at UC Davis, explained that his laboratory’s research into evolutionary breakpoints (regions where chromosomes rearrange) has shown that these areas are “very rich in genes and gene duplications that are associated with adaptive evolution such as immune response, reproduction, and olfactory receptors.” The group has published papers on the subject in Science, Genome Research, and elsewhere.
“Our goal is to be able to better identify breakpoint regions that are associated with the evolution of specific lineages on a comprehensive scale, which will allow us to correlate some of the major adaptive changes that have occurred along those lineages with chromosome rearrangements,” Lewin explained.
Tools like RACA make it possible to analyze chromosome organization in multiple species simultaneously, “especially those that are representative of the major clades in the mammalian tree phylogenies,” for instance, and by doing so “understand how these rearrangements have contributed to evolution of traits and phenotypes,” he explained.
He added that the researchers have also released a comparative genome browser that allows users to visualize how chromosomes are organized in different species and look at “genome rearrangements on a chromosome-by-chromosome basis.”
Ma told BioInform that the developers intend to work on making future releases of RACA more efficient and scalable as the number of species requiring assembly grows.
The developers acknowledge that chromosome assemblies will get better as NGS technologies evolve to produce longer sequence reads. But they also argue that lengthier reads won’t address the problem entirely and as such there is still a place for algorithms like RACA.
“If you had very long reads of 1,000 or 5,000 bases or longer … theoretically it’s possible that you can create assemblies that will be almost chromosome size in length,” Lewin said.
However, “as of this moment, we really don’t have any reliable method … that I know about … that would create scaffolds of a size that would be useful for the evolutionary analysis that we do,” he said. “In the future that may be the case but it’s not the case right now and with the number of genomes that are being created, we need methods right now to be able to do this.”
Steven Salzberg, a professor of medicine and biostatistics at Johns Hopkins University School of Medicine, described RACA’s approach as a “good idea” but “not a new one.”
He pointed out that other groups have published methods that are technically similar to RACA. For example, he and colleagues from the Institute for Genomic Research published a paper in 2004 describing an assembler called AMOS-cmp, which applies an approach akin to RACA’s to Sanger reads.
In the supplement section of the PNAS paper, Ma et al. acknowledge that RACA isn’t the first of its kind.
However, they argue that their method differs significantly from its predecessors because it proposes “a novel framework to specifically consider the tree topology and branch lengths of the phylogeny when computing the posterior probability of adjacencies in the target genome in the context of genome evolution.”
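To give a sense of how branch lengths can enter a posterior probability for an adjacency, the Python sketch below uses a deliberately simplified stand-in, not the model from the PNAS paper. It assumes a star-shaped phylogeny in which every comparison species hangs directly off the target, so tree topology is ignored, and it treats an adjacency as a two-state character (present or absent) that flips under a symmetric Markov model. The function names, the `rate` parameter, and the prior are hypothetical.

```python
import math


def p_conserved(branch_length, rate=1.0):
    """Probability that a two-state character is unchanged across a branch
    under a symmetric two-state Markov model."""
    return 0.5 + 0.5 * math.exp(-2.0 * rate * branch_length)


def adjacency_posterior(observations, prior=0.5, rate=1.0):
    """Posterior probability that an adjacency exists in the target genome.

    observations: list of (present, branch_length) tuples, one per comparison
    species. Branch lengths are assumed to be positive, and species are
    treated as independent given the target state (star phylogeny assumption).
    """
    weight_present = prior
    weight_absent = 1.0 - prior
    for present, branch_length in observations:
        same = p_conserved(branch_length, rate)
        weight_present *= same if present else 1.0 - same
        weight_absent *= 1.0 - same if present else same
    return weight_present / (weight_present + weight_absent)


# Invented branch lengths: a close relative and a distant outgroup.
close_only = adjacency_posterior([(True, 0.1)])
distant_only = adjacency_posterior([(True, 1.0)])
both_agree = adjacency_posterior([(True, 0.1), (True, 1.0)])
print(close_only, distant_only, both_agree)
```

Under these assumptions, an adjacency seen in a closely related species counts for more than the same adjacency seen in a distant one, and agreement between the two raises the posterior further.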
They also contend that they are proposing “a new model to consider both phylogenetic comparative information and paired-end read mapping,” and that the method can be used to “detect and correct mis-assembled scaffolds.”