Researchers at the Genome Institute of Singapore have released new genome-scaffolding software that they claim can guarantee reliable genome reconstructions where other methods fail.
Unlike heuristic-based approaches, which offer "no guarantees on the quality of the solution," the program is based on a combinatorial algorithm "that is guaranteed to find the optimal scaffold," the GIS developers, led by Niranjan Nagarajan, note in a paper published in the November issue of the Journal of Computational Biology.
Dubbed the Optimal Paired-End Read Assembler, or Opera, the software focuses on the problem of scaffolding of a set of contigs using paired-end reads. It includes a framework for handling mapping errors and chimeric mate pairs in sequence data; an algorithm to compute gap sizes that align with sequence library constraints; and a "graph contraction procedure" that allows the tool to scale to large datasets.
With these capabilities, Opera provides a "robust solution" to the genome scaffolding problem, Nagarajan said in a statement.
That’s because it "explains/uses as much of the paired read data as possible" and offers "a clear guarantee on the quality of the assembly" while avoiding "overly aggressive assembly heuristics that can produce large scaffolds at the expense of assembly errors," the authors explained in the paper.
GIS is currently using the software to assemble large plant and animal genomes for projects in cancer biology and pharmacology; stem cell and developmental biology; infectious diseases; and human genetics studies.
In a statement, Mihai Pop, interim director of the University of Maryland's Center for Bioinformatics and Computational Biology, described Opera as "the best standalone genome scaffolder available in the community" at present.
Last week, BioInform spoke with Nagarajan about the new scaffolder as well as other bioinformatics efforts at GIS — a research institute under the auspices of Singapore's Agency for Science, Technology, and Research. The following is an edited version of the conversation.
Walk me through Opera's development process.
Opera is designed to take assemblies from shotgun sequencing datasets and scaffold them using mate-pair data. The goal is to be able to take those assembled contigs and get a more complete reconstruction of the genome.
To model this formally, the approach that people take is saying, 'Okay, you get connections between contigs using this mate-pair information and you want to put the contigs in a linear order such that the mate-pair information is respected as much as possible.' The mate-pair information has some noise in the sense that there are sequencing errors and chimeric reads and so it’s not possible to match the data completely, but you want to match it as much as possible.
It's been shown in earlier work that this problem turns out to be computationally intractable in the worst case. In our work we studied what real scaffolding problems look like to identify characteristics of real datasets that can be exploited to speed up the analysis. We exploited the fact that mate-pair data has bounds on the library size and that the graph structure for scaffolding has a special locality property, to design an efficient algorithm in practice.
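To make the formulation concrete, here is a minimal Python sketch of the objective Nagarajan describes: place contigs in a linear order so that as many mate-pair links as possible are respected. It is an illustration under stated assumptions, not Opera's algorithm. It ignores contig orientation, assumes zero-length gaps, and tries every ordering, which is only feasible for a handful of contigs. The names `concordant_links` and `best_order`, the link tuple layout, and the toy data are all hypothetical.

```python
from itertools import permutations


def concordant_links(order, lengths, links):
    """Count the links that a given left-to-right contig order satisfies.

    Each link is (a, b, lo, hi): contig a should precede contig b, with the
    distance from the end of a to the start of b inside [lo, hi].
    Assumption: contigs are laid end to end with no gaps between them.
    """
    start = {}
    pos = 0
    for name in order:
        start[name] = pos
        pos += lengths[name]

    count = 0
    for a, b, lo, hi in links:
        dist = start[b] - (start[a] + lengths[a])
        if lo <= dist <= hi:
            count += 1
    return count


def best_order(lengths, links):
    """Exhaustively search all orders; exponential, for toy inputs only."""
    return max(
        permutations(sorted(lengths)),
        key=lambda order: concordant_links(order, lengths, links),
    )


# Toy data (hypothetical): three contigs and four links, one of them noisy.
lengths = {"A": 1000, "B": 500, "C": 800}
links = [
    ("A", "B", 0, 300),
    ("B", "C", 0, 300),
    ("A", "C", 400, 900),
    ("C", "A", 0, 300),  # contradicts the others, as a chimeric pair might
]

order = best_order(lengths, links)
satisfied = concordant_links(order, lengths, links)
```

In this toy case no order can satisfy all four links, because the last one contradicts the third, so the best achievable layout leaves one link unexplained. That is the sense in which noisy data can only be matched "as much as possible." The bounded distance window on each link stands in for the library-size bounds mentioned above.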
When we started off doing this, we had mixed ideas about it. We knew we could prove something nice in terms of complexity results but we weren't sure if, when we implemented this, it would turn out to be blazingly fast or horribly slow. In practice it was a happy coincidence that it ended up being blazingly fast and produced very good assemblies. In fact, in all the experiments we have done, Opera has consistently been the better scaffolder. Also, Opera has the nice property that it is guaranteed to report a solution that maximizes agreement with data.
What other genome scaffolders did you compare Opera to?
Bambus, which is one of the first standalone scaffolders; SOPRA, which is similar in spirit to Opera; and the scaffolder in Velvet. The results presented in our paper are largely for small genomes and that was what we were aiming for initially. Since then, we have successfully used Opera on much larger genomes and compared it to other genome assemblers better suited for large genomes, such as SOAPdenovo, and Opera does very well in those comparisons as well.
Does Opera run on a standard compute infrastructure or is it necessary to purchase additional hardware for the tool to work?
It doesn’t require anything special. It’s a C++ program and should run on any system. In terms of memory and CPU time, typically for small genomes, there is not much of a memory problem and the runtime is a few minutes. For much larger genomes, it will probably take between half an hour and a couple of hours. Memory-wise, we do once in a while require hundreds of gigabytes, so if you have a large genome you may need a large-memory machine.
How about hybrid assemblies?
Yes, if you have hybrid datasets, for example Illumina paired-end sequencing data and a SOLiD mate-pair library, the contig assembly can be done with a typical assembler and then the mate-pair data can be used for scaffolding with Opera. The only thing Opera needs is a mapper that can take the reads and map them onto the contigs. Once it has coordinates, it can work from there. It's just a matter of a file conversion because Opera has a certain format that it expects for read coordinates.
Currently, it can take output from the mapper Bowtie, but we plan to increase the support to other mappers as well. BWA is definitely one we want to add pretty soon.
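As a rough illustration of how mapped read coordinates become scaffolding input, the Python sketch below groups mate pairs whose two reads land on different contigs into contig-to-contig links. It keeps only links supported by a minimum number of pairs, a common way to discount chimeric pairs and mapping errors. The input tuple layout, the function name `bundle_links`, and the `min_support` threshold are assumptions made for this example; they do not describe Opera's actual input format or its error-handling framework.

```python
from collections import Counter


def bundle_links(mapped_pairs, min_support=2):
    """Group mate pairs that span two contigs into supported links.

    mapped_pairs: iterable of (contig1, pos1, contig2, pos2) tuples, one per
    mate pair, as a read mapper might report them (hypothetical layout).
    Returns {(contig_x, contig_y): pair_count} for links with enough support.
    """
    support = Counter()
    for contig1, _pos1, contig2, _pos2 in mapped_pairs:
        if contig1 == contig2:
            continue  # both reads on one contig: no scaffolding information
        support[tuple(sorted((contig1, contig2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


# Toy mapping results (hypothetical coordinates).
pairs = [
    ("c1", 900, "c2", 50),
    ("c1", 950, "c2", 120),
    ("c1", 100, "c1", 400),  # within one contig, ignored
    ("c2", 700, "c3", 30),   # single supporting pair, filtered out
]

edges = bundle_links(pairs)
```

Here only the c1-c2 link survives, because it is the only one backed by two independent pairs.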
Speaking generally now, is this the first publicly available software that GIS has released?
Not really; there are other groups who have released software. In the past, we haven’t been that aggressive about publicizing our programs because some of them are based on datasets that we generate in house and may not have wide applicability. But Opera is something that the assembly community would find useful, so we want to make it easily available.
What other bioinformatics efforts is GIS involved in?
We are one of several groups in GIS working on bioinformatics problems and there is a lot of other great work, including work on sequence mapping and assembly, tools for expression and ChIP-seq analysis, phylogenetic analysis, and functional variant annotation.
Our group is now focused on several extensions to Opera, such as handling repeats, and metagenomic and single-cell sequencing data. We also work on other computational problems in sequence assembly, variant calling, and microbial and metagenomic analysis.
What sort of hardware do you have in house to handle your computation needs?
Like other large sequencing centers, we have invested a fair bit in our computational resources, including a medium-sized compute cluster, several large [symmetric multiprocessing systems] and sufficient storage for the data coming out of our sequencers. We also have access to other [high-performance computing] resources available in Singapore.
How much data do you generate daily, weekly, or monthly?
I don’t have a precise number but it is substantial as we have six SOLiD machines, four Illumina GAs, two HiSeqs and a Roche 454 in house and several large projects that leverage these resources.
Are you considering cloud infrastructure?
Yes, we collaborate with the Institute for High Performance Computing in Singapore, which has expertise in this area, to port our tools and pipelines onto a cloud-computing framework. We are also excited about the potential for development of large-scale integrative analysis tools on the cloud as a scientific and pharmacogenomics resource.