Researchers in Michael Waterman’s lab at the University of Southern California have published a new algorithm that promises to extend the reach of optical mapping — a nearly decade-old approach that has so far been limited to viral and bacterial genomes — to human genome analysis.
In a paper in this week’s Proceedings of the National Academy of Sciences, Anton Valouev and his colleagues at USC outline their approach, which borrows elements from DNA sequence assembly in an effort to make optical mapping more tractable.
“To our knowledge, this is the first algorithm capable of producing accurate restriction maps, using single DNA molecules, of very large genomes (such as human or rice) in feasible time, through the leveraging of increasingly available cluster computing resources,” the authors wrote.
Optical mapping was originally developed in the early ’90s by David Schwartz of the University of Wisconsin, who is also a co-author on the PNAS paper. The method creates whole-genome restriction maps by stretching out and immobilizing strands of DNA on a glass chip. A restriction enzyme cuts the DNA at specific sites, and these DNA fragments are then stained and visualized via fluorescence microscopy. It’s then up to the assembly algorithm to construct the restriction map based on the order, size, and fluorescence intensity of the fragments.
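To make the idea of a restriction map concrete, the following minimal Python sketch performs an idealized, error-free digest of a short sequence and returns the ordered fragment lengths. It is an illustration only, not part of the published method: the function name `digest`, the toy sequence, and the choice to cut at the first base of the recognition site are all assumptions. In real optical mapping, fragment sizes are estimated from fluorescence intensity rather than read from a known sequence.

```python
def digest(sequence: str, site: str) -> list[int]:
    """Return ordered fragment lengths after cutting at every occurrence of `site`.

    Simplifying assumption: the cut falls at the first base of the site.
    Real enzymes cut at a fixed offset within their recognition site.
    """
    cuts = []
    start = 0
    while True:
        idx = sequence.find(site, start)
        if idx == -1:
            break
        cuts.append(idx)
        start = idx + 1
    # Fragment boundaries: sequence start, each cut, sequence end.
    bounds = [0] + cuts + [len(sequence)]
    # Drop zero-length fragments (a site at position 0 would create one).
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


# Toy example: two GAATTC sites in a 20-base sequence.
fragments = digest("AAGAATTCTTTTGAATTCCC", "GAATTC")
print(fragments)  # [2, 10, 8]
```

The ordered list of sizes is the kind of information an optical map records for one molecule; the assembly problem is to combine many such noisy lists into a single genome-wide map.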
The authors note in the PNAS paper that optical map assembly is inherently more complex than sequence assembly because optical mapping uses single molecules and does not rely on an amplification step, and therefore “cannot benefit from averaging steps intrinsic to bulk measurement techniques used by common DNA sequencing platforms.” Optical mapping data is also highly prone to errors, including false restriction sites, partial digestion, small fragments, sizing errors, and chimeric maps that result from images of ambiguously overlapping DNA molecules.
Nevertheless, optical mapping offers some benefits over other genomic analysis platforms such as sequencing and microarray analysis, because it can identify abnormalities in chromosome structure. The approach is effective in identifying insertions, deletions, translocations, and inversions, for example, making it particularly promising for studying cancer genomes, which are “notoriously rife with aneuploidy and structural aberrations fostered by unchecked genomic instability,” the authors wrote.
The current algorithm used in optical mapping — a Bayesian approach originally developed by Schwartz along with Thomas Anantharaman and Bhubaneswar Mishra of New York University — had “deficiencies” that have limited the approach to small genomes, the authors wrote.
Valouev told BioInform that the existing method is difficult to apply to genomes larger than bacterial ones “without extensive additional ad hoc procedures, and without being limited by the running time. And with our algorithm we can do a fully automatic procedure, in a suitable time.”
Valouev, who is now a postdoc at Stanford University, said that he and his colleagues adopted an approach from sequence assembly called overlap-layout-consensus, which first computes all pairwise alignments, or overlaps, of the optical maps. This step, while computationally intensive, enables the rest of the steps in the approach — layout, consensus, and refinement — to run relatively quickly, Valouev said.
“We precalculate all the overlaps and the assembly is sort of this really tiny program on top that works on the overlaps that were already precalculated,” he said.
The overlap calculation is the most time-consuming step in the process, particularly for larger genomes, because it scales quadratically. However, it offers the advantage that “if we want to reassemble, or play with the assembly parameters to change them a little bit, we don’t need to recalculate all the overlaps because they’ve already been calculated. So we can reassemble really fast if we don’t like something about the parameters of the assembly,” Valouev said.
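The trade-off Valouev describes, an expensive all-pairs overlap step followed by cheap reassembly, can be sketched in a few lines of Python. This is a toy under stated assumptions, not the scoring or layout method from the PNAS paper: maps are plain lists of fragment sizes, and `overlap_score`, `precompute_overlaps`, `select_overlaps`, and the tolerance `tol` are hypothetical names introduced here for illustration.

```python
from itertools import combinations


def overlap_score(a, b, tol=0.1):
    """Toy score: the largest k such that the last k fragments of `a`
    match the first k fragments of `b` within relative tolerance `tol`."""
    best = 0
    for k in range(1, min(len(a), len(b)) + 1):
        if all(abs(x - y) <= tol * max(x, y) for x, y in zip(a[-k:], b[:k])):
            best = k
    return best


def precompute_overlaps(maps):
    """Score every pair of maps in both orders.

    There are n*(n-1)/2 unordered pairs, so the work grows
    quadratically with the number of maps.
    """
    cache = {}
    for i, j in combinations(range(len(maps)), 2):
        cache[(i, j)] = overlap_score(maps[i], maps[j])
        cache[(j, i)] = overlap_score(maps[j], maps[i])
    return cache


def select_overlaps(cache, min_score):
    """Cheap step: filter cached scores; no pairwise recomputation needed."""
    return sorted(pair for pair, score in cache.items() if score >= min_score)


maps = [[5.0, 10.0, 20.0, 8.0], [20.0, 8.0, 15.0], [15.1, 30.0]]
cache = precompute_overlaps(maps)   # expensive, done once
print(select_overlaps(cache, 1))    # [(0, 1), (1, 2)]
print(select_overlaps(cache, 2))    # [(0, 1)]
```

Changing `min_score` only reruns the filter over the cached scores, which mirrors the point that assembly parameters can be adjusted without recalculating the overlaps.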
The overall time required for the process is still considerable, however, and demands substantial computational resources. Valouev said that the USC team used a 400-processor cluster for the test runs published in the paper on Y. pestis, E. coli, T. pseudonana, O. sativa, and human.
For E. coli, the overlap calculation required 31 CPU-hours, layout and consensus an additional hour, and refinement another hour. For human, however, the overlap calculation required 57,000 CPU-hours, although the next two steps took only an hour and two hours, respectively.
Furthermore, the test run for human covered only 4.6 percent of the genome with 30X oversampling.
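For a sense of scale, the reported CPU-hours can be converted into a rough lower bound on elapsed time. The Python sketch below assumes the overlap step parallelizes perfectly across all 400 processors for the whole run, which the article does not state; the function name `ideal_wall_clock_hours` is hypothetical, and actual elapsed times would likely be longer.

```python
def ideal_wall_clock_hours(cpu_hours: float, processors: int) -> float:
    """Lower bound on elapsed time, assuming perfect parallel scaling."""
    return cpu_hours / processors


# Overlap-step CPU-hours reported for the human and E. coli test runs.
human_hours = ideal_wall_clock_hours(57_000, 400)
ecoli_hours = ideal_wall_clock_hours(31, 400)

print(human_hours)                  # 142.5
print(round(human_hours / 24, 2))   # 5.94 (days)
print(round(ecoli_hours * 60, 2))   # 4.65 (minutes)
```

Under that idealized assumption, the human overlap step would occupy the full cluster for roughly six days, compared with a few minutes for E. coli.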
“We had limited success with human because the coverage was not sufficient to enable full de novo assembly for that particular data set,” Valouev said. “But I think when we approach new data sets with larger coverage, we can do extended assembly to get close to 100 percent of the human genome.”
Valouev added that it is “computationally feasible” to construct an optical map for the human genome. “I think it’s a matter of just working with deeper and richer data sets, which are undoubtedly coming.”
One potential beneficiary of an improved algorithm for optical mapping is OpGen, which holds exclusive commercialization rights to the technology. The company currently uses software developed by Mishra and Anantharaman, who serve on the company’s scientific advisory board.
Colin Dykes, executive vice president of corporate development and CSO of OpGen, noted in an e-mail to BioInform that while the company is currently focused on the clinical microbiology market, where there is little demand for larger-scale genome analysis, it is planning to sell its instruments into the research genomics market, “where potential buyers had expressed considerable interest in using the system to analyze larger genomes.”
Dykes wrote that OpGen is currently working with Mishra and Anantharaman to refine and improve its current software, and added that “any new approaches that might stimulate additional interest in optical mapping are very welcome.”
Mishra also welcomed new research in the field and said that there are a number of “good ideas” in the PNAS paper, particularly in the refinement step, but questioned whether the method would indeed enable human-scale optical mapping.
Noting the considerable time the algorithm required to complete 4.6 percent of the human genome, Mishra said, “I wouldn’t call that successful mapping of the human genome. … If they wanted to show they can do a human map, they should have done a higher oversampling.”
Mishra also noted that he and Anantharaman have made a number of improvements to the GenTig algorithm cited in the PNAS paper, which was developed in 1999 — an eternity in genomics time. “In 1999, if you sequenced a microbe you were on the cover of Science and Nature,” Mishra said.
For example, Mishra said that he and Anantharaman have developed a new version of GenTig called HapTig for genome-wide haplotype reconstruction via optical mapping. A paper on the method was presented at the Pacific Symposium on Biocomputing in 2005.
Finally, Mishra noted, advancements in optical mapping don’t lie solely on the computational side of things. Some drawbacks of the approach, such as sizing errors for restriction fragments, could be addressed through improvements in chemistry. “If they do that, the burden on the algorithm will be much less,” he said.