Lior Pachter describes comparative sequence assembly with the simplest of analogies: “If you’re working on a jigsaw puzzle, you’re better off working by looking at the front cover of the box so you have a clue where the pieces should go.” Likewise, assembling the tiny fragments of a newly shotgunned genome is much simpler when you have the complete picture of a pre-assembled genome to serve as a guide.
With a growing list of assembled genomes in hand, researchers are finding that comparative genomics can reduce the number of reads needed for whole-genome assembly of the next generation of sequenced species. Pachter, an assistant professor in the department of mathematics at UC Berkeley, and others are using comparative approaches as a “front end” for assembly programs such as Phrap or Arachne, and are finding that the approach cuts costs and speeds up the process.
IP Genesis, a bioinformatics startup based in Houston, Texas, just licensed a commercial version of comparative sequence assembly technology, called CSA, to its first customer — Baylor College of Medicine.
IP Genesis holds an exclusive license to an Argonne National Laboratory patent (US 6,001,562), with the right to sublicense it. The company also owns additional pending patents covering certain aspects of CSA technology.
Vlazny, chief licensing consultant for IP Genesis, said he sees promising demand for the technology. In addition to its obvious use in large academic sequencing centers like Baylor, the technology could benefit contract sequencing firms looking to reduce the price and increase the accuracy of their services, and may even find use in diagnostic chips designed to detect particular disease variants, he said.
A key characteristic of the technology, Vlazny noted, is that it will get even more effective as more genomes become available “because it can use what’s already known in one species as a platform to help build and sequence the genome for another species.”
Comparative sequence assembly is built on the principle that data from assembled genomes can be used to impose mate-pair distance constraints on new sequence reads. These constraints are fed into standard sequence assembly programs not only to obtain more accurate mate-pair distances, but also to eliminate unnecessary reads.
Explained Pachter, “The idea is to take the whole genome shotgun data from a new genome, assemble it using an assembler such as Arachne, and then you align it and compare it to a finished genome. When you have contigs that are separated by a distance of the type one usually finds between mate pairs, then you build what we call a fake or faux mate pair, which we throw back into the assembler, and [then] reassemble everything from scratch.”
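The step Pachter describes, turning contigs that sit a mate-pair-like distance apart on the finished genome into faux mate pairs, can be sketched roughly as follows. This is a minimal illustration and not Pachter's or IP Genesis's implementation: the `ContigAlignment` and `FauxMatePair` types, the `faux_mate_pairs` function, and the 10,000-base default gap limit are all hypothetical, and the sketch assumes the contigs have already been aligned to the reference and considers only neighboring contigs on the same reference sequence.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ContigAlignment:
    """Hypothetical record: where one contig aligns on the finished genome."""
    contig_id: str
    ref_seq: str    # name of the reference chromosome or scaffold
    ref_start: int  # 0-based start of the alignment on the reference
    ref_end: int    # end of the alignment on the reference (exclusive)


@dataclass(frozen=True)
class FauxMatePair:
    """Hypothetical record: a synthetic distance constraint between two contigs."""
    left_contig: str
    right_contig: str
    distance: int   # gap between the contigs, measured on the reference


def faux_mate_pairs(
    alignments: List[ContigAlignment],
    min_gap: int = 0,
    max_gap: int = 10_000,  # assumed upper bound on a typical mate-pair distance
) -> List[FauxMatePair]:
    """Link neighboring contigs whose gap on the reference looks like a mate-pair distance."""
    # Group alignments by the reference sequence they landed on.
    by_ref: Dict[str, List[ContigAlignment]] = {}
    for aln in alignments:
        by_ref.setdefault(aln.ref_seq, []).append(aln)

    pairs: List[FauxMatePair] = []
    for ref_alns in by_ref.values():
        # Walk the contigs in reference order and compare each with its neighbor.
        ordered = sorted(ref_alns, key=lambda a: a.ref_start)
        for left, right in zip(ordered, ordered[1:]):
            gap = right.ref_start - left.ref_end
            if min_gap <= gap <= max_gap:
                pairs.append(FauxMatePair(left.contig_id, right.contig_id, gap))
    return pairs


if __name__ == "__main__":
    example = [
        ContigAlignment("c1", "chr1", 0, 5_000),
        ContigAlignment("c2", "chr1", 7_000, 12_000),
        ContigAlignment("c3", "chr1", 40_000, 45_000),
        ContigAlignment("c4", "chr2", 100, 900),
    ]
    for pair in faux_mate_pairs(example):
        print(pair)
```

In this toy input only c1 and c2 are linked, since they sit 2,000 bases apart on the reference; c2 and c3 are 28,000 bases apart, beyond the assumed limit, and c4 has no neighbor on its reference sequence. In the workflow described above, the resulting constraints would then be handed back to the assembler along with the real mate pairs before reassembling.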
Pachter expects the technique to prove its worth in the assembly of the chimp genome, an “obvious” candidate due to its similarity to the human genome. Pachter said that while some researchers have suggested aligning the chimp and human reads directly, the reads’ short length (around 500 base pairs each) and the large number of duplications in the two genomes add up to an unmanageable number of possible alignments. Comparative assembly, on the other hand, relies on longer contigs, making it “more accurate and able to better leverage the finished genome,” he said.
Pachter said his work is still in the research stage. “Our main goal is to get people thinking generally about comparative assembly … We envision it not just as a tool but as a way to get people to think about what they can sequence and how much sequence they need.” He does plan to release a version of the technology, which he said would work with any assembly program that relies on mate pairs, through his website (http://lemur.lbl.gov/cga), following publication of a paper verifying the approach using simulated reads.
The commercial alternative, however, is already on the market and available from IP Genesis via a straight license or as part of an implementation service that links CSA with third-party or publicly available tools such as Blast, Phrap, or Consed. Vlazny said pricing is flexible, and “depends on what the user wants to do.”