By John S. MacNeil
Almost five years ago, the Department of Energy’s Joint Genome Institute jolted the sequencing establishment when it decided that generating high-quality draft versions of microbial genomes would in many cases be a better use of its money than sequencing every genome to completion. The rationale, according to then-director Elbert Branscomb, was that JGI could assemble four draft sequences — essentially whole-genome shotgun sequence data assembled into contigs — for the same price as sequencing one genome to completion.
But now the game is changing. As scientists at TIGR, Washington University in St. Louis, and elsewhere have developed new methods and algorithms for automating the tedious process of closing gaps and verifying DNA sequence, finishing a genome in one fell swoop might not be so costly after all. In fact, it’s proving more expensive to reopen a draft genome and finish it later than to just finish it right away. Sequencing centers are producing ever higher-quality drafts, making the finishing process significantly easier. And TIGR has intensified its finishing efforts, moving most of its draft sequencing to a new facility, allowing the genome closure operation to take over much-needed lab space.
Meanwhile, the argument that a finished genome provides more valuable data has not wavered. While JGI’s draft sequences have certainly proved a fast and useful tool for delivering data to wide swaths of the microbial community, many of these same researchers say that ultimately the most insightful analyses require finished genome sequences anyway.
The upshot is that the strategy for stopping sequencing at the draft stage may be going the way of the ABI 377. Even officials at JGI acknowledge they’re planning to phase out the practice of stopping at draft sequences, except in cases such as the sequencing of multiple strains of the same organism, when a draft version will suffice. The result may be that microbial researchers spend less time stewing over how and when their sequence will assume a more robust and high-quality form, and more time performing higher-order analyses, such as identifying paralogous gene families and designing PCR primers for microarrays.
Finishing by Remote
How has finishing a genome become so much cheaper in five years? Bioinformaticists and sequencing center directors say that the technology for automating finishing procedures such as gap closure has advanced rapidly, even more rapidly than progress in making draft sequencing more efficient. As a result, finishing costs have dropped precipitously. Researchers at TIGR say that generating a closed microbial genome now costs 8¢ to 9¢ per base pair, on average about twice the 3¢ to 4¢ per base pair it costs to produce a draft sequence; several years ago, finishing cost three times as much as drafting. So using TIGR’s numbers, sequencing a 3 megabase genome to completion now costs at most $180,000 more than producing a draft sequence (a 6¢ difference per base pair across 3 million base pairs).
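The $180,000 figure follows directly from the per-base-pair ranges TIGR quotes. A minimal sketch of that arithmetic, with variable and function names chosen here purely for illustration:

```python
# Back-of-the-envelope check of the TIGR figures quoted in the text.
# Costs are kept in whole cents per base pair to avoid floating-point noise.
GENOME_SIZE_BP = 3_000_000      # the 3 megabase genome used in the text

FINISHED_CENTS = (8, 9)         # low and high cost of a closed genome, per bp
DRAFT_CENTS = (3, 4)            # low and high cost of a draft, per bp


def extra_cost_dollars(finished_cents, draft_cents, genome_bp):
    """Extra dollars needed to finish a genome rather than stop at draft."""
    return (finished_cents - draft_cents) * genome_bp // 100


# Worst case: most expensive finishing against the cheapest draft.
max_extra = extra_cost_dollars(FINISHED_CENTS[1], DRAFT_CENTS[0], GENOME_SIZE_BP)
# Best case: cheapest finishing against the most expensive draft.
min_extra = extra_cost_dollars(FINISHED_CENTS[0], DRAFT_CENTS[1], GENOME_SIZE_BP)

print(f"Extra cost to finish: ${min_extra:,} to ${max_extra:,}")
```

The same ranges put the lower bound on the premium at $120,000, a number the article does not state but which falls out of the quoted costs.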
The numbers are still contested, though, and it doesn’t help that scientists use various definitions for “draft” and “finished.” Definitions for a finished genome range from the “full Bermuda” standard, which Branscomb and Paul Predki described in the Journal of Bacteriology last December as data with an expected base-calling error rate of less than 10⁻⁴ and no gaps or other errors “that mortal efforts could remove,” to Steven Salzberg’s insistence that finished should in fact mean that every gap is closed and every base pair is 100 percent accurate.
Branscomb and Predki, who added that draft sequence data can range “in current practice” from approximately three-fold coverage in short (<400bp), uncorrelated reads, to 10-fold or more in long (>600bp), “paired-end” reads of mixed separation lengths, placed the cost of producing a sequence finished to full Bermuda standards at 10¢ per base pair, four times the cost they give for producing a “high-quality” draft sequence. Under this scenario, a finished 3 megabase genome would cost $300,000, compared to $75,000 for a draft.
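Branscomb and Predki’s totals imply a draft cost per base pair that the text does not spell out. A short sketch of the calculation, assuming only the 10¢ finished cost and the four-to-one ratio quoted above (the variable names are illustrative):

```python
from fractions import Fraction

GENOME_SIZE_BP = 3_000_000           # 3 megabase genome
FINISHED_CENTS_PER_BP = Fraction(10)  # full Bermuda standard, as quoted
FINISHED_TO_DRAFT_RATIO = 4           # finished costs four times a draft

# Implied draft cost per base pair: 10 cents / 4 = 2.5 cents.
draft_cents_per_bp = FINISHED_CENTS_PER_BP / FINISHED_TO_DRAFT_RATIO

# Convert total cents to dollars for the whole genome.
finished_total = FINISHED_CENTS_PER_BP * GENOME_SIZE_BP / 100
draft_total = draft_cents_per_bp * GENOME_SIZE_BP / 100

print(f"Implied draft cost: {float(draft_cents_per_bp)} cents per bp")
print(f"Finished: ${int(finished_total):,}  Draft: ${int(draft_total):,}")
```

The implied 2.5¢ per base pair for a draft sits just below the 3¢ to 4¢ range TIGR cites, one small illustration of how the estimates differ between centers.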
Branscomb stresses that these numbers vary over time. “It is overwhelmingly the case that this cost ratio is just not a well-defined number, absent agreement on what ‘draft’ means, on what ‘finished’ means (less of a problem, but still not trivial), on how costs are really to be assessed, and absent a lot of averaging over projects of different ‘difficulty,’” he writes in an e-mail. “And even [with agreement], only a snapshot would be had of a rapidly changing reality.”
But Predki, a former JGI associate director now at protein microarray company Protometrix, acknowledges that finishing costs have decreased rapidly. “My read on average is that for the centers that have really focused on finishing over the last couple years, the finishing costs have probably decreased more rapidly than the drafting costs,” he says. “But they’ve certainly both decreased.”
One of the tools making finishing easier was invented at WashU in Bob Fulton’s finishing group. The process, known as prefinishing, picks out regions of low sequence quality, such as near the ends of contigs, and queues up the targeted “walking” reactions necessary to bridge the sequencing gaps. Fulton says the program, along with other efforts to efficiently manage the finishing process, has reduced the workload of the finishing staff by 75 to 80 percent.
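To give a feel for what such a step involves, here is a minimal sketch of flagging low-quality stretches in a contig so that walking reactions can be queued against them. It is not the WashU program: the per-base quality scores, the threshold of 20, the minimum run length, and the function name are all assumptions made for illustration.

```python
from typing import List, Tuple


def low_quality_regions(
    quals: List[int], threshold: int = 20, min_len: int = 3
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of consecutive bases whose
    quality is below `threshold`, ignoring runs shorter than `min_len`."""
    regions = []
    start = None  # index where the current low-quality run began, if any
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the contig.
    if start is not None and len(quals) - start >= min_len:
        regions.append((start, len(quals)))
    return regions


# Hypothetical contig: quality sags at both ends, with one isolated dip.
contig_quals = [12, 15, 18, 35, 40, 42, 41, 19, 38, 40, 17, 14, 10, 9]
walk_queue = low_quality_regions(contig_quals)
print(walk_queue)  # [(0, 3), (10, 14)]
```

In this toy example the two contig ends are queued for targeted reads, while the single weak base in the middle is left alone because it falls below the minimum run length.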
Other researchers have made similar progress, and predict the trend toward cheaper finishing is likely to hold as they continue to implement code in the near future. Currently, bioinformaticists at TIGR routinely check for discrepancies between overlapping sequence reads by hand, but Salzberg, TIGR’s senior director of bioinformatics, is testing an automated editor that scans the consensus sequence for errors in base calling. When the algorithm, designed by TIGR scientist Pawel Gajer, finds overlapping reads that don’t match, it re-analyzes the underlying chromatogram data to identify and correct the source of the anomaly. “Right now we have people doing it,” Salzberg says, “but in three or six months there won’t be” — making finishing cheaper still.
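The first half of that job, spotting where overlapping reads disagree, is simple to sketch. The example below is not Gajer’s algorithm and does not touch chromatogram data; it assumes reads have already been placed on a shared coordinate system without gaps, and the function name and sample reads are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_discrepancies(
    reads: List[Tuple[int, str]]
) -> List[Tuple[int, Dict[str, int]]]:
    """Given (offset, sequence) pairs on a shared coordinate system, return
    the positions where the covering reads do not all call the same base,
    along with the count of each base seen there."""
    columns: Dict[int, Counter] = {}
    for offset, seq in reads:
        for i, base in enumerate(seq):
            columns.setdefault(offset + i, Counter())[base] += 1
    return [
        (pos, dict(counts))
        for pos, counts in sorted(columns.items())
        if len(counts) > 1
    ]


# Three hypothetical overlapping reads; the second disagrees at position 6.
reads = [
    (0, "ACGTACGT"),
    (2, "GTACCTAA"),
    (4, "ACGTAAGG"),
]
print(find_discrepancies(reads))  # [(6, {'G': 2, 'C': 1})]
```

A real editor would then go back to the trace data at each flagged position to decide which call to trust, which is the step currently done by hand.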
And as the cost of finishing continues to decline, Salzberg says the debate over draft versus finished sequencing will gradually fade into the noise. “We’re working hard at TIGR to make completion of genomes faster and cheaper and we’re making rapid progress on that,” he says. “So I’m hoping the point will become less interesting as time goes by because we’ll be able to finish a genome almost as fast as we can shotgun, and maybe for only slightly more money.”
More Refined and Polished Drafts
Ironically, innovations that make draft sequencing more efficient and of higher quality also make finishing less tedious, which could weaken the argument for stopping at a draft. Scientists at large sequencing centers such as JGI and TIGR have developed new laboratory methods for improving sequence coverage by enlarging the size of the pieces of DNA inserted into E. coli clones. With longer clone inserts, assembling the pieces of DNA into larger contigs becomes more manageable and less time-consuming, improving the quality of the overall draft sequence.
“We are greatly increasing the quality of our draft, which not only gives you a better product, but also makes finishing easier because you have fewer gaps,” says Paul Richardson, manager of functional genomics at JGI. “The more gaps spanned in clones, it just improves the process overall, so we’ve actually made strides getting closer to the closed genome just by making improvements in the draft. That obviously will bring the price of finishing down significantly as well.”
Further impetus for moving away from the draft sequence strategy comes from the difficulty of finishing a sequence once a draft has been laid aside. Reassembling the clone libraries and data at another laboratory is no trivial task. Rick Myers, director of Stanford’s human genome center, which has a contract with JGI to perform much of its finishing work, says aligning the sequencing and finishing processes for a given genome is preferable because finishing “is a little easier if it’s done right away.”
In their paper supporting the draft sequencing strategy, Predki and Branscomb admit that “it may be substantially more expensive on average to finish draft sequence data later, should it prove desirable, than to do so at the start and in the same laboratory.”
Who Wants a Draft When You Can Get It Finished?
To be fair, many microbial researchers are happy to access genome information in any form, since they can identify most genes with a good quality draft sequence. David Wilson, a molecular biologist at Cornell University using Thermobifida fusca genome sequence data as part of his research into cellulose degradation, says that for his interests the draft sequence is just fine. “I’m trying to find the genes that are involved in [cellulose degradation] and look at their regulation,” Wilson says. “The sequence is good enough that I had no trouble finding all the genes we had already identified, and then identifying some other related genes in the genome.”
But even some who are managing fine with a draft still wish they could work with a finished sequence. John Dunn and Daniel van der Lelie, researchers at Brookhaven National Laboratory, agree that the draft sequence of their pet microbe, Ralstonia metallidurans, has proven quite valuable. With the draft, the two researchers have identified genes associated with proteins isolated from a proteomics study, and designed the primers necessary for expressing the genes in T7 expression systems.
However, Dunn and van der Lelie, like many of their colleagues working with draft sequences, are constrained in the kinds of higher-level analyses they can perform, and hope the draft sequence isn’t missing any genes. They say a finished sequence would allow them to investigate the response of R. metallidurans under stress conditions, as well as explore the crosstalk between the megaplasmids in the microbe’s genome. Says Dunn: “Having the complete sequence would be extremely useful for looking especially at regulatory regions, where we want to make sure that minor differences in promoter sequences are significant, and for assurance that these are real.”
End of the Draft?
Assuming the cost of finishing continues to decline, even proponents of the draft sequencing strategy acknowledge that the benefits of sequencing most new genomes to completion may soon be worth the cost. “If it’s economical to do finishing, then that’s definitely something we would consider,” says JGI’s Richardson. “We are working towards that goal of being able to finish all of our genomes; we just want to do it in a cost-effective manner.”
Although Richardson says there will still be cases, such as when sequencing strain variants, in which draft sequencing will be all that’s necessary, he says the momentum toward finishing is growing. “I think we will reach the goal of matching draft and finishing within two years,” he says. “We’re working toward that, but you’re still going to have the bottleneck of needing people-time to put the real finishing touches on and to make sure everything is OK.”
Cut the Middleman to Cut Costs
One of the main arguments for finishing genomes immediately is the high cost associated with reopening a draft genome — particularly if the group working on it isn’t the same group that did the draft. The number-one cost: obtaining and tracking all the templates used for the sequence.
But thanks to work by Dick McCombie, director of the genome sequencing center at Cold Spring Harbor, that may not be an impediment for much longer. A significant player in the rice genome effort, McCombie says, “It became clear almost two years ago in the rice project that there were going to be some groups drafting a huge amount more than they could finish. Initially people talked about mailing boxes of templates around, but that was totally impractical.”
McCombie and his team set out to develop finishing methods that don’t rely on the original templates. With rice, for instance, using those templates meant handling and tracking up to 3,000 subclones per BAC; with some 3,500 to 4,500 BACs covering the genome, that would have been a massive chore. McCombie’s crew lets researchers use easily accessible BAC libraries instead, and hopes one day to use just genomic DNA. “That simplifies the number of physical objects that you have to keep track of,” he says. In theory, anyone using his techniques could finish targeted regions of a draft genome at the same cost as the team that produced the draft.
McCombie’s crew has come up with algorithms that can work with many different finishing methods, from direct sequencing to long-range PCR. The software studies existing data and proposes regions to resequence or even picks primers from the BAC, McCombie explains. Though the software was designed for the rice genome, he says the algorithms are highly portable to other organisms — and might make a particularly good way to finish regions of individual human genomes.
Though Cold Spring doesn’t have the resources to provide support for the software, McCombie says the programs are available to the community. “We’ll mail it to anyone that wants to use it.”
— Meredith Salisbury
How Much Coverage is Necessary?
Eric Green, director of the NIH Intramural Sequencing Center, is trying to put to rest the draft-versus-finishing debate once and for all. As part of an ongoing project, Green’s group is collecting sequence data at various levels of maturation on multiple mammalian species to determine their associated costs and benefits.
“We are sequencing targeted regions of genomes in many different species,” he says. “It’s perfect for doing this exact kind of study, where you analyze that data to try to figure out whether it’s really worth taking this to pristine, perfect sequence or whether I could have learned everything I needed to learn with something slightly less.”
Although Green declined to elaborate on the details of his analysis, and says he doesn’t have plans to publish his results for another six months to a year, he adds that NHGRI and large sequencing centers are interested in his work. “All their eyes are on us,” he says. “They’re very interested to hear it because they’re going to use some of our data as arguments for how they want to finish their third, fourth, and fifth genome they’re sequencing.”