Researchers performing human genome resequencing studies have an ever-expanding range of technologies at their fingertips — including short-read sequencers, longer-read platforms, and microarrays — but determining the optimal recipe of different systems that will generate the most useful information at the lowest overall cost is still a guessing game for many labs.
To help address this challenge, a team at Yale University has developed a "simulation toolbox" called ReSeqSim that can determine the combination of different technologies that will generate the most accurate assembly at the lowest cost, particularly for studies focused on reconstructing large structural variants.
"I think for a lot of people the thought process is really to use one particular technology — that you can do everything with Solexa, with 454, and so forth," Mark Gerstein, one of the method's developers, told In Sequence. "But we've been thinking that maybe there's some way of combining the technologies so that you could achieve a better outcome — or a quicker, lower cost outcome," than with one platform alone.
While some studies have used a combination of different platforms to analyze human genome data — such as the recently published Korean genome for which researchers at Seoul National University used a combination of whole-genome shotgun sequencing, targeted bacterial artificial chromosome sequencing, and microarray-based comparative genomic hybridization analysis (see In Sequence 7/14/2009) — there is currently no way for researchers to determine the best experimental design for such work.
This challenge is particularly acute for structural variant analysis. As Gerstein and his co-authors explain in a paper describing the method that was published last month in PLoS Computational Biology, "At one extreme, performing long Sanger sequencing with a very deep coverage will lead to excellent results at high cost. In another, performing only the inexpensive and short Illumina sequencing may generate good and cost-efficient results in SNP detection, but will not be able to either unambiguously locate some of the [structural variants] in repetitive genomic regions or fully reconstruct many of the large SVs."
In particular, issues arise when looking for structural variants larger than 3 kilobases because this "requires the integration of reads spanning a wide region, often involving misleading reads from other locations of the genome," the authors wrote.
Due to the large number of repeats and duplications in the human genome, "a set of longer reads will be required to accurately locate some of these SVs in repetitive regions, and a hybrid resequencing strategy with both comparative and de novo approaches will be necessary to identify genomic rearrangement events such as deletions and translocations, and also to reconstruct large novel insertions in individuals."
Gerstein said that he and his colleagues set out "to come up with some principles, or a systematic way to think about that, and to develop a computational framework or statistical framework so that you could come up with the most optimal combination of technologies" for human genome resequencing and analysis.
The resulting system, available here, simulates the sequence assembly process in order to determine the optimal combination of long, medium and short reads, the best use of single and paired-end reads, and the extent to which comparative genomic hybridization arrays will achieve a specified level of performance at a given cost.
Rather than compute all possible technology combinations for the entire genome assembly process — a task that would require hundreds of millions of CPU hours — Gerstein and colleagues rely on a concept called the "mapability map," which essentially precomputes how often a given k-mer occurs in the genome. This provides a rough approximation of the repeat content in the genome, so that the method only needs to simulate a representative segment — say a large insertion — and then extrapolate those findings across the entire genome.
"We take the entire genome and we precompute how often something occurs in the genome," Gerstein said. "Then, when we go to simulate the region of interest, we do a biased simulation where we generate more simulated reads from the regions that have high mapability, and that reflects the fact that when you're drawing reads randomly from the entire human genome, of course you're going to draw more that map to [a repetitive region] because you're drawing from many places in the genome."
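The precompute-then-bias idea Gerstein describes can be illustrated with a toy sketch. This is not the ReSeqSim implementation — the function names are hypothetical and a real tool would use an efficient index over the full human genome — but it shows the two steps: count how often each k-mer occurs genome-wide, then draw simulated reads with probability proportional to that count, so repetitive regions are oversampled just as they would be when drawing reads at random from the whole genome.

```python
from collections import Counter
import random

def mapability_map(genome: str, k: int) -> list:
    """For each start position, how many times its k-mer occurs genome-wide."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return [counts[genome[i:i + k]] for i in range(len(genome) - k + 1)]

def biased_reads(genome: str, k: int, n: int, seed: int = 0) -> list:
    """Draw n simulated reads, weighting each start position by its
    mapability, so repetitive regions contribute more reads."""
    rng = random.Random(seed)
    weights = mapability_map(genome, k)
    starts = rng.choices(range(len(weights)), weights=weights, k=n)
    return [genome[s:s + k] for s in starts]

# Toy genome: the repeated "ACGT" motif gets higher mapability values.
genome = "ACGTACGTGGCCAACGT"
print(mapability_map(genome, 4)[:5])
```

Because the weighting reproduces the read-depth bias of genome-wide sampling, only the region of interest ever needs to be simulated, which is where the claimed speedup comes from.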
Gerstein estimated that the mapability map approach reduces computation time by a factor of around 100,000, meaning a researcher could easily simulate a thousand different technology combinations in about one CPU hour.
As an example of how the tool can be used in practice, the authors simulated the reconstruction of an insertion of around 10 kilobases using a combination of long, medium, and short reads with an assumed cost of $7 for the reads that cover the insertion. In order to achieve "optimal performance," according to the simulation, the experiment would require around 0.05-fold coverage with long (Sanger) reads, around 7-fold coverage with medium (454) reads, and around 12-fold coverage with short reads (Illumina or SOLiD).
Gerstein noted that the approach assumes a fixed cost. "In a sense, the best situation is always going to be more and more reads, but here you have a situation where you're fixed, so if you get more 454 reads you have to have [fewer] Solexa reads."
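The fixed-budget tradeoff Gerstein describes can be sketched as a small grid search. The per-coverage costs and the performance function below are made-up placeholders, not the paper's model or figures; the point is only the structure of the optimization — more of one platform's reads leaves less budget for another's.

```python
import itertools

# Hypothetical per-1x-coverage costs (arbitrary units), not the paper's figures.
COST = {"sanger": 100.0, "454": 10.0, "illumina": 1.0}

def performance(cov: dict) -> float:
    """Toy stand-in for assembly accuracy: diminishing returns per platform,
    with longer reads weighted more heavily for resolving repeats."""
    weight = {"sanger": 5.0, "454": 2.0, "illumina": 1.0}
    return sum(weight[p] * (1 - 2 ** -c) for p, c in cov.items())

def best_mix(budget: float, grid: list) -> dict:
    """Exhaustively score every coverage combination that fits the budget."""
    best, best_score = None, -1.0
    for s, m, i in itertools.product(grid, repeat=3):
        cov = {"sanger": s, "454": m, "illumina": i}
        cost = sum(COST[p] * c for p, c in cov.items())
        if cost <= budget and performance(cov) > best_score:
            best, best_score = cov, performance(cov)
    return best

print(best_mix(100.0, [0, 0.05, 0.5, 2, 5, 10]))
```

In ReSeqSim the equivalent of `performance` would come from the simulated assembly itself, but the budget constraint plays the same role: raising one platform's coverage forces another's down.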
In addition to long, medium, and short reads, ReSeqSim can simulate different combinations of single or paired ends, and can also throw arrays into the mix. In one example, the authors used the tool to determine that a CGH array's ability to accurately detect a deletion is "comparable" to 16-fold coverage with short-read sequencing, but at much lower cost — approximately $1,000 for an array as opposed to around $300,000 for sequencing.
Gerstein said that the toolkit was designed to adapt to the rapid pace of sequencing technology development, so parameters such as cost, read length, throughput, and the like can all be updated as needed and new technologies can be added as they become available.
He explained that one driver for the project was a "debate" underway among participants of the 1000 Genomes Project regarding how much of the project's focus should be on structural variation analysis as opposed to SNP calling.
"We've been using the simulation toolbox a bit in kind of advocating different strategies in 1000 Genomes," Gerstein said. "I think we've gotten people to agree to do some additional 454 and long-read sequencing on select individuals because of the obvious demonstration of the fact that the reconstruction of the structural variants will be improved."
He acknowledged that there are good reasons why the 1000 Genomes Project would be reluctant to plunge neck-deep into structural variation. "One of the reasons is that no matter how you do your simulations, at the end of the day you're going to have to do more sequencing to do reconstruction of structural variants," he said.
Nevertheless, he believes that a happy medium can be struck that provides both types of information for the project. "One of the things we show in our paper is that at any time when you're able to reconstruct the structural variants, you're always able to call the SNPs," he said. "So if you come up with an optimal structural variant reconstruction strategy, you'll also be able to call SNPs."
He noted, however, that for those who just want to call SNPs, "you're obviously best off just to do short-read sequencing."
Ryan Mills, bioinformatics team leader in Charles Lee's lab at Brigham and Women's Hospital, told In Sequence that ReSeqSim appears to be a "useful and necessary sort of tool."
While the Lee group is known for being an array-CGH lab, "we're starting to move into more sequencing projects and it would be helpful to be able to plan ahead [to know] what we should focus on and how that fits in our budget," he said.
Mills noted that his team hasn't used ReSeqSim yet, but is considering it for future grant proposals. "You want to get most bang for your buck, and being able to specifically choose platforms that will give optimal return would definitely be of interest."
Mills said that participants in the 1000 Genomes Project are finding that "no single technology is going to give you everything," and that many labs have found that even SNP calling can be improved with a combination of different platforms.
While it would be nice to "complement everything with array-CGH," Mills acknowledged that would be "ridiculously not cost effective," and that it would be helpful for many groups to know exactly what technologies they need to generate the results they require.
While he couldn't vouch for the accuracy of the simulation approach, he noted that it's "still more powerful than nothing at all."
Gerstein said that ReSeqSim is one piece of a broad "simulation toolbox" under development in his group for next-generation sequence analysis. The first of these tools, published in PLoS Computational Biology last year and available here, focused on ChIP-seq experiments, while the second, published in Genome Biology in February and available here, addressed resequencing with paired-end data.
A simulation tool for transcriptome analysis is next on the agenda. "The transcriptome is really the next frontier," Gerstein said. "I would say it's significantly harder than the genome and ChIP-seq."
Gerstein noted that alternative splicing, the wide dynamic range of expression data, and "the ambiguity as to exactly the full extent of what you're reconstructing" make transcriptome analysis much more difficult than other applications, "so how you combine the technologies becomes a bit trickier."