Researchers from Israel and the US have come up with a weighted sample pooling scheme intended to curb the cost of high-throughput sequencing applications ranging from carrier screening to whole-exome sequencing studies.
A simulation study outlining the strategy appeared in the journal Bioinformatics last month.
"Up until now, all pooling was based on mixing equal amounts of whatever it is you're pooling — usually DNA in our context — and sequencing that," the study's first author David Golan, a graduate student in Saharon Rosset's statistics lab at Tel Aviv University, told In Sequence. "But you don't have to pool equal amounts. You can pool different amounts. And that's the major idea behind our paper."
As he and his co-authors reported in the study, the relatively straightforward approach appears to be highly accurate for detecting rare alleles in sequence data and linking them to individuals in the pool, similar to previously described pooling methods that involve placing equal amounts of individual samples in multiple, overlapping pools. But the weighted method seems to be better suited for picking up common variants missed by combinatorial pooling, they found.
"We demonstrate that this approach is not only easier to implement and analyze but is also competitive in terms of accuracy with combinatorial designs when identifying rare variants," Golan and co-authors wrote, "and is superior when sequencing common variants."
In its simplest form, known as non-overlapping weighted design, or NWD, the newly devised pooling method involves placing a different amount of each sample into a single pool prior to sequencing (no sample appears in more than one pool), so that samples can be linked to a specific variant or set of variants based on the number of reads generated.
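To make the idea concrete, here is a minimal Python sketch of the weighted encoding principle — not code from the study, and the powers-of-two weights are an illustrative assumption rather than the paper's optimized design. Because every subset of weights sums to a unique value, the pooled signal (in practice, an estimate from variant-supporting read counts) identifies exactly which samples carry the variant.

```python
# Hypothetical sketch of the non-overlapping weighted design (NWD) idea:
# each sample contributes a distinct amount of DNA to one pool, so the
# variant-supporting signal encodes which samples are carriers. Powers-of-two
# weights (an illustrative assumption) make decoding a binary expansion.

def pool_weights(n_samples):
    """Assign each sample a distinct weight (powers of two for illustration)."""
    return [2 ** i for i in range(n_samples)]

def observed_signal(carriers, weights):
    """Total weighted contribution of variant alleles in the pool.
    In a real experiment this would be estimated from read counts."""
    return sum(weights[i] for i in carriers)

def decode(signal, weights):
    """Recover the carrier set from the pooled signal, largest weight first."""
    carriers = set()
    for i in reversed(range(len(weights))):
        if signal >= weights[i]:
            signal -= weights[i]
            carriers.add(i)
    return carriers

weights = pool_weights(8)
carriers = {1, 4, 6}
sig = observed_signal(carriers, weights)  # 2 + 16 + 64 = 82
assert decode(sig, weights) == carriers
```

In a real experiment the signal is noisy, which is why the authors' decoding relies on statistical inference rather than an exact expansion like this one.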
The weighted approach can also be combined with combinatorial pooling when sufficient sequence coverage is available, the study's authors explained. As such, they argued, weighted pooling appears to be amenable not only to sequencing-based screens for rare alleles in many individuals, but also to sequencing experiments looking at sets of genes or even entire exomes.
"[W]e argue that weighted designs have enough power to facilitate detection of common alleles," they wrote, "so they can be used as a cornerstone of whole-exome sequencing projects."
And while sample pooling tends to conjure up thoughts of case-control studies focused on finding causal variants behind a trait or disease of interest, Golan emphasized that the rise in available sequence coverage is making pooling an attractive option for a variety of sequencing applications.
"We're not only interested in what happens in the cases versus controls," he said. "We want the whole genome, the whole sequence, to be able to associate the sequence with the individual."
After being inspired to pursue new pooled sequencing schemes during a departmental seminar at Tel Aviv University, Golan set out to collaborate with Whitehead Institute researcher Yaniv Erlich, whom he called "one of the leading figures in pooled sequencing."
As a graduate student in Cold Spring Harbor Laboratory researcher Greg Hannon's lab, for instance, Erlich helped to come up with a combinatorial pooling scheme dubbed DNA Sudoku for finding rare variants in large cohorts by mixing and matching samples in complex pools that could be sequenced and then decoded to link specific sequences to their sample source (IS 6/9/2009).
In that strategy, a given sample is in a different pool every time that it's sequenced, Erlich explained, and based on the sequence patterns across the pools, it's possible to recover a rare variant and associate it with a specific sample.
"The pools overlap with each other … and there is a really nice theory that can help to decode the samples just based on the pattern of pools," he said.
With the new weighted pooling approach, Erlich told IS, "the amount [of each sample] that you take is like a code," making it possible to work back from the number of reads generated for a pooled sample to figure out which variant or set of variants is associated with a specific sample.
"This paper is a theoretical extension to our previous work," added Erlich, a co-author on the new study. "It's another example of how you can use coding theory in order to do a better experiment … You can really harness the concepts of coding theory to multiplex your samples."
In general, such pooling methods have been spawned as a result of the dramatic rise in sequencing output, he explained, a shift that has increased the importance of preparing samples in ways that efficiently exploit available sequencing capacity.
To that end, combinatorial pooling methods, developed by Erlich's group and others (IS 3/15/2011), have been helping to find rare variants in large cohorts. But creating complex combinatorial pools can be technically tricky, he said, especially when looking at many samples.
With their latest iteration of pooled sequencing, Erlich and his team believe they have found a simple pooling method that also gets around a limitation of combinatorial pooling: its limited ability to link common alleles to individuals within each pool.
Because the two approaches stem from different branches of mathematics, explained Golan, the ability to find variants drops off sharply for higher frequency variations in combinatorial pooling sequencing experiments but declines much more gradually when a weighted sample pooling method is used.
In other words, whereas combinatorial pooling works well for finding rare alleles, the accuracy of this approach drops off dramatically for variants found in around 3 percent of a population or more, researchers explained. In contrast, the accuracy of the NWD-based method tapers off more gradually, allowing researchers to examine more common variants with only a modest increase in the error rate.
"It doesn't have this breakpoint in the frequency [of alleles detected]," he said. "It's a more continuous decline, so you can recover the common alleles with 2, 3, or 5 percent error, depending on the coverage."
In the first of their simulations, the team compared the NWD pool to two combinatorial pooling strategies — the Chinese remainder theorem on which the DNA Sudoku strategy is based and the shifted transverse design method — in a carrier screening context, using Tay-Sachs disease as an example.
There, they found that the accuracy of disease-allele detection dropped off relatively slowly at minor allele frequencies above 3 percent for the samples pooled using NWD, assuming 1,000-fold coverage depth, 1,000 individuals tested, and a 1 percent error rate.
In contrast, researchers reported, the performance of the combinatorial pooling methods dropped off precipitously when the disease variants were present at frequencies of 2.5 percent to 3 percent or more in the population, making disease variants indiscernible from wild type at these higher frequencies.
Similarly, they found that adding weighted pooling to either of the combinatorial methods could stretch out the detectable minor allele frequency compared to combinatorial pooling alone, allowing for the identification of slightly more common variants.
The cost reduction from pooling is expected to be particularly pronounced for this paired weighted and combinatorial strategy, which the study authors called a hybrid design, an approach that they said is best suited to situations where very high sequence coverage is available and researchers want to find rare variants from a minimal number of pools.
For instance, the team calculated that it would cost less than one-third as much to capture and sequence one million bases to 30 times coverage with the Illumina HiSeq 2000 for 1,000 individuals using a hybrid pooling method than it would to sequence the individuals without pooling, though the reliability is expected to drop off for variants found in more than 4 percent of the population.
"If you have a lot of individuals and you're interested in rare SNPs, then combinatorial pooling or weighted combinatorial pooling — what we call the hybrid approach — would give you a huge reduction in cost," Golan said.
Another option, in some cases, is to use weighted pooling in combination with barcoding, researchers explained — for instance, by barcoding the pools themselves and multiplexing them on the same flow cell.
"At least theoretically, some weighting is always beneficial," Golan said, though he cautioned that pooling is less effective when the region of interest is so big that it's not possible to achieve sufficient sequence coverage.
Still, the team's simulations suggest that even whole exomes could be pooled by the weighted method if researchers are willing to accept a corresponding climb in error rate.
For example, in exome sequencing simulations that involved pooling two samples, the study's authors calculated that the error rate for variants found in 5 percent of the population was as low as 1.3 percent at 150X coverage and as high as 5.9 percent at 30X coverage.
At the 80X exome sequence coverage mark, the error rate for these relatively common variants was just over 3 percent, while the estimated cost was roughly half that of sequencing each exome on its own.
"If you do this trick of manipulating the quantities of DNA, you get the whole sequence and you don't have to restrict yourself to the rare mutations," Golan said.
Given the amount of sequence output needed for pooled exome sequencing, such experiments are becoming possible but are still complicated, cautioned co-author Erlich, who said that he believes the current "sweet spot" for weighted pooling schemes is sets of a few hundred genes.
Though the current paper is based on simulations, Golan said some study authors are doing experiments that use the hybrid weighted-combinatorial pooling method to test specific genes in dozens of samples.
He noted that he is also keen to partner with researchers interested in taking a crack at using the non-overlapping, weighted pooling approach either in a proof-of-principle study or for a broader biological analysis.
Although researchers say it should be possible to pair their pooling strategy with any of the high-throughput sequencing platforms on the market, they noted that Illumina instruments have an edge owing to the deep coverage they offer.
"Theoretically, it should work with every sequencing platform," Erlich said. "But, of course, you need high capacity. And I think Illumina is the right method for that."
"If you have lower coverage, there's a chance that you won't sample one of the individuals [in the pool]," Golan agreed. "So coverage is key here."
While it should be possible in principle to link sequences to an individual in a pooled scenario using technologies that offer lower coverage but much longer reads, he added, that is not something the group has tested.
"You'd have lower coverage, but you'd still be able to associate each read to the individual because you have longer reads and more variants on each read," Golan said. "But that's sort of a different perspective."
On the computational front, meanwhile, he noted, team members have been working on ways to more efficiently decode their weighted pooled sequence data, since the so-called 'belief propagation' algorithm used for simulations in the Bioinformatics paper appeared to be less than ideal for hands-on pooled sequencing studies.
Since that realization, Golan explained, the researchers have invested time into developing new decoding algorithms and statistical modeling methods to deal with weighted pooling-based sequence data.
"Once we dug into the data and understood the behavior of sequencing better and the behavior of these algorithms better, we were able to come up with something that really does better," he said.
The team is hoping to prepare a paper outlining that computational strategy in the near future, and Golan said he expects that the software will be made available to other researchers once the method matures from its current ad hoc state to a more user-friendly form.