Researchers from the Institute of Clinical Molecular Biology at Christian-Albrechts University in Kiel, Germany, have developed a two-stage mapping protocol to improve SNP calling from multiplexed targeted sequencing experiments.
The method builds on the idea that for targeted sequencing experiments, it is more cost-efficient to multiplex samples first and then enrich them than to enrich each sample separately and then multiplex them. The two-stage mapping protocol, published this month in BMC Genomics, relies on first doing a local mapping to the target region, and then mapping only those reads back to the reference genome.
The protocol, when applied to exome sequencing data from the 1000 Genomes Project, reduced analysis time by 12 hours per sample and detected around 5 percent more SNPs, the researchers reported.
The team is now working on methods to automate more of the analysis.
Abdou ElSharawy, the lead author of the paper, told In Sequence that the protocol will work with any sequencing and enrichment technologies and that his group has benchmarked the protocol with other enrichment technologies, including Fluidigm, Agilent SureSelect, RainDance, and Olink Genomics, as well as various whole-exome capture kits.
In the current paper, the group validated the method using hybrid capture technology from Febit and SOLiD sequencing on targeted, multiplexed regions from Escherichia coli and on multiplexed BRCA1 and BRCA2 regions from HapMap samples.
Then, they applied just the two-step mapping approach to exome sequencing data generated with Agilent SureSelect and sequencing on the HiSeq 2000 from two samples from the 1000 Genomes Project.
ElSharawy said the goal for the first part of the experiment was to see how pooling multiple samples first and then doing the enrichment would impact variant discovery and analytical performance of SNP calling in targeted sequencing experiments.
Pooling before doing the enrichment step cuts down on time and cost, he said, because all the samples can be enriched at once, instead of a separate enrichment or capture step for each sample.
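The saving from pooling first can be sketched with simple arithmetic: if samples are pooled before capture, the number of enrichment reactions drops to the number of pools. The snippet below is only an illustration; the cohort size of 80 samples and the helper name `enrichment_reactions` are assumptions made for the example, not figures from the study. The plex levels are the ones the group tested.

```python
import math

def enrichment_reactions(n_samples: int, plex: int) -> int:
    """Number of capture reactions needed when `plex` samples share one pool."""
    return math.ceil(n_samples / plex)

# Hypothetical cohort size, chosen only for illustration.
n_samples = 80

# plex = 1 corresponds to enriching every sample separately.
reactions = {plex: enrichment_reactions(n_samples, plex) for plex in (1, 4, 8, 16, 20)}

for plex, count in reactions.items():
    print(f"{plex:>2}-plex: {count} enrichment reaction(s)")
```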
However, once the sequence data is generated, calling SNPs from targeted sequencing data can be tricky. Currently, said ElSharawy, there are two methods to map and call SNPs.
Either all the reads can be mapped back to the whole genome and then SNPs can be called, or the reads can just be mapped back to the target region. Mapping to just the target region is faster, he said, "but the problem is that it forces some of the reads to map there, so it has a high false-positive rate."
On the other hand, mapping back to the entire genome is much more time consuming, and tends to result in a higher false-negative rate.
As a result, the team decided to combine the two methods. "If we map to the target region at the beginning, the process is faster," ElSharawy said. "Then we take only the reads that map to the [target region] and align to the whole-genome reference."
The combination helps to reduce the false positives caused by forcing reads to map to the target region, while also lowering the false-negative rate seen with whole-genome mapping.
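The logic of the two-stage approach can be illustrated with a toy example. The sketch below is not the authors' pipeline: it assumes a simple ungapped, mismatch-counting aligner (the hypothetical helper `best_hit`), a made-up reference with a near-identical copy of the target elsewhere in the genome, and a tolerance of one mismatch. It shows only how a read that is forced onto the target in stage one can be discarded in stage two when its best whole-genome placement lies outside the target.

```python
from typing import List, Optional, Tuple

def best_hit(read: str, ref: str, max_mismatches: int) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) of the best ungapped placement, or None."""
    best = None
    for pos in range(len(ref) - len(read) + 1):
        window = ref[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best

def two_stage_map(reads: List[str], genome: str, start: int, end: int,
                  max_mismatches: int = 1) -> Tuple[List[str], List[str]]:
    """Stage 1: keep reads that map to the target. Stage 2: re-map only those
    reads to the whole genome and keep the ones whose best hit is on target."""
    target = genome[start:end]
    candidates = [r for r in reads if best_hit(r, target, max_mismatches)]
    kept = []
    for r in candidates:
        hit = best_hit(r, genome, max_mismatches)
        if hit is not None and start <= hit[0] and hit[0] + len(r) <= end:
            kept.append(r)
    return candidates, kept

# Toy reference: a target region plus a one-base-different copy elsewhere.
target_seq = "ACGTTGCAAGCTTAGGCT"
paralog_seq = "ACGTTCCAAGCTTAGGCT"
genome = "T" * 10 + target_seq + "G" * 10 + paralog_seq + "C" * 10
start, end = 10, 10 + len(target_seq)

on_target = "GTTGCAAGCT"    # exact match inside the target
from_paralog = "GTTCCAAGCT"  # really comes from the copy outside the target
with_snp = "GTTGCAAGGT"      # on-target read carrying one true variant
off_target = "CCCCCCCCCC"    # unrelated read

candidates, kept = two_stage_map(
    [on_target, from_paralog, with_snp, off_target], genome, start, end)
print("mapped to target only:", candidates)
print("kept after whole-genome check:", kept)
```

In this toy case, target-only mapping accepts the paralog-derived read with one mismatch, which would look like a SNP; the second stage removes it because it matches perfectly elsewhere, while the read carrying a genuine variant is retained. Only the stage-one survivors are aligned to the full genome, which is where the time saving would come from.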
In the study, the researchers first designed a test model to look at the performance of their "multiplex first, enrich second" protocol on E. coli samples, targeting 68 genes. They tested multiplexing at 4-, 8-, 16-, and 20-plex, demonstrating that the protocol yielded comparable results to performing separate enrichment steps on each sample before multiplexing. At the 20-plex level, however, the results were less reproducible.
Next, they tested the protocol on HapMap individuals, targeting the BRCA1 and BRCA2 genes, at the same four multiplexing levels: 4-, 8-, 16-, and 20-plex. The enrichment design captured just 54 percent of the BRCA1 and BRCA2 genes, which the authors attributed to the hybridization-based bait designs. In the future, they wrote, "longer capture baits and iterative refinement of bait design would be required for such genomic regions with low complexity."
Then they tested their two-step mapping technique. The approach led to "more valid SNPs being detected and a more than two-fold speed-up" of the analysis, the authors wrote.
Additionally, the technique discovered four genotyping errors in the HapMap data and also found up to 4.4-fold more known SNPs than have been published in the HapMap chip data.
Finally, the group tested just the mapping technique on publicly available sequence data from the 1000 Genomes Project. On two exome sequencing samples the team demonstrated that the two-step mapping technique reduced compute time to 10 hours from 22 hours per sample, and increased the number of SNPs called by around 5 percent.
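As a quick check on the reported timings, the figures above work out as follows. The variable names are illustrative; the two hour values are the ones quoted for the 1000 Genomes samples.

```python
# Reported per-sample compute times for the two exome samples.
hours_whole_genome = 22
hours_two_stage = 10

hours_saved = hours_whole_genome - hours_two_stage
speedup = hours_whole_genome / hours_two_stage

print(f"time saved per sample: {hours_saved} hours")
print(f"speed-up: {speedup:.1f}-fold")
```

The result is consistent with both the 12-hour saving and the "more than two-fold" speed-up the authors describe.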
"It saves money, time, and we can get more SNPs that are not called or detected if you use only the whole-genome mapping approach," said ElSharawy.
One current limitation of the technique, said ElSharawy, is that some of the reads have to be manually inspected, which is time-consuming. However, the team has developed a method to reduce the amount of manual inspection and he expects a paper on the approach to be published soon in a peer-reviewed journal.
The new software tool will help "refine the reads" to "reduce the number of variants that have to be validated," he said. He declined to go into detail, but said the tool essentially re-orders SNPs based on the probability of each being a real variant or a sequencing artifact.
Additionally, he said his team plans to publish papers testing other enrichment technologies and sequencing platforms.