NEW YORK (GenomeWeb) – Genotyping-by-sequencing is becoming an increasingly popular method for researchers studying crop and livestock genetics, as well as for ecologists studying various plants and animals. The method, which relies on reduced representation sequencing, does not require a SNP chip and can be tailored to suit researchers' needs. However, there has not been an extensive comparison and validation of different GBS pipelines to ensure that they accurately call SNPs, according to François Belzile, a professor at Université Laval in Quebec.
Belzile said his lab set out to compare various GBS pipelines because the research community has devised a growing number of tools to extract SNP data from sequencing reads, but the SNPs they call have seen little validation.
In a study published in PLOS One this week, Belzile's group assessed seven GBS pipelines, five of which rely on a reference genome and two of which do not. The researchers developed one of the pipelines, Fast-GBS, in their own lab. They also compared results from sequencing on an Illumina instrument and on Thermo Fisher Scientific's Ion Proton platform.
The researchers applied the seven pipelines to data from 24 soybean lines, which they had previously sequenced on an Illumina platform for a separate project. The sequence data "gave us a complete view of the polymorphic sites in those lines … where we have the true genotype," he said. "That gave us an opportunity to do quality-checking of the GBS pipelines."
The researchers sought to set parameters for all seven pipelines that would be as similar as possible, using 42 million Illumina HiSeq reads for each pipeline.
Of the two de novo pipelines, Stacks called roughly half as many SNPs as Uneak: 13,303 compared to 24,743. For the five reference-based pipelines (Fast-GBS, IGST, TASSEL-GBS v1 and v2, and a reference-based version of Stacks), the number of SNPs called varied between 18,941 for Stacks and 54,412 for TASSEL-GBS v1. In addition, IGST and Fast-GBS could also call indels. Most pipelines took between just over an hour and four hours to run, except for IGST, which took 13 hours. IGST also required around 10 times more memory than the other pipelines, at 240 gigabytes.
Next, the researchers looked at the quality of the data that each pipeline produced, both in terms of accuracy and in terms of the amount of missing data.
For the reference-based pipelines, there was a wide range of missing data, from as little as 28 percent for TASSEL-GBS v1 to 57.3 percent for Stacks. For the de novo pipelines, the proportion of missing data was more consistent, with 39.4 percent for Stacks and 41.3 percent for Uneak.
To evaluate the pipelines' accuracy, the researchers compared the SNP calls with those from the resequencing data. Overall, they found accuracy to be above 92 percent for all of the reference-based pipelines except TASSEL-GBS v1, which was only 76.1 percent accurate. Fast-GBS and IGST were the most accurate, at 98.7 percent and 98.4 percent, respectively. Not counting TASSEL-GBS v1, average accuracy for the reference-based pipelines was 95.6 percent. Accuracy for the two de novo pipelines was 93.6 percent for Stacks and 93.9 percent for Uneak.
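The two quality measures described above, the share of missing genotypes and the agreement with resequencing data, can be illustrated with a minimal Python sketch. It is not the study's code: the data layout (a dictionary keyed by chromosome and position), the function names, and the toy genotypes are all assumptions made for illustration, and accuracy is assumed here to mean the fraction of non-missing calls that match the reference genotype.

```python
from typing import Dict, Optional, Tuple

Site = Tuple[str, int]                          # (chromosome, position), hypothetical layout
Calls = Dict[Site, Dict[str, Optional[str]]]    # site -> sample -> genotype, None if missing


def missing_rate(calls: Calls) -> float:
    """Fraction of sample-by-site genotypes that were not called."""
    total = sum(len(by_sample) for by_sample in calls.values())
    missing = sum(
        1 for by_sample in calls.values() for gt in by_sample.values() if gt is None
    )
    return missing / total if total else 0.0


def accuracy(calls: Calls, truth: Calls) -> float:
    """Fraction of called genotypes that match the truth set.

    Genotypes missing from either the pipeline output or the truth set
    are skipped, so missing data does not count as an error.
    """
    compared = 0
    agree = 0
    for site, by_sample in calls.items():
        for sample, gt in by_sample.items():
            true_gt = truth.get(site, {}).get(sample)
            if gt is None or true_gt is None:
                continue
            compared += 1
            if gt == true_gt:
                agree += 1
    return agree / compared if compared else float("nan")


# Toy data: two sites, three made-up lines (A, B, C).
gbs_calls: Calls = {
    ("Gm01", 100): {"A": "0/0", "B": "1/1", "C": None},
    ("Gm01", 250): {"A": "1/1", "B": None, "C": "0/0"},
}
resequencing: Calls = {
    ("Gm01", 100): {"A": "0/0", "B": "1/1", "C": "0/0"},
    ("Gm01", 250): {"A": "0/0", "B": "1/1", "C": "0/0"},
}

print(f"missing: {missing_rate(gbs_calls):.1%}")
print(f"accuracy: {accuracy(gbs_calls, resequencing):.1%}")
```

Keeping the two measures separate matters because a pipeline can look accurate simply by declining to call difficult genotypes, which is why the article reports both figures for each pipeline.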
The high level of accuracy the researchers achieved with the pipelines, aside from the older TASSEL-GBS v1, "was very reassuring," Belzile said, adding that he was somewhat surprised to see such high accuracy even among the de novo pipelines. "Although the de novo tools call fewer SNPs, they do a surprisingly good job of calling them accurately."
With regard to the significantly lower accuracy of TASSEL-GBS v1, Belzile said that wasn't too surprising since one of the properties of that tool is that it trims all the reads to 64 bases. "Using shorter segments, you're much more subject to improper mapping of those reads," he said. Version 2 of the pipeline performed much better, he added.
The team also evaluated the degree of overlap between the pipelines, observing that the SNPs were more likely to be accurate when they were called by more than one pipeline. Fast-GBS was the most likely to call accurate SNPs, even when no other pipeline called a particular SNP. For instance, when comparing SNP calls among Fast-GBS, reference-based Stacks, TASSEL-GBS v2, and TASSEL-GBS v1, Fast-GBS called 3,148 unique SNPs, 97 percent of which were accurate. By contrast, TASSEL-GBS v1 called by far the most unique SNPs, 31,837, but only 65 percent of them were accurate.
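The overlap comparison can be sketched in a few lines of Python as a set operation: find the sites that only one pipeline called, then check what share of them appear in a truth set. This is a simplified illustration rather than the authors' method. The pipeline names, the sites, and the site-level definition of "accurate" (the site is present in the truth set) are assumptions.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Site = Tuple[str, int]  # (chromosome, position), hypothetical layout


def unique_sites(calls_by_pipeline: Dict[str, Set[Site]]) -> Dict[str, Set[Site]]:
    """For each pipeline, return the sites that no other pipeline called."""
    support = Counter(
        site for sites in calls_by_pipeline.values() for site in sites
    )
    return {
        name: {site for site in sites if support[site] == 1}
        for name, sites in calls_by_pipeline.items()
    }


def fraction_accurate(sites: Set[Site], true_sites: Set[Site]) -> float:
    """Share of the given sites that are also present in the truth set."""
    return len(sites & true_sites) / len(sites) if sites else float("nan")


# Toy data with made-up pipeline names and positions.
pipelines: Dict[str, Set[Site]] = {
    "pipeline_a": {("Gm01", 100), ("Gm01", 250), ("Gm02", 40)},
    "pipeline_b": {("Gm01", 100), ("Gm02", 75), ("Gm02", 90)},
}
true_sites: Set[Site] = {("Gm01", 100), ("Gm01", 250), ("Gm02", 40), ("Gm02", 75)}

for name, sites in unique_sites(pipelines).items():
    share = fraction_accurate(sites, true_sites)
    print(f"{name}: {len(sites)} unique sites, {share:.0%} accurate")
```

In this toy case both pipelines call two unique sites, but only one of them is right every time, which mirrors the contrast the article draws between Fast-GBS and TASSEL-GBS v1.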
Looking closer at the source of inaccurate SNP calls, the researchers found that although some were the result of errors in the pipelines' variant calling, the majority were due to reads mapping to more than one region in the genome.
Finally, the researchers studied the effect of using a different sequencing technology, the Ion Proton. They sequenced all 24 soybean samples on the Proton, but only evaluated two reference-based pipelines — Fast-GBS and TASSEL-GBS v2, which had performed the best — using 38 million reads.
The Proton tended to produce more sequencing errors than the Illumina HiSeq, although Belzile said that each platform had advantages. For instance, he said, although the Illumina HiSeq would be considered the "gold standard," researchers conducting GBS experiments often do not have their own sequencing instrument in house and have to send samples to a core facility. But because GBS experiments use single-end sequencing protocols, rather than the more common paired-end protocols, researchers who send samples out to get sequenced may have to wait in a long queue before the core facility has received enough samples that need single-end sequencing to make the run worthwhile. By contrast, the Proton has a lower throughput and the sequencing run itself is faster, Belzile said, so wait times are often much shorter. "For us, time is often of the essence," he said. "The crop is growing, is going to be harvested, and we need the answer."
Going forward, Belzile said it was reassuring that the Fast-GBS pipeline they developed in their lab performed well. "It tells us we're on the right track and we'll continue working with this tool," he said.
In addition, he said, the study should help assuage any concerns about the performance of GBS. "Some researchers tend to view GBS as a less accurate platform compared to SNP chips," Belzile said. "But when I see the results that we can get with these tools, we're quite close to SNP chips," he added, noting that GBS has the further advantage of flexibility. "You can tailor the number of SNP loci you interrogate and choose the protocol that suits the needs of the specific study you're doing," Belzile said.