Looking for new methods to validate next-generation sequencing results, researchers at the Broad Institute have turned to the Pacific Biosciences RS machine to validate human sequence data from Illumina runs.
"Validation is a very important need," says Mauricio Carneiro, a computational biologist in the Program in Medical and Population Genetics at the Broad and lead author of the BMC Genomics paper on the PacBio validation protocol.
Methods such as Sanger sequencing and Sequenom genotyping are typically used to validate human next-gen sequence data, Carneiro says, but Sanger sequencing "is very laborious" and interpreting the data "is almost a black art." Meanwhile, Sequenom genotyping, while "cheap and fast," is "still not automated" and cannot be done on a large scale, he adds.
Because most of the sequence data generated by the Broad comes from Illumina instruments, Carneiro says it is important to validate the data with a different technology; otherwise, the validation is subject to the same errors and biases as the original data.
Carneiro's group, which focuses on medical resequencing projects, turned to the PacBio RS. Not only is it a different sequencing technology, but it also has a quick turnaround time, and the errors it generates are random rather than biased toward specific sequence motifs or certain regions of the genome.
To test whether the PacBio RS could serve as a validation tool, the team sequenced PCR amplicons from well-characterized genomes from the 1000 Genomes Project.
Amplicons were sequenced on both the PacBio RS and the MiSeq, and performance metrics for both were measured. The PacBio RS demonstrated 97 percent sensitivity and 98 percent specificity, with a negative predictive value of 98 percent and a positive predictive value of 97 percent. The MiSeq provided 100 percent sensitivity and 91 percent specificity, with a 100 percent negative predictive value and an 88 percent positive predictive value.
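For readers less familiar with these four metrics, the sketch below shows how they are conventionally derived from counts of true and false calls. It is an illustration only, not the Broad team's pipeline: the class and function names are hypothetical, and the counts in the usage example are made up, since the article does not report the underlying numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallCounts:
    """Counts of validation calls compared against a trusted truth set."""
    tp: int  # variant sites correctly called as variant
    fp: int  # non-variant sites wrongly called as variant
    tn: int  # non-variant sites correctly called as non-variant
    fn: int  # variant sites that were missed


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Return None instead of raising when a metric is undefined.
    return numerator / denominator if denominator else None


def validation_metrics(c: CallCounts) -> dict:
    """Standard definitions of the four metrics quoted in the article."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # share of true variants found
        "specificity": _ratio(c.tn, c.tn + c.fp),  # share of true non-variants found
        "ppv": _ratio(c.tp, c.tp + c.fp),          # share of variant calls that are right
        "npv": _ratio(c.tn, c.tn + c.fn),          # share of non-variant calls that are right
    }


# Hypothetical counts, chosen only to exercise the formulas.
example = CallCounts(tp=45, fp=5, tn=40, fn=10)
metrics = validation_metrics(example)
```

One consequence of these definitions: a platform that misses no true variants (zero false negatives) scores 100 percent on both sensitivity and negative predictive value, which matches the pattern in the MiSeq figures above.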
The PacBio RS correctly genotyped 96 of the 98 sites, while the MiSeq correctly genotyped 93.
Of the two sites the PacBio RS miscalled, one error was due to reference bias and was also miscalled by the MiSeq, while the other was wrongly called polymorphic and was also missed by the MiSeq and by a HiSeq whole-genome sequencing run. The three additional sites miscalled by the MiSeq were due to noise in the MiSeq data.
The PacBio RS is a "great tool for validation," Carneiro says, and the Broad has since used it to validate other human sequence data, such as in a recent exome sequencing study of medulloblastoma published in Nature.