This story was originally published Aug. 20.
Looking for new methods to validate next-generation sequencing results, researchers at the Broad Institute have turned to the Pacific Biosciences RS machine to validate human sequence data from Illumina runs.
The team is also in the process of developing a pooled validation protocol on the Illumina HiSeq that is capable of validating data from large-scale sequencing studies.
The researchers recently published their PacBio validation protocol in BMC Genomics.
"Validation is a very important need," Mauricio Carneiro, a computational biologist in the Program in Medical and Population Genetics at the Broad Institute and lead author of the BMC Genomics paper, told In Sequence.
Methods such as Sanger sequencing and Sequenom genotyping are typically used to validate human next-gen sequence data, Carneiro said, but Sanger sequencing "is very laborious" and interpreting the data "is almost a black art." Meanwhile, Sequenom genotyping, while "cheap and fast," is "still not automated" and cannot be done on a large scale.
Because most of the sequence data generated by the Broad comes from Illumina instruments, Carneiro said it is important to validate that data with a different technology; otherwise, the validation is subject to the same errors and biases as the original data.
So Carneiro's group, which focuses on medical resequencing projects, decided to test the PacBio RS. Not only is it a different sequencing technology, but it has a quick turnaround time and the errors it generates are random and not biased to specific sequence motifs or certain regions of the genome.
To test the PacBio's ability to be used as a validation tool, the team sequenced PCR amplicons from well-characterized genomes from the 1000 Genomes Project.
The amplicons spanned 98 variant calls based on Illumina GA sequence data, 38 of which had been validated as true de novo mutations and 60 of which had been discovered to be false calls using other sequencing technology.
Amplicons were sequenced on both the PacBio and the MiSeq and performance metrics for both were measured. The PacBio demonstrated 97 percent sensitivity and 98 percent specificity, with negative predictive value of 98 percent and positive predictive value of 97 percent. The MiSeq provided 100 percent sensitivity and 91 percent specificity, with a 100 percent negative predictive value and 88 percent positive predictive value.
PacBio correctly genotyped 96 out of the 98 sites, while the MiSeq correctly genotyped 93 sites.
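The four performance metrics reported above are standard confusion-matrix calculations. The sketch below reproduces the reported PacBio figures from counts inferred from the article's percentages (37 of 38 true mutations confirmed, 59 of 60 false calls rejected); the counts are an assumption for illustration, not taken directly from the paper.

```python
def validation_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics used to assess a validation platform."""
    return {
        "sensitivity": tp / (tp + fn),  # true sites correctly confirmed
        "specificity": tn / (tn + fp),  # false calls correctly rejected
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Counts inferred from the reported PacBio results: of 38 true de novo
# mutations, 37 were confirmed; of 60 known false calls, 59 were rejected.
pacbio = validation_metrics(tp=37, fp=1, tn=59, fn=1)
for name, value in pacbio.items():
    print(f"{name}: {value:.0%}")
```

Rounded to whole percentages, these counts yield the 97/98/97/98 figures reported for the PacBio.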
Of the two sites PacBio miscalled, one error was due to reference bias and was also miscalled by the MiSeq, while the other was wrongly called polymorphic, an error also made by the MiSeq and by a HiSeq whole-genome sequencing run.
The three additional sites miscalled by MiSeq were due to noise in the MiSeq data.
PacBio is a "great tool for validation," said Carneiro, and the Broad has since used it to validate other human sequence data, such as in a recent exome sequencing study published in Nature. In that study, the team sequenced 98 medulloblastoma exomes on the Illumina HiSeq and used the PacBio to validate 20 candidate mutations spanning 48 exons.
One reason why PacBio makes for a good validation technique, said Carneiro, is that its error mode is "absolutely random," which is not true of other sequencing technologies. Illumina, for instance, tends to do worse in GC-rich genomes, while 454 and Ion Torrent both have trouble with homopolymers, he added.
With PacBio, while most of the errors created are insertion errors, the machine performs equally well in GC-rich, AT-rich, and repetitive regions.
Aside from testing PacBio's ability to validate sequence data, the Broad team also evaluated it for variant discovery in human sequence data.
The researchers sequenced 177 kilobases in 61 amplicons across chromosome 20 using both the PacBio and MiSeq. The amplicons contained 225 SNPs that had been validated in a whole-genome sequencing dataset. The SNPs included 43 sites previously validated as high-confidence SNPs from HapMap data.
The PacBio called 197 of the 225 sites, including 38 of the 43 HapMap sites, while the MiSeq called 222 out of 225 sites and all 43 HapMap sites.
The team then manually inspected all the sites that were found to be discordant. The discordant sites were all due either to low sequence coverage or to reference bias during alignment.
Of the 28 sites missed by the PacBio, 12 were due to lack of coverage and 16 due to reference bias, while the three sites missed by the MiSeq were all due to low sequence coverage.
Carneiro said the Broad team has since made significant strides in fixing the errors due to reference bias. He attributed the errors primarily to the caller that was used, and he said the Broad is now using a "haplotype-aware" caller that reduces the problem dramatically.
"True variation was being hidden inside insertions because the aligner thought that the insertion had a better score than the true variation," Carneiro said. The haplotype-aware caller, developed by researchers at the institute, takes into account the entire region, or haplotype, realigns the reads, and then makes variant calls.
Nevertheless, Carneiro said that his team at the Broad is using the PacBio primarily for validation purposes and not discovery.
The team has made progress not just on small-scale validation, but also on validating whole-genome or exome sequencing results of hundreds or thousands of samples, Carneiro said.
The Broad group has developed a strategy, dubbed pooled validation, in which they are able to use the Illumina HiSeq, but run in a different way, to validate results.
Carneiro said the method will be published in a peer-reviewed journal and presented at the annual American Society for Human Genetics meeting in November.
The BMC Genomics study in which the team evaluated the PacBio for validation was a first step in answering a larger question: "How can we validate all this data we're generating?" Carneiro explained.
PacBio is good for small-scale studies, he said, but there's still a problem with validation from large-scale studies.
Since the HiSeq is the most cost-effective way to sequence, it was the obvious choice for large-scale validation. But because most of the Broad's production sequence is also generated on the HiSeq, the researchers had to tweak the way they use the instrument for validation.
"We redesigned the way we sequence to capture the error rate accurately," said Carneiro, and then use that model to "determine whether or not we're making mistakes."
For validation, 10 percent of a HiSeq lane is devoted to a sample from the 1000 Genomes Project that has been sequenced to "such huge depth that we're confident that we actually know the truth," Carneiro said.
Then after sequencing, the "error model sample" is analyzed first, and the calls from the sequence run are compared to what's known in order to determine "how many times the sequencer sequenced the right base for that sample," Carneiro said.
For example, if a site in the error model genome is a Q30 base, but after running the sequence there are more errors than expected, making the site a Q20 base, the probability of the machine making a mistake can be recalibrated, he explained.
"We create this concept of quality of the site to determine the probability of the actual data," Carneiro said. "We can pinpoint with great accuracy … the probability of being right on each site."
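The recalibration Carneiro describes rests on the standard Phred relationship, Q = -10·log10(p), where p is the probability of a base-calling error. The sketch below shows how an empirical quality could be derived by counting errors against the deeply sequenced truth sample; the function names and the example numbers are hypothetical, chosen to match the Q30-to-Q20 scenario in the text.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def empirical_quality(n_observed_errors, n_bases):
    """Recalibrated Phred score from errors observed against a
    deeply sequenced 'truth' sample of known genotype."""
    p = n_observed_errors / n_bases
    return -10 * math.log10(p)

# Hypothetical example: bases reported at Q30 (1 error expected per
# 1,000) actually show 10 errors in 1,000 against the truth sample,
# so the site is recalibrated to Q20.
print(round(empirical_quality(10, 1000)))
```

The recalibrated score, rather than the machine-reported one, then feeds the probability that each validated call is correct.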
Next, to apply this to large-scale studies, genomes are pooled. The pooling method works only when trying to validate data on a population scale, and not in cases where researchers would want to validate sequence data from an individual genome, Carneiro said.
"Looking at a population of 1,000 or 10,000, the question is not does patient number 349 have that mutation, but it's more a question of what is the frequency of that mutation?" Carneiro said.
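Since the pooled approach asks about mutation frequency rather than individual genotypes, the estimate of interest is the fraction of chromosomes in the pool carrying the allele. The sketch below is a minimal illustration of that idea, not the Broad's published method; the function and all numbers are hypothetical.

```python
def pooled_allele_frequency(alt_reads, total_reads, n_samples):
    """Estimate population allele frequency from a pooled sequencing run,
    rounded to the nearest frequency the pool can actually produce
    (one chromosome in 2 * n_samples for diploid samples)."""
    raw = alt_reads / total_reads
    chromosomes = 2 * n_samples
    return round(raw * chromosomes) / chromosomes

# Hypothetical pool of 50 diploid samples (100 chromosomes):
# 243 of 5,000 reads carry the alternate allele, so the estimated
# population frequency is 5 chromosomes in 100, or 0.05.
print(pooled_allele_frequency(243, 5000, 50))
```

In a pool, read counts stand in for chromosome counts, which is why the method answers population-scale questions but cannot say which individual carries the mutation.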
He said the Broad team has so far validated 57,000 sites across the 1,200 genomes sequenced as part of the 1000 Genomes Project, and that the specificity and sensitivity of the approach appear to be very high. The Broad will present this data at the ASHG meeting in November.