By Julia Karow
When a single sequencing platform is used to analyze a human genome, thousands of true, and potentially important, variants may go unnoticed, according to a recent comparison between the Illumina and Complete Genomics platforms by researchers at Stanford University.
But for reasons of cost, large-scale sequencing projects may not immediately heed the advice of the scientists: to sequence the same sample on both platforms in order to obtain the most complete picture.
Overall, almost 90 percent of variants identified by the two platforms overlapped, and by and large, neither platform did significantly better than the other, though the scientists observed differences in the false-positive and false-negative rates of the technologies. The study was led by Mike Snyder at Stanford's department of genetics.
According to Michael Clark, a postdoc in the Snyder lab and one of the authors, he and his colleagues had expected to find some differences between the two platforms, "but we did not expect there to be hundreds of thousands of variant differences."
"The authors clearly demonstrate that, to see the full picture, it is advisable, if not necessary, to take advantage of more than one sequencing platform," said Stephan Wolf, head of the sequencing core facility at the German Cancer Research Center in Heidelberg. "Combining the pros of different sequencing systems probably is the most effective and economic way of deciphering mutations in human disease."
For their comparison, the Stanford group sequenced two DNA samples from a single individual, one derived from blood and the other from saliva, on the Illumina HiSeq 2000 and through Complete Genomics' service, with a total coverage of about 150x per platform. While Stanford's Illumina instruments produced 101-base paired-end reads, Complete generated 35-base paired-end reads on its proprietary sequencers.
The Stanford scientists then aligned the Illumina reads to the human reference genome with the Burrows-Wheeler aligner and used the Genome Analysis Toolkit to call about 3.6 million single nucleotide variants, applying a quality filter from the 1000 Genomes Project. They also used GATK to call about 610,000 insertions or deletions. Complete Genomics used its in-house analysis pipeline to detect about 3.4 million SNVs and about 430,000 indels.
The overall number of SNVs called from blood DNA and saliva DNA was roughly the same. The researchers noted some tissue-specific variants, but few of these could be validated by other methods, a result they plan to publish in the future.
About 3.3 million of the 3.7 million total SNV calls, or almost 90 percent, overlapped between the two platforms, and various analyses and validation studies showed that these were highly accurate. The picture was quite different for the indels: only about a quarter of all indels identified were called by both platforms. For technical reasons, concordant indels were difficult to sequence by Sanger, but almost all of those that could be sequenced were validated.
But Complete Genomics also called about 100,000 platform-specific SNVs, and Illumina almost 350,000, quite a few of which fell into repetitive regions of the genome. Most of them were not called at all by the other platform, and validation experiments determined that both sets contain a large fraction of false positives.
For example, spot checks of a small number of the platform-specific variants revealed that only about 15 percent of the Illumina variants could be validated by Sanger sequencing, but almost 95 percent of the Complete Genomics ones could be. Using another validation method, Agilent SureSelect capture followed by Illumina HiSeq sequencing, the researchers were able to validate about half of 33,000 Illumina-specific and half of 3,000 Complete Genomics-specific variants they targeted.
Though platform-specific variants appear to be error-prone, many of them are still accurate, and some may even be functionally important. For example, the researchers checked how many platform-specific variants were contained in the Varimed database, which contains variants that have been associated with disease, and found 31 Illumina-specific and 3 Complete Genomics-specific variants.
Among those were two SNPs identified only by Illumina: one in the HTRA1 gene that has been linked to an increased risk of age-related macular degeneration, and one in a telomerase gene that has been associated with aplastic anemia.
Of the roughly 800,000 indels called in total, about 390,000 were Illumina-specific and about 200,000 were Complete Genomics-specific, and many of these turned out to lie in repeat regions. Most of those that could be analyzed by Sanger sequencing were validated, suggesting that both platforms pick up true indels that the other one misses.
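The overlap figures reported above for SNVs and indels come down to set arithmetic on two call sets. The Python sketch below is an illustration only, not the Stanford group's pipeline: the `Variant` key (chromosome, position, reference allele, alternate allele), the `Overlap` type, the `compare_call_sets` helper, and the toy variants are all assumptions made for the example. The summary counts are the rounded numbers quoted in this article, with the shared indel count inferred as the total minus the two platform-specific counts.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


class Overlap(NamedTuple):
    """Counts of variants shared by two call sets or specific to one of them."""
    shared: int
    only_a: int
    only_b: int

    @property
    def total(self) -> int:
        # Size of the union of the two call sets.
        return self.shared + self.only_a + self.only_b

    @property
    def concordance(self) -> float:
        # Fraction of all distinct calls that both platforms made.
        return self.shared / self.total if self.total else 0.0


def compare_call_sets(a: Set[Variant], b: Set[Variant]) -> Overlap:
    """Compare two call sets by exact match on the variant key."""
    return Overlap(shared=len(a & b), only_a=len(a - b), only_b=len(b - a))


# Toy, made-up call sets to show the mechanics.
illumina_calls = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "A")}
cg_calls = {("chr1", 100, "A", "G"), ("chr2", 75, "G", "A"), ("chr3", 10, "T", "C")}
toy = compare_call_sets(illumina_calls, cg_calls)

# Rounded counts quoted in the article (a = Illumina, b = Complete Genomics).
snv = Overlap(shared=3_300_000, only_a=350_000, only_b=100_000)
# Shared indels inferred as 800,000 total minus the platform-specific counts.
indel = Overlap(shared=800_000 - 390_000 - 200_000, only_a=390_000, only_b=200_000)

print(f"toy concordance:   {toy.concordance:.0%}")
print(f"SNV concordance:   {snv.concordance:.1%}")
print(f"indel concordance: {indel.concordance:.1%}")
```

Because the inputs are rounded, the SNV result only approximates the "almost 90 percent" figure, and the indel result lands near the "about a quarter" reported. Real indel comparisons are harder than exact key matching suggests, since the same indel can be represented at different positions in repetitive sequence.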
Overall, the researchers found that the Illumina platform is more sensitive than Complete Genomics' approach, detecting more variants, but also generates more false-positive calls. Complete Genomics, on the other hand, is more accurate in its calls but also less sensitive, potentially missing some true variants.
According to Dipesh Risal, senior product manager for informatics at Illumina, since the data for the Stanford paper was generated, Illumina's chemistry, hardware, and software have all improved and the company has now optimized the balance between false positive and false negative calls, which "should improve results compared to what the paper has shown." He also pointed out that the Stanford group did not use the same analysis pipeline that Illumina uses for its own human whole-genome sequencing service.
Similarly, Complete Genomics CEO Cliff Reid said that the data from his firm used in the study are about 18 months old, and the biochemistry and analysis pipeline have "evolved considerably," resulting in smoother coverage, improved indel calling, and increased sensitivity. While the results show that the firm's platform is more accurate, he said, it "may take a more conservative approach than some others," for example in difficult-to-call regions of the genome.
According to Reid, sequencing the same genome on two platforms as a method to validate variants is "a reasonable strategy" but "may not be the best choice if resources are limited and doing so would reduce study sample size."
Taken together, the similarities between the two platforms are "quite strong," and the results "very similar," said Stanford's Clark. He added that Illumina's and Complete's human whole-genome sequencing services are "very competitive" in terms of price, about $3,000 to $4,000 per genome. "If you had to pick one, you are better off with whichever gives you the best deal and the best turnaround time," he said.
Given that each platform identified variants that the other one missed, the researchers recommend in their paper that "the best approach for comprehensive variant detection is to sequence genomes with both platforms if budget permits," estimating that each additional variant found by using the second platform costs between 2 and 6 cents, based on a total cost of $4,000 per genome.
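The per-variant cost estimate is a simple ratio: the price of the second genome divided by the number of additional variants it yields. The article does not say which variant counts the researchers used, so the Python sketch below only shows the arithmetic. Both function names are hypothetical, and the 100,000 figure is the rounded Complete Genomics-specific SNV count quoted earlier, used purely as an example input.

```python
def cents_per_additional_variant(genome_cost_usd: float, additional_variants: float) -> float:
    """Cost in US cents of each additional variant gained from a second genome."""
    if additional_variants <= 0:
        raise ValueError("additional_variants must be positive")
    return genome_cost_usd * 100 / additional_variants


def variants_implied(genome_cost_usd: float, cents_per_variant: float) -> float:
    """Number of additional variants implied by a given per-variant cost."""
    if cents_per_variant <= 0:
        raise ValueError("cents_per_variant must be positive")
    return genome_cost_usd * 100 / cents_per_variant


GENOME_COST_USD = 4_000  # total cost per genome cited in the article

# Example input: roughly 100,000 extra SNVs from a second platform.
example_cents = cents_per_additional_variant(GENOME_COST_USD, 100_000)

# Back-calculation: additional variant counts implied by the 2 to 6 cent range.
high_count = variants_implied(GENOME_COST_USD, 2)
low_count = variants_implied(GENOME_COST_USD, 6)

print(f"{example_cents:.1f} cents per variant for 100,000 additional variants")
print(f"2 to 6 cents implies about {low_count:,.0f} to {high_count:,.0f} additional variants")
```

Read this way, the quoted range corresponds to somewhere between roughly 67,000 and 200,000 additional variants at $4,000 per genome, though that is an inference from the arithmetic rather than a figure from the paper.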
Alternatively, exome sequencing on top of whole-genome sequencing can bolster accuracy and fill in gaps, at least in the coding regions of the genome, a finding the group made earlier this year in a comparison of exome sequencing methods (IS 9/27/2011). Exome sequencing is still about one-tenth the cost of whole-genome sequencing, "so adding that is not quite as big an investment, and the yield is significant and it's good validation," Clark said, especially if Complete Genomics data is supplemented with Illumina exome sequencing data.
Also, additional sequencing platforms might contribute further variants. For example, Clark and colleagues have data generated on the SOLiD platform for the same individual that they intend to analyze. "Since [these technologies] are [based on] different biochemical assays, they provide opportunities to yield variants that you are going to miss with a different technology," he said.
Wolf said that the German Cancer Research Center has used a combined SOLiD and Illumina sequencing approach for past projects, but did not elaborate on the results. For its PedBrain project, the institute mostly uses the Illumina HiSeq platform at the moment (IS 12/7/2011).
According to Clark, the Snyder group currently makes use of both Illumina's and Complete Genomics' platforms for whole-genome sequencing studies, but usually only one platform per project. For example, the group recently ordered 500 genomes from the Illumina Genome Network sequencing service (IS 10/25/2011).
"We haven't picked one [technology] because we did find that both of them look quite good," Clark said. And despite the fact that using two technologies provides a more comprehensive picture, the total cost of doing so is still too high for large research projects.
He conceded that crucial variants may be missed by not combining different platforms, but said he is hopeful that improvements in variant calling will help each platform detect additional variants.
Have topics you'd like to see covered in Clinical Sequencing News? Contact the editor at jkarow [at] genomeweb [.] com.