This article has been updated with a reference to a previous study from Stanford University.
Exome sequencing at high coverage is preferable to whole-genome sequencing at lower coverage for identifying pathogenic mutations in coding regions and resolving ambiguous results, according to researchers at Baylor College of Medicine.
In a study published online in Genome Medicine last month, the scientists sequenced the exome of Jim Lupski, a professor of molecular and human genetics at Baylor, using four different sequencing platforms. They compared the results to an analysis of his genome that they had published three years ago in the New England Journal of Medicine. That earlier study identified the molecular cause of Charcot-Marie-Tooth neuropathy, a genetic disease afflicting him and his family (CSN 3/16/2010).
Exome sequencing identified an additional gene variant that was missed in the original whole-genome analysis and that likely contributes to Lupski's disease phenotype. It also helped to resolve several ambiguous incidental findings of the original study.
As long as high-coverage whole-genome sequencing remains costly and time-consuming, exome sequencing, despite its shortcomings, is the method of choice for identifying disease-causing variants in coding regions, the researchers concluded.
"Our data suggest that the choice of [exome sequencing] for identification of fully penetrant critical SNV mutations in the coding regions of mammalian genomes may be regarded as superior, rather than a shortcut, or compromise, when compared with [whole-genome sequencing] approaches at depths of coverage below 100x," the authors wrote, noting that whole-genome sequencing "continues to be a more comprehensive next-generation sequencing approach to total variant detection in a personal genome if one seeks to capture the majority of the genome-wide variation including [copy number variants] and other structural variants in addition to [single-nucleotide variants]."
A study by scientists at Stanford University two years ago came to essentially the same conclusion: that exome sequencing picks up variants a typical whole-genome sequencing experiment misses (IS 9/27/2013).
"We still think that whole-genome [sequencing] gives you a lot more information, and eventually might be the best way to approach it, but as of right now, it's faster and easier to go for the exome," Claudia Gonzaga-Jauregui, a postdoctoral fellow in Lupski's lab and one of the study's authors, told Clinical Sequencing News.
One motivation for re-analyzing Lupski's DNA was to use the data to validate the exome sequencing pipeline of Baylor's Whole Genome Laboratory, which launched a diagnostic exome sequencing test in 2011 (CSN 12/5/2012). "We needed a genome that was already characterized," Gonzaga-Jauregui said, and one that had confirmed pathogenic variants.
The researchers sequenced Lupski's exome on four different platforms: Illumina's GAII and HiSeq 2000, and Life Technologies' Ion Torrent PGM and Ion Proton. In addition, they performed whole-genome sequencing using Illumina's HiSeq 2000. For the original study, they had sequenced his genome using Life Tech's SOLiD platform.
According to Gonzaga-Jauregui, the exome results "replicate pretty well among all the platforms," especially for single-nucleotide variants. Insertions and deletions are "still tricky" for all the platforms, she said, both because the sequencing chemistry might be less sensitive to indels and because some mapping algorithms handle indels better than others.
Indels have been found to be important in several diseases, and for new technologies in particular, "it's pretty important to really tune up their indel callers," she said. "That will have an impact, especially if you want to use [the results] for diagnostics."
She said the Illumina platforms, having been on the market the longest, provided good results, but the Ion Proton, which was launched more recently, also "has good potential." The Proton had slightly higher error rates for indels than the Illumina sequencers, while the Ion Torrent PGM showed higher error rates in general than the other platforms, and might be more suitable for more targeted sequencing projects. "I think the Proton has good options for the future, it just needs a little bit more of development," she said.
Baylor's diagnostic Whole Exome Sequencing test uses Illumina's HiSeq, mostly because that platform was already established when the test was introduced in 2011, she said.
Though exome sequencing will miss variants in regulatory regions or in coding regions that are not captured, and does not include structural variants, it is currently quicker than whole-genome sequencing. According to Gonzaga-Jauregui, it takes about one and a half to two weeks from start to finish to sequence an exome on a HiSeq 2000, compared to three to four weeks for a genome, although newer technologies, like the HiSeq 2500, are faster. The cost of an exome is still about one-tenth that of a genome, she said.
Importantly, because the exome was sequenced at higher coverage, it revealed an additional variant in the SH3TC2 gene in Lupski's DNA that whole-genome sequencing had missed.
When the researchers analyzed the whole-genome sequence data more closely, they found that the variant was actually present in some of the reads, but it was below the threshold of the variant caller. "That's also one of the things to be careful with, the thresholds you use," Gonzaga-Jauregui said. "Especially if you go for lower coverage, you also have to adjust your algorithms to be more sensitive. You may also have more false positives, but then you might not miss this kind of thing."
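To illustrate the trade-off Gonzaga-Jauregui describes, here is a minimal sketch of a threshold-based call at a single site. It is a toy filter and not the variant caller used in the study; the function name, the threshold values, and the read counts are all hypothetical, chosen only to show how a real variant can fall below a cutoff at low depth and be recovered when the cutoff is relaxed.

```python
def call_variant(alt_reads, total_reads, min_alt_reads=3, min_alt_fraction=0.2):
    """Toy site-level filter: require a minimum number of variant-supporting
    reads AND a minimum fraction of reads carrying the variant.
    Thresholds are illustrative assumptions, not values from the study."""
    if total_reads == 0:
        return False
    fraction = alt_reads / total_reads
    return alt_reads >= min_alt_reads and fraction >= min_alt_fraction


# Hypothetical site at low whole-genome depth: only 2 of 12 reads
# happen to carry the variant, so the default thresholds reject it.
low_cov = call_variant(alt_reads=2, total_reads=12)

# The same hypothetical site at high exome depth: 48 of 100 reads.
high_cov = call_variant(alt_reads=48, total_reads=100)

# Relaxing the thresholds recovers the low-coverage call, at the
# price of letting more sequencing errors through elsewhere.
relaxed = call_variant(2, 12, min_alt_reads=2, min_alt_fraction=0.15)

print(low_cov, high_cov, relaxed)
```

With these made-up numbers, the variant is present in the low-coverage reads but is only called at high depth or with the more sensitive settings, which is the pattern the researchers reported for the missed variant.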
The researchers believe that the third SH3TC2 variant contributes to Lupski's phenotype, and functional studies are underway to further characterize its effects.
In addition, the exome data helped to clarify ambiguous incidental findings from the original study, in which the data indicated that Lupski was homozygous for a Mendelian disease-causing mutation even though he did not show any symptoms of the disease.
In several cases, the exome sequence confirmed the whole-genome data and suggested that the database entry associating the variant with the disease might be wrong. "I think this is really important and a little bit worrying, especially if you trust these databases for doing your clinical interpretations," Gonzaga-Jauregui said.
In other cases, the exome data corrected false positive calls from the whole-genome data. For example, whole-genome data suggested Lupski was heterozygous for a disease-causing allele of a gene, but the exome data, which provided more reads, found that the variant caller had erroneously used reads that originated from a related pseudogene.
As a result of the study, Baylor decided to run its clinical exome sequencing test at high coverage, generating a mean coverage of 100x to 120x per exome, so 95 percent of the exome has at least 20x coverage. That way, the lab is "not sacrificing coverage for [savings in] cost or time," Gonzaga-Jauregui said. "For clinical exomes, it's way more important to have everything well covered."
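The two figures in that target measure different things: mean coverage is an average over all targeted bases, while the 20x criterion asks what fraction of bases reach a minimum depth. A short sketch of the arithmetic, assuming a made-up list of per-base depths (the function name and the depth values are hypothetical and do not come from Baylor's data):

```python
def coverage_summary(depths, min_depth=20):
    """Return (mean depth, fraction of bases with depth >= min_depth)
    for a list of per-base read depths."""
    if not depths:
        raise ValueError("depths must not be empty")
    n = len(depths)
    mean_depth = sum(depths) / n
    breadth = sum(1 for d in depths if d >= min_depth) / n
    return mean_depth, breadth


# Hypothetical 100-base target: most bases are deeply covered,
# but 5 poorly captured bases sit at only 5x.
depths = [150] * 70 + [60] * 25 + [5] * 5

mean_depth, breadth = coverage_summary(depths)
print(f"mean coverage: {mean_depth:.2f}x")            # 120.25x
print(f"fraction at >= 20x: {breadth:.0%}")           # 95%
```

In this toy example the mean is about 120x, yet 5 percent of bases still fall short of 20x, which is why a clinical test would track both numbers rather than the mean alone.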