By Julia Karow
Exome sequencing picks up additional disease-related and other variants that a typical whole-genome sequencing experiment of the same sample does not, according to a recent study by a team of researchers at Stanford University, suggesting a continued role for targeted sequencing despite the falling prices for WGS.
The study, published online in Nature Biotechnology on Sunday, also compared the performance of three major exome enrichment platforms — from Agilent, Roche/NimbleGen, and Illumina — and found that they differ in content, how efficiently they capture their targets, and how many variants they detect.
Many research groups are currently using exome sequencing, which has become especially popular to study rare Mendelian diseases because it is more affordable than whole-genome sequencing and thus allows one to analyze more samples, said Mike Snyder, director of the Center for Genomics and Personalized Medicine at Stanford University and the senior author of the study. "Most people are working with a fixed budget; they are trying to get as many samples [done] as they can for their project."
However, no independent systematic comparison of the three main exome enrichment technologies — Agilent's SureSelect Human All Exon 50Mb, Roche/NimbleGen's SeqCap EZ Exome Library 2.0, and Illumina's TruSeq Exome Enrichment — had been published, so "we felt we needed to evaluate these in order to be able to make informed decisions about which ones to go forward with for [our] studies," Snyder said. "Given that all of this is incredibly expensive, it really makes sense to evaluate the performance of these platforms."
His team also decided to compare exome sequencing with a typical whole-genome sequencing experiment of the same sample, with somewhat surprising results.
Exome vs. WGS
For their comparison of whole-genome sequencing and exome sequencing, the Stanford researchers sequenced a blood sample from a healthy volunteer to 35x genome-wide coverage, generating about 1.2 billion 100-base paired-end reads on the Illumina HiSeq 2000.
For the same sample, they generated 50 million reads after enrichment with each of the exome technologies, resulting in 30x mean target coverage for TruSeq Exome, 60x for SureSelect, and 68x for NimbleGen.
They then called single-nucleotide variants from both the WGS and the exome data sets, using the same cut-offs and filters. For their analysis, they focused only on those regions targeted by the exome-enrichment platforms.
For each platform, they found between 650 and 4,600 SNVs that were detected in the WGS data but not in the exome data, and between 2,600 and 3,100 SNVs that were called only from the exome data.
WGS-specific SNVs often had no coverage at all in the exome data, suggesting that the enrichment of these regions failed.
Exome sequencing-specific variants, on the other hand, were usually present in the WGS reads, but not at sufficient read depth to be called with confidence. That was true even for the Illumina exome data, which had only 30x mean target coverage, similar to the 35x coverage of WGS.
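The comparison described above (restrict both call sets to the targeted regions, then sort the calls into shared, WGS-only, and exome-only) can be illustrated with a minimal Python sketch. It is not the Stanford team's pipeline: the `SNV` record, the `in_targets` and `compare_calls` helpers, and the toy coordinates are all hypothetical and exist only to show the set logic.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class SNV(NamedTuple):
    """A single-nucleotide variant call (hypothetical record layout)."""
    chrom: str
    pos: int  # 1-based position
    ref: str
    alt: str


# Hypothetical target layout: chromosome -> list of 1-based inclusive intervals.
Targets = Dict[str, List[Tuple[int, int]]]


def in_targets(snv: SNV, targets: Targets) -> bool:
    """Return True if the variant falls inside any targeted interval."""
    return any(start <= snv.pos <= end
               for start, end in targets.get(snv.chrom, []))


def compare_calls(wgs_calls: Iterable[SNV],
                  exome_calls: Iterable[SNV],
                  targets: Targets) -> Dict[str, Set[SNV]]:
    """Split calls within the targets into shared, WGS-only, and exome-only."""
    wgs = {s for s in wgs_calls if in_targets(s, targets)}
    exome = {s for s in exome_calls if in_targets(s, targets)}
    return {
        "shared": wgs & exome,
        "wgs_only": wgs - exome,
        "exome_only": exome - wgs,
    }


# Toy data (invented for illustration, not from the study).
targets = {"chr1": [(100, 200)]}
wgs_calls = [
    SNV("chr1", 150, "A", "G"),
    SNV("chr1", 180, "C", "T"),
    SNV("chr1", 500, "G", "A"),  # outside the targets, so it is ignored
]
exome_calls = [
    SNV("chr1", 150, "A", "G"),
    SNV("chr1", 120, "T", "C"),
]

result = compare_calls(wgs_calls, exome_calls, targets)
for label, calls in result.items():
    print(label, sorted(calls))
```

In a real analysis the interval lookup would typically use an indexed structure rather than a linear scan, and both call sets would first have to pass the same cut-offs and filters, as the researchers describe.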
"The advantage of exome sequencing is, you're just doing really deep sequencing on a portion of the genome … and therefore, you get better coverage of exomes than whole-genome sequencing does," Snyder said.
Interestingly, about 300 SNVs that were identified by all three of the exome sequencing platforms but not by WGS are associated with human diseases, suggesting that exome sequencing can pick up variants with clinical relevance that WGS alone would miss. The article did not say, though, how many WGS-specific variants are disease-associated and thus would not be found by exome sequencing alone.
"It was definitely surprising to me that the exome [sequencing] was finding information that the genome [sequencing] did not pick up," said Snyder. "Some of these are important regions — you can't just blow these off."
Given these results, it might make sense to do both WGS and exome sequencing "to make sure you are really covering your exome variants," he said. "If you can afford it, that's a good thing to do since you will get extra information from your exome that you would not have gotten from the genome."
Comparing Capture Platforms
All three exome-enrichment platforms differ in content: NimbleGen has the smallest overall target size — 44 megabases — and covers a greater portion of miRNAs than the others; Agilent covers 51.5 megabases and has better coverage of Ensembl genes than the others; and Illumina covers 62 megabases and includes many more untranslated regions than its competitors. The three platforms have about 30 megabases of targets in common, almost all mRNA coding exons.
They also differ in how they cover their target regions with baits. While NimbleGen uses overlapping baits that cover each base several times, Agilent's baits sit next to each other, and Illumina "relies on paired-end reads to extend outside the bait sequences and fill in the gaps," according to the paper. Agilent uses RNA baits, whereas the other two use DNA baits. The enrichment procedure takes about 3.5 days for Agilent and Illumina but twice as long for NimbleGen.
To test the three platforms' enrichment efficiencies, the Stanford team compared data sets of 80 million mapped reads for each platform, generated on the Illumina HiSeq. They also compared target coverage at lower read counts, and at different cut-off depths.
At 80 million reads, for example, 97 percent of targets were covered more than 10-fold by NimbleGen, but only 90 percent by Illumina and Agilent. NimbleGen covered almost 99 percent of its targets at least once, while Illumina and Agilent covered about 97 percent.
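Figures like these express the fraction of targeted bases that reach a given read depth. As a rough illustration, the Python sketch below computes that fraction from a list of per-base depths. The `fraction_covered` helper and the depth values are hypothetical, and treating the cut-off as inclusive (depth at or above the threshold) is an assumption, not necessarily the paper's convention.

```python
from typing import Sequence


def fraction_covered(depths: Sequence[int], min_depth: int) -> float:
    """Fraction of targeted bases with read depth >= min_depth.

    `depths` holds one read-depth value per targeted base. The inclusive
    threshold is an assumption made for this sketch.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Toy per-base depths for ten targeted bases (invented for illustration).
depths = [0, 3, 12, 25, 40, 9, 15, 0, 30, 11]

mean_depth = sum(depths) / len(depths)
print("mean target coverage:", mean_depth)
print("covered at least once:", fraction_covered(depths, 1))
print("covered at 10x or more:", fraction_covered(depths, 10))
```

The toy numbers also show why mean coverage alone can mislead: a respectable mean can coexist with bases that have no reads at all, which is the situation the article describes for regions where enrichment failed.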
Overall, NimbleGen enriched a higher percentage of its targeted bases than the other two, while Illumina and Agilent enriched a higher total number of bases at higher read counts. That, the authors wrote, is a function of the different designs of the platforms. For NimbleGen, "a higher-density design, targeting a smaller genomic interval, results in higher efficiency," while for Agilent and Illumina, "lower-density designs can capture a greater number of bases but require substantially larger amounts of sequencing."
The researchers also found that about a third of the reads mapped to off-target regions for Illumina, but only 13 percent of reads for Agilent and 9 percent for NimbleGen.
All three platforms covered regions with high GC content less well than other regions. Agilent, however, was better than the other two at covering regions of low GC content, probably because of its longer baits, its RNA probes, or its lower number of PCR cycles. According to Agilent, its optimized library preparation kits and its use of Herculase II polymerase also contribute to the better performance.
Illumina's platform detected the largest total number of single-nucleotide variants — 53,000 — while Agilent detected 50,600 and NimbleGen 47,000. However, in regions shared between the three platforms, NimbleGen "consistently captured the most SNVs and became saturated with the lowest number of reads, followed by Agilent and then Illumina, indicating a correlation between bait density and sensitivity to SNV detection," the authors wrote.
Pricing for all three platforms is "highly negotiable with the vendors," according to the authors, ranging from less than $400 to more than $1,000 per reaction. Snyder said that initially, NimbleGen was cheaper than Agilent, which cost less than Illumina, but that pricing differences are now "getting more subtle" as the companies compete with each other. Prices overall are approaching "more like" $400 per sample now, he said.
Which platform is best suited for a particular project depends on what regions a scientist is interested in, the researchers concluded.
Since Illumina's is "the only platform that is designed to enrich UTRs, which are almost completely untargeted by the other two platforms," it is the platform of choice for researchers interested in those regions, for example.
For the RefSeq exome, "Nimblegen has a slight edge in sensitivity for SNPs and small indels," they wrote, but for Ensembl CDS regions, Agilent's kit "can detect the most SNPs and small indels given slightly more sequencing."
All platforms are able to detect disease-associated variants, of which "a small proportion are unique to each platform."
Snyder said the comparison also showed that the NimbleGen design has advantages for custom sequencing projects, for example to double-check variants found by whole-genome sequencing, because its overlapping probes provide the best coverage.
His own lab currently uses both NimbleGen's and Agilent's enrichment platforms. "We think they are both quite good, and we probably have a modest preference towards Agilent right now," he said.
Vendors Weigh In
According to Marilou Wijdicks, Roche NimbleGen's international product manager for research, the SeqCap EZ Exome v2.0 kit is "a very cost-effective tool" for researchers focusing on RefSeq regions only.
However, the company will "very shortly" be launching a more comprehensive exome capture kit, called SeqCap EZ Exome v3.0, that will target more than 64 megabases of exons and miRNAs. It will include additional gene annotations, for example from Ensembl and VEGA, as well as the latest updates from RefSeq, CCDS, and miRBase. While the new kit will require "slightly more sequencing," it will "retain all the benefits of the high-performance v2.0 design that was used in this study," she said.
Unlike Illumina, NimbleGen has not included UTR regions in its exome capture designs, Wijdicks said, because they are only one of several classes of potentially important non-coding regions, which add up to 200 megabases. Because "there isn't a consensus among researchers with respect to what non-coding content should be included on a 'stock' exome design," NimbleGen designs custom capture kits according to researchers' specifications instead. "Using this customized approach, researchers can optimize their experiments to target only the regions they feel are important," she said.
"Looking to the future, our technology will allow us to continue to expand the size of our target regions based on what researchers require, even up to over 100 Mb in a single design," she added.
According to Illumina's senior director of applications marketing, Peter Fromen, the platform comparison is "not entirely" fair because Illumina has since launched a gel-free TruSeq DNA sample prep protocol that "results in a lower number of duplicates and leads to better sample uniformity." Most studies also now use the latest versions of Illumina's TruSeq PE and Cluster Kits, which he said further improve the efficiency of sequencing to deeper coverage.
Fromen also said that to assess the effectiveness of target capture, the study looked at the probe files of the two competitors but the target region file for the Illumina kit, which he said is "an apples-to-oranges comparison that causes our competitors to look slightly better in this single metric."
Illumina's TruSeq Exome Enrichment kit deliberately targets non-coding regions, in addition to coding regions, he said, because genome-wide association studies have found those regions to be important to explain the underlying biology. Fromen said that the increased amount of sequence needed for Illumina's kit "is marginal and the benefits worthwhile" in the end.
Regarding cost, Illumina's kit allows customers to pool six samples in a single enrichment reaction, he pointed out, which brings down the enrichment costs per sample.
Because the three platforms have such different content, there is "no simple way to easily compare" their designs, according to Fred Ernani, Agilent's marketing director for the SureSelect NGS Platform.
Agilent's own kit may not cover all the targets that others do, such as Illumina's UTRs, but "Agilent chose to go this route to satisfy the majority of our customer’s desires for minimizing sequencing costs," he said. Both the Wellcome Trust Sanger Institute and the Broad Institute helped design the SureSelect Human All Exon 50 Mb kit, because these experts "should know better than us which content is the most relevant to discovery studies."
The fact that Agilent's kit targets 1.4 megabases of coding regions from Ensembl CDS that the other two don't is "of particular relevance to those researchers looking to uncover the cause of Mendelian disorders," he said, since most of these are thought to be mutations in coding regions.
Ernani said the company continues to work with its collaborators "to enhance both the content and the performance" of its exome platform.
Have topics you'd like to see covered in In Sequence? Contact the editor at jkarow [at] genomeweb [.] com.