Exonic regions outside the Consensus Coding DNA Sequence model may be difficult to capture, map, and sequence, but they also contain important functional elements currently missed by many exome sequencing disease studies, according to a new study by Baylor College of Medicine researchers.
CCDS exons, which are carefully curated by an international consortium, are frequently used to design capture reagents. But because these exons are defined by conservative criteria, such designs miss many genes found in other databases, such as RefSeq or VEGA (the Vertebrate Genome Annotation Database), as well as computationally predicted exons.
In an effort to evaluate the efficacy of sequencing and interrogation of variants across a much broader swath of the genome, researchers led by Baylor's Human Genome Sequencing Center recently examined capture data from outside the CCDS.
In their study, published last month in Genome Biology, the authors reported that higher GC content outside of CCDS regions reduced local capture sequence coverage to less than 50 percent of that seen in CCDS regions — an effect "due to biases inherent in both the Illumina and SOLiD sequencing platforms that are exacerbated by the capture process." However, variant density in areas outside the CCDS — especially for computationally predicted exons — was much higher, suggesting that these regions may be important to include in disease studies.
Matthew Bainbridge, the study’s first author, told In Sequence this week that the team wanted to evaluate the possibility of looking outside traditional coding regions. "We started pushing out and adding other gene models … the VEGA [model], in particular, is what we added to this in addition to RefSeq and CCDS," he said. "And we're also interested in looking beyond coding regions [into] regulatory regions, [as well as] computationally predicted exons within known genes."
Bainbridge said the group decided to evaluate regions outside the CCDS because "no gene models are perfect" — especially when it comes to disease studies.
The addition of computationally predicted exons, he said, seemed particularly important. "We have chips now that target eight gene collections. So if you go out and you get RefSeq and UCSC [genes], and you get everybody's gene collection, it's still possible that they’ve missed real exons that are rarely expressed within known genes. The idea is that … just because they aren’t in a gene model doesn’t mean they aren’t real."
Bainbridge noted that many exome sequencing studies fail to find genes associated with disease, possibly because these studies miss important regions of the genome. "We think that we should really be pushing into much more inclusive models going after many, many, many more targets and a much larger region," he said.
'Capture-ability' and Variant Density
The Baylor group created two new capture reagents to expand the total area that they could target. The first reagent, VCR-set, targeted the miRNA, VEGA, RefSeq, and CCDS gene models with a total target size of 42 Mbp. The second design, REC-set (Regulome, Exons, Conserved elements), added conserved untranslated regions, regulatory regions, and areas that were computationally predicted to be exonic.
These reagents, they wrote, allowed them to "determine the relative 'capture-ability' of subregions of the genome compared to the CCDS." The group defined capture-ability of these subregions as their average coverage relative to the average sequence coverage of the CCDS.
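The definition above amounts to a simple ratio, which can be sketched in a few lines of Python. This is only an illustration: the function name, the choice to average per-base depths, and the depth values are all assumptions, not details taken from the study.

```python
from statistics import mean


def capture_ability(region_depths, ccds_depths):
    """Mean coverage of a region divided by mean coverage of the CCDS targets.

    Assumption: coverage is averaged over per-base depths; the article does
    not say how the study aggregated coverage.
    """
    if not region_depths or not ccds_depths:
        raise ValueError("both depth lists must be non-empty")
    ccds_mean = mean(ccds_depths)
    if ccds_mean == 0:
        raise ValueError("CCDS mean coverage is zero")
    return mean(region_depths) / ccds_mean


# Made-up per-base depths for illustration only (not data from the study).
ccds_depths = [40, 50, 60, 50]
utr_depths = [20, 30, 25, 25]
print(f"{capture_ability(utr_depths, ccds_depths):.2f}")
```

A value below 1 would indicate a region that is covered less deeply than the CCDS average, and a value above 1 a region that is covered more deeply.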
The researchers used their new reagents to capture sequences from a cell line and several human samples. In total, they aligned more than 54 Gbp of capture sequence data derived from seven libraries and five DNA samples to the human reference genome, using both the Illumina and SOLiD sequencing platforms.
Overall, the study found that regions outside the CCDS were "almost uniformly" less capture-able. The CCDS regions had 10 percent to 15 percent higher average coverage than the REC-set target regions as a whole, the authors wrote. Computationally predicted exons were more capture-able than the CCDS average, while conserved UTR and regulome regions performed worse.
The Illumina and SOLiD platforms showed similar biases, they wrote, and capture success appeared to be confounded by technology biases associated with the GC content of the target sequences, a known issue in short-read sequencing. Untranslated regions, which are approximately 30 percent GC, and regulatory regions, around 70 percent GC, had approximately half the coverage depth of the CCDS regions, which are about 50 percent GC.
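For readers unfamiliar with the measure, GC content is the share of bases in a sequence that are G or C. A minimal Python sketch follows; the function name, the handling of ambiguous bases, and the example sequences are assumptions made for illustration and do not come from the study.

```python
def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C.

    Assumption: ambiguous bases such as N are ignored rather than counted.
    """
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return sum(base in "GC" for base in called) / len(called)


# Toy sequences for illustration only (not real targets from the study).
print(gc_fraction("ATATATGCAT"))  # low-GC example
print(gc_fraction("GGCCGCAT"))    # high-GC example
```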
"It's been known for a long time that sequencers tend to have problems with GC extremes," Bainbridge said. "Illumina is very bad at low GC, and SOLiD is very bad at high GC. But there is actually an additional impact because there [are] also problems creating the probes that capture these regions, so just generating the capture reagent probes seems to be problematic at GC extremes, which maybe isn't too surprising either."
Illumina sequencing "consistently showed higher variant density than SOLiD," the authors wrote, suggesting the discrepancy may reflect "inherently higher accuracy of SOLiD sequencing." However, across the specific regions the group targeted, variant densities were similar overall on the two platforms.
"One of the major findings," Bainbridge said, "is that there is a radical difference in the variant density within the various regions that we captured."
The group expected that the CCDS would show fewer variants, because it is a highly evolutionarily conserved region. But, as the study expanded out into RefSeq and other gene models, the researchers found more and more variants, he said.
"This has an impact on a lot of people's studies, because originally when people started doing this work, we thought we'd find maybe 80 rare variants in exomes. So if I sequenced your CCDS exome … I'd maybe find 80 or [fewer] rare variants," said Bainbridge.
"But if I sequence your UCSC exome, or your Ensembl exome, I'd find many, many, many rare variants, maybe 400 to 600. So this means I would have to do many more samples in order to filter through and find the interesting ones."
Because of this, Bainbridge said, capturing areas potentially relevant to disease in these wider fields will require more sequencing data to be generated — between 20 and 40 percent more, according to the group's report. "As we start capturing more and more material, we'll actually need bigger sample sizes or better controls in order to compensate for that and get down to those few variants that are causing disease," he said.
The predicted exons and regulatory regions exhibited more than twice the variant density of the CCDS exome, the authors wrote, suggesting that these regions are either more tolerant to variation, or that they may have increased mutation rates compared to the whole genome.
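A comparison of this kind can be sketched in Python as variants per kilobase of target, expressed relative to the CCDS. The article does not say how the study normalized variant counts, so the per-kilobase definition, the function names, and every number below are assumptions for illustration only.

```python
def variant_density(n_variants, target_bp):
    """Variants per kilobase of target (an assumed normalization)."""
    if target_bp <= 0:
        raise ValueError("target size must be positive")
    return n_variants / (target_bp / 1000)


def relative_density(region, ccds):
    """Density of a region divided by the density of the CCDS.

    Each argument is a (variant count, target size in bp) pair.
    """
    return variant_density(*region) / variant_density(*ccds)


# Made-up counts and target sizes for illustration only (not study data).
ccds = (300, 1_000_000)
predicted_exons = (150, 200_000)
print(round(relative_density(predicted_exons, ccds), 2))
```

Under this definition, a ratio above 2 would correspond to the "more than twice the variant density" pattern the authors describe.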
According to Bainbridge, there are a few ways researchers could improve their ability to capture regions beyond the CCDS using either SOLiD or Illumina technology. "One of the things you can do is sort of change the environment … you can do things that try to help out low-GC or high-GC regions," he said. "You can change the buffer; you can sometimes change the enzymes you use that are better for these GC extremes."
One of the ideas the Baylor team is exploring is a combined approach, starting with a standard capture using the CCDS, or RefSeq, or some other model, or even a whole genome, and then following "with a capture reagent that deliberately targets these GC extremes with a special reagent and a special buffer and special conditions in order to capture these regions and sequence them better," Bainbridge said.
In their paper, the researchers wrote that they believe this to be the first targeted-sequence capture study of a "genome-wide, diverse" set of elements allowing investigation of variant densities in locations that have been previously undetected, at a "fraction of the cost of whole genome sequencing."
Bainbridge said he thinks exome sequencing will remain broadly popular for several years, with niche applications lingering much longer, even as the cost of whole-genome sequencing drops.
"I think generally, I would guess two to four years, for most things, and in some specific applications I think a lot longer," he said.
The cost of exome sequencing "will always be ten percent of the cost for whole genome … so people will always like that." Furthermore, he said, a lot of data generated by whole-genome sequencing "just isn't that useful."
"Most people, especially if you look two years ago, even the genome centers, all their studies [concentrated] on coding mutations, which means they're throwing away 99 percent of their data," he said.
For particular applications like cancer, Bainbridge said, exome sequencing may long be a more attractive choice. "In cancer, you have to worry about polyploidy, you have to worry about contaminations, you have to worry that maybe seventy percent of your sequence is actually sequencing stroma, so you have to sequence a lot deeper," he said.
"The impact cost of actually doing the capture isn't that much in these cases. So in any case where you're sampling a cell population, I think you'll see that capture sticks around for a lot longer."
What is likely to happen, he said, is an increase in combined or hybrid approaches, where a "light" sequencing of the whole genome is followed with capture to give higher coverage of specific regions.
According to the group's study, there is evidence that these regions should expand beyond conservative models like the CCDS. "I would guess that between forty and sixty percent of Mendelian disease studies fail to find the causes of mutation. And that could either be because we're finding the mutations and we don’t know they are causative … But maybe it's in these regions we aren't capturing at all," Bainbridge said.
"If any particular region is genic, if it's being expressed, it can potentially impact disease, or human health in any sort of way, so I think … getting [these regions] is sort of critical, especially when we are looking at populations."
Bainbridge said the group has now been working to refine its regulatory area work, trying to target "much smaller transcription factor binding sites."
"We've actually done a lot of work on that, refining the gene models we use, as well as going after non-coding things where we still think we have interpretable biological significance," he said.