This article was originally published July 3.
Depths of coverage used for many exome sequencing studies may miss a significant proportion of single nucleotide variants, a new study suggests, particularly those in the heterozygous state.
As they reported recently in BMC Bioinformatics, researchers from the University of Edinburgh and the Wellcome Trust Sanger Institute used data from dozens of already-sequenced exomes to look at how well they could detect SNVs in those protein-coding sequences, in relation to both the average sequence depth across the exomes and the depth of sequence at any given nucleotide.
As it turned out, the latter measure (per-nucleotide sequencing depth, which reflects how evenly reads are distributed across the exome) was especially useful for estimating the sensitivity of SNV detection, they reported.
"What we found to be actually the most important measure for detecting whether you can score variants at a particular site or not is the coverage at that site," corresponding author Martin Taylor, a researcher with the University of Edinburgh's MRC Human Genetics Unit, told In Sequence.
In a situation where the average on-target depth of coverage is 30-fold, he explained, it's possible that half of the sequence is only minimally represented, while the other half is covered at a depth of 60-fold or greater. On the other hand, sequencing to a relatively uniform depth of 30x exome-wide would likely pick up most SNVs present across the coding sequence.
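The contrast Taylor describes can be illustrated with a small Python sketch. The two depth profiles and the 13-fold cutoff used here are made-up illustrative values, and `coverage_summary` is a hypothetical helper, not part of the study's software.

```python
from statistics import mean

def coverage_summary(depths, min_depth):
    """Return (mean depth, fraction of sites with depth >= min_depth)."""
    covered = sum(1 for d in depths if d >= min_depth)
    return mean(depths), covered / len(depths)

# Two hypothetical 10-site targets that share the same 30x average depth.
uneven = [2, 2, 2, 2, 2, 58, 58, 58, 58, 58]   # half barely covered, half deep
uniform = [30] * 10                            # evenly covered

# The cutoff of 13 is an illustrative per-site threshold.
print(coverage_summary(uneven, min_depth=13))
print(coverage_summary(uniform, min_depth=13))
```

Both profiles report the same average, but only half of the sites in the uneven profile clear the cutoff, which is why the average alone says little about where variants can be called.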
The team's analysis hints that such per-nucleotide coverage is especially crucial in detecting heterozygous SNVs, where one allele matches the reference and one does not. Whereas an on-target depth of 20-fold exome coverage typically misses somewhere between 1 percent and 4 percent of homozygous SNVs, the group reported, the proportion of missed variants can jump to as much as 15 percent for heterozygous SNVs in exomes sequenced to the same depth.
That may be problematic depending on the research or clinical question at hand and the location of the most pertinent SNVs. So to help others recognize and address such issues in their own data, Taylor and his colleagues came up with software designed to determine SNV detection sensitivity for sequences of interest in a given dataset.
"We've provided the software with the publication to actually score where you're missing [SNVs] and where you're under power," Taylor said.
"So if you have particular genes of interest or particular genes which have shown up already in your study, you can then go and see if you've got good sensitivity in all of your patients or in all of the cases to detect the variant if it's there or not," he said.
Much of the past research on single nucleotide variants in exome and genome sequence data has focused on variants present in a given dataset, with less consideration given to those that aren't reliably identified.
Such missed variants may make it tricky for those trying to decipher exome sequence data in the clinical context and for disease studies. But they can be problematic when analyzing exomes for other types of studies, too, Taylor noted.
In his group's evolutionary biology studies, which focus on signs of selection in the genome, for instance, incomplete polymorphism profiles can lead to inaccurate conclusions about which parts of the genome have unusual polymorphism patterns or rates of genetic variation.
Taylor noted that SNVs may get overlooked due to bias during the DNA capture step of exome sequencing — which can lead to preferential capture of the allele present in the human reference sequence over a non-reference allele — or during the SNV calling step once reads are mapped back to the reference.
In a study published in Genome Research in 2011, for instance, researchers from the National Human Genome Research Institute showed that variants in nearly one-third of the exome may be missed when using average sequencing depths of around 30-fold (CSN 8/24/2011).
With that in mind, Taylor and his colleagues set out to explore the relationship between SNV detection sensitivity and exome sequence depth, using data for 30 deeply sequenced and well-characterized exomes.
The group's analysis focused on a set of verified variants found using the sequence data itself and through genotyping done for the HapMap project — additional validation aimed at ensuring that the SNVs being considered actually segregate with reasonable frequency in the population.
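With a verified variant set in hand, detection sensitivity reduces to a simple ratio: the share of verified variants that a given call set recovers. The sketch below assumes variants are stored as (chromosome, position, alternate allele) tuples; the function name and all variant values are hypothetical and are not taken from the study.

```python
def detection_sensitivity(verified, called):
    """Fraction of verified variants that also appear in the call set."""
    if not verified:
        raise ValueError("verified variant set is empty")
    return len(verified & called) / len(verified)

# Made-up variants keyed as (chromosome, position, alternate allele).
verified = {
    ("chr1", 1000, "A"),
    ("chr1", 2000, "T"),
    ("chr2", 500, "G"),
    ("chr2", 900, "C"),
}
# A hypothetical call set that misses one verified variant and adds one extra call.
called = {
    ("chr1", 1000, "A"),
    ("chr2", 500, "G"),
    ("chr2", 900, "C"),
    ("chr3", 42, "T"),
}

print(detection_sensitivity(verified, called))  # 0.75
```

Calls outside the verified set do not affect the result, since sensitivity only asks how many known variants were found.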
After identifying this "gold standard" variant set, the researchers down-sampled the available data for each exome, using a randomly selected subset of the available reads to explore how varying the available sequence depth affects SNV detection.
That approach is not only more cost-effective than re-sequencing exomes at various depths, Taylor noted, but it also allowed for detailed comparisons between the SNVs turning up in the down-sampled data and the actual variants present in the coding sequences.
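The idea of drawing a random subset of reads can be sketched in a few lines of Python. This is a minimal illustration that treats reads as items in a list; the function name, the fixed seed, and the read labels are assumptions made for the example, and the study's own procedure may differ.

```python
import random

def downsample_reads(reads, fraction, seed=0):
    """Return a random subset holding `fraction` of the reads, in original order."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)              # fixed seed keeps the subset reproducible
    k = round(len(reads) * fraction)       # number of reads to keep
    keep = sorted(rng.sample(range(len(reads)), k))
    return [reads[i] for i in keep]

# Hypothetical read identifiers standing in for aligned reads.
reads = [f"read_{i}" for i in range(1000)]
subset = downsample_reads(reads, fraction=0.25)
print(len(subset))  # 250
```

Repeating the draw at several fractions yields a series of shallower datasets from a single deep one, each of which can be run through the same variant-calling steps.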
"The key part of the analysis was to define the gold standard reference set of variants — variants which we knew were segregating in the population and had been genotyped by independent platforms," he said. "We had very high confidence that we could both see them in the full, deep alignments and that they genuinely were variants."
Results of the analysis suggested that it's the heterozygous SNVs that are most apt to get missed as a result of insufficiently deep exome sequence coverage, since it takes more reads to confidently call an allele that does not match the one present in the reference sequence.
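A rough way to see why depth matters more at heterozygous sites is a simple binomial model, in which each read at the site carries the non-reference allele with probability 0.5. This is a simplifying assumption for illustration only: it ignores sequencing error, base quality, and capture bias, it is not how the variant callers in the study work, and the requirement of three supporting reads is a hypothetical cutoff.

```python
from math import comb

def prob_alt_allele_seen(depth, min_alt_reads, alt_fraction=0.5):
    """P(at least min_alt_reads of `depth` reads carry the non-reference allele)."""
    p_too_few = sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few

# Hypothetical rule: at least 3 supporting reads are needed to call the allele.
for depth in (5, 10, 20, 30):
    print(depth, round(prob_alt_allele_seen(depth, min_alt_reads=3), 4))
```

Under this toy model the chance of sampling enough non-reference reads climbs quickly with per-site depth, whereas a homozygous non-reference site shows the variant allele on essentially every read.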
"By far and away the most common scenario is whereby the variant caller is not confident that there's a non-reference allele there and is not calling it. So you're missing the non-reference allele in the heterozygous state," Taylor said.
In general, he and his colleagues found that some 7 to 13 percent of heterozygous SNVs are prone to being erroneously called as homozygous for either the reference or non-reference allele at a given site represented by 10-fold read depth.
Moreover, they noted, even with an average on-target sequencing depth of 20-fold across the complete exome, anywhere from 5 to 15 percent of heterozygous SNVs could get overlooked, depending on the site considered, coverage uniformity, and so on.
Some homozygous SNVs get missed too, but those variants go unnoticed less frequently, Taylor noted, because it generally takes only a few high-confidence reads to verify the presence of matching alleles at a given position. At the 20-fold average coverage threshold, for example, the group estimated that some 1 to 4 percent of homozygous SNVs elude detection.
"If there's a sequencing error or low-quality reads that make the caller suspect that there might be a reference allele there as well, sometimes a homozygous gets called as a false-heterozygous," Taylor said. "But these are relatively rare."
Consequently, the group predicted that those interested in finding at least 95 percent of the heterozygous SNVs overall would need at least 13-fold read depth at the site being considered, while those interested in seeing a similar proportion of homozygous variants would need just 3-fold coverage at a given site.
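Those per-site figures lend themselves to a quick check of one's own data. The sketch below flags positions whose depth falls short of the 13-fold and 3-fold thresholds quoted above; the function name and the per-site depths are hypothetical, and this is not the authors' software, which estimates sensitivity rather than applying a hard cutoff.

```python
HET_MIN_DEPTH = 13  # per-site depth cited for finding ~95% of heterozygous SNVs
HOM_MIN_DEPTH = 3   # per-site depth cited for finding ~95% of homozygous SNVs

def underpowered_sites(depth_by_position, min_depth):
    """Return the sorted positions whose read depth is below min_depth."""
    return sorted(pos for pos, depth in depth_by_position.items() if depth < min_depth)

# Made-up per-site depths for a short region of interest.
region = {101: 40, 102: 12, 103: 2, 104: 13, 105: 0}

print(underpowered_sites(region, HET_MIN_DEPTH))  # [102, 103, 105]
print(underpowered_sites(region, HOM_MIN_DEPTH))  # [103, 105]
```

In this example, position 102 has enough reads for a homozygous call but falls just short of the depth cited for heterozygous variants.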
Even higher overall sequencing depths are likely needed to achieve that level of sensitivity across the exome, the researchers noted, again depending on how uniformly distributed the sequence reads are across the exome.
Somewhat unexpectedly, the analysis indicated that the sensitivity for seeing SNVs actually jumps up a little at lower sequencing depths when SNVs are nestled in more complicated stretches of sequence — a finding that the study's authors suspect may stem from an over-representation of strongly hybridizing and easy-to-capture guanine and cytosine nucleotides in more complex portions of the exome.
"The higher stickiness of the GC sequences could partially compensate, is our speculation," Taylor said, though he emphasized that the team has not yet formally tested that notion.
The analysis was not designed to compare the sensitivity for calling SNVs from sequences generated with different sequencing platforms, since all of the exomes considered had been sequenced with the Illumina GAII instrument. Nor was it intended to provide any sort of head-to-head comparison of sequence capture approaches.
Nevertheless, because four capture methods had been used to sequence the exomes at hand for their analysis, the researchers did get hints about differences in coverage uniformity associated with each.
For instance, Taylor noted that the exomes prepared with custom array-based capture techniques appeared to have somewhat more uniform coverage than those sequenced in conjunction with solution-based capture kits from NimbleGen or Agilent. Between the two solution capture methods, meanwhile, NimbleGen capture tended to coincide with more uniform exome coverage than the Agilent kit.
Overall, findings from the study suggest that researchers do need to keep sequence depth over a specific gene or set of sequences in mind when looking for disease-associated variants or other SNVs of interest.
Even so, those involved in the new analysis noted that the depth of coverage needed might not be extreme, as long as researchers are aware of the places in the sequence that are prone to missed variants.
"We are not advocating the use of an excessively deep threshold to call polymorphisms," Taylor and his co-authors noted. "[I]t makes sense to maximally use the available sequence information in an attempt to call variants even in regions of low sequence coverage."
"However," they added, "it is important to quantify how likely a polymorphism is to remain undetected."
For instance, they are currently advising researchers at their own institute to aim for around 60-fold on-target coverage, on average, for clinical or other exome sequencing studies, Taylor said. At that depth "you're generally doing pretty well," he said, though he cautioned that the precise depth of sequence needed is liable to vary depending on the nature of the study and portions of the genome considered.
To help others get a better sense of SNV detection sensitivity in their own sequences of interest, the team has developed software for determining SNV detection sensitivity across selected stretches of sequence — from a whole genome or exome down to the individual nucleotide level.
"You can arbitrarily define whatever your focal interest is and say, 'What's my sensitivity here?'" Taylor said.
The software, which he and his co-authors are making available to other researchers online, can be used for analyzing newly sequenced exomes or applied retrospectively to existing sequence data.
"You would apply the software, essentially to the end of your variant calling analysis," Taylor said. "So it's a retrospective analysis of what you've actually got and where you have got power and where you haven't got power."
For their part, Taylor and his colleagues are gearing up to do the same type of SNV detection analysis on whole-genome sequences to see whether there are portions of the genome where SNVs are more difficult to detect and, if so, where those regions are.
"Obviously exome sequencing has been quite widely used," Taylor said. "But more and more people are going down the whole-genome road, where you haven't got biases introduced by capture platforms but you have got more variation in nucleotide composition of the genome, for example."
"We're wondering how these things are affecting the same sort of properties," he said, noting that it remains to be seen, for instance, whether per-nucleotide depths have as much influence on SNV detection sensitivity in the absence of an exome capture step.