By Julia Karow
This story was originally published on Aug. 18.
As human whole-genome sequencing transitions from research into clinical applications, better descriptions of the accuracy and completeness of a genome will be required, according to researchers at the National Human Genome Research Institute.
The NHGRI team, which recently assessed the accuracy and comprehensiveness of sequencing clinical samples on the Illumina platform, also found that the common practice of sequencing to about 30x average coverage is not enough to guarantee confident genotype calling across the majority of the genome. The researchers determined that sequencing to greater depth upfront and applying filters can increase accuracy and save validation costs later.
"The field has reached a level of maturity where we can get a better understanding about how much data you really need to get a certain level of comprehensiveness and accuracy," said Elliott Margulies, the paper's senior author. "Until this point, very few people have been reporting those types of metrics because we have been able to get the low-hanging fruit by sequencing a genome" at about 30x coverage.
Margulies was until recently an investigator in the genome technology branch at NHGRI and is now the director of sequencing applications at Illumina's Little Chesterford location in the UK.
"If we want to take [sequencing] to the next step and have this be clinically relevant, we need to get a much better understanding of accuracy, and sensitivity and specificity, [and] what you can do, and knowing what you can't do," said Margulies.
In order to assess and improve the SNP genotyping accuracy for a single genome, he and his colleagues sequenced a sample from the NIH Undiagnosed Diseases Program to about 126x average depth, using both the Illumina GAIIx and HiSeq 2000 platforms. They published their results online last month in Genome Research.
They split the data into two equal-sized datasets and looked for differences between the genotypes called in each set. Based on those differences, they developed filters that take into account the coverage at each site, rather than the average coverage across the genome, and use a higher confidence score at sites with higher coverage.
Those filters "really helped us increase the accuracy," Margulies said. "Once you apply these additional filters, you can have a very high specificity even at very low depth of coverage, it's just that you are calling a small proportion of the genome," he added. In other words, while the filters reduce the error rate, they require a higher depth of coverage to be able to genotype the same fraction of the genome — say, 50x instead of 30x.
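The idea of a depth-aware filter can be sketched as follows. This is a hypothetical illustration of the concept described above, not the actual filter from the Genome Research paper; the threshold function and all parameter values are invented for demonstration.

```python
# Illustrative coverage-aware genotype filter: instead of one genome-wide
# quality cutoff, the confidence required for a call scales with the depth
# observed at each individual site. All thresholds here are hypothetical.

def required_quality(depth: int) -> float:
    """Demand a higher (Phred-scaled) genotype quality at sites with more
    reads, where a confident call should be easy to achieve."""
    base, per_read = 20.0, 0.5   # illustrative parameters only
    return base + per_read * depth

def is_callable(depth: int, genotype_quality: float, min_depth: int = 4) -> bool:
    """A site counts as 'callable' only if it has minimal coverage and its
    genotype quality clears the depth-dependent bar."""
    if depth < min_depth:
        return False
    return genotype_quality >= required_quality(depth)
```

Under these made-up parameters, a site at 10x depth with genotype quality 30 would pass (the bar is 25), while the same quality at 50x would fail (the bar is 45) — capturing the intuition that a mediocre score at high coverage is itself a warning sign.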
Using the filters, the researchers assessed what percentage of the genome they were able to genotype accurately — or the "callable" fraction of the genome — given a certain level of coverage, from 5x to 100x depth. That metric, Margulies said, is important to report and is independent of the sequencing technology used.
"Saying 'depth of coverage' is really tied to a specific version of a specific chemistry of a specific technology. But saying, 'With a certain level of specificity, I was able to call genotypes on X percent of this reference genome,' I'm agnostic to all of the methods, and it gives a more objective metric of how confident you can be in the calls that you have made, and that will start to help allow you to use something like this in a clinical setting," he said.
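The metric itself is simple to state: the callable fraction is the share of assessed reference positions at which a genotype passed the filters. A minimal sketch, using hypothetical per-site pass/fail flags:

```python
# Technology-agnostic "callable fraction": the share of assessed genome
# positions at which a confident genotype call was made. The input flags
# here are invented toy data, one boolean per reference position.

def callable_fraction(passed_filter) -> float:
    """Fraction of assessed positions with a confident genotype call."""
    flags = list(passed_filter)
    return sum(flags) / len(flags)

# Toy example: 8 of 10 positions called confidently
sites = [True, True, False, True, True, True, False, True, True, True]
print(callable_fraction(sites))  # 0.8
```

Reporting this number at a stated specificity, rather than a raw average depth, is what makes the figure comparable across platforms and chemistries.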
Sequencing a genome to higher depth and using filters also saves validation costs later, he said. "Sequence a little more, and you'll get so much more accuracy. This way, you don't have to spend the time and the money and the effort in trying to validate too many things, because you'll already have a much greater reduced false-positive rate," Margulies said. "It becomes more cost-effective in the overall grand scheme of things to sequence more, so you can more easily analyze the data downstream."
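The cost argument can be made concrete with a toy break-even calculation. Every price and false-positive count below is invented purely for illustration; none comes from the article or the paper.

```python
# Toy model of the "sequence more up front" tradeoff: sequencing cost grows
# with depth, while validation cost shrinks because deeper data plus filters
# leave fewer false positives to follow up. All numbers are invented.

def total_cost(depth, seq_cost_per_x, fp_count_at_depth, validation_cost_per_fp):
    """Total project cost = sequencing cost + downstream validation cost."""
    return depth * seq_cost_per_x + fp_count_at_depth * validation_cost_per_fp

low  = total_cost(30, seq_cost_per_x=100, fp_count_at_depth=2000, validation_cost_per_fp=5)
high = total_cost(50, seq_cost_per_x=100, fp_count_at_depth=200,  validation_cost_per_fp=5)
print(low, high)  # 13000 6000: in this toy scenario, deeper sequencing is cheaper overall
```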
According to David Dooling, assistant director of the Genome Institute at Washington University, that might be a good strategy for sequencing projects like the 1000 Genomes Project, where little validation will be done, but not necessarily for cancer genome sequencing.
Because cancer samples are often heterogeneous and impure, and only a fraction of the cells may contain a somatic mutation, "you tend to tune your algorithms to be much more sensitive, and tolerate a higher false-positive rate, because then you go back and validate those somatic variants," he explained.
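The depth demands of tumor sequencing follow from simple sampling statistics: if a somatic mutation is present in only a fraction of the reads at a site, a binomial model gives the chance of sampling enough variant-supporting reads to detect it. The allele fraction, depths, and read cutoff below are illustrative, not taken from the article.

```python
# Detection power for a subclonal variant: probability of sampling at least
# min_alt_reads variant-supporting reads at a given depth, when the variant
# is present in fraction `allele_fraction` of the DNA molecules.

from math import comb

def detection_probability(depth: int, allele_fraction: float,
                          min_alt_reads: int = 4) -> float:
    """P(at least min_alt_reads supporting reads), binomial sampling model."""
    p_miss = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction)**(depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_miss

# A variant at 10% allele fraction is often missed at 30x but reliably
# sampled at 100x (roughly 0.35 vs 0.99 under this toy model):
print(round(detection_probability(30, 0.10), 2))
print(round(detection_probability(100, 0.10), 2))
```

This is why, as Dooling notes, somatic callers are tuned toward sensitivity and a higher false-positive rate, with validation cleaning up afterward.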
Whether it is more cost-effective to sequence deeply up front with stringent filters, or to validate candidate variants later with targeted sequencing, thus depends on the type of experiment. But as the cost of sequencing comes down, "it shifts the balance towards just more and more sequence at the front end," Dooling said.
He noted that researchers are already reporting how comprehensively they sequenced a human genome by comparing their data to homozygous and heterozygous SNPs on a genome-wide SNP array, but additional standards will likely arise. "Certainly, as sequencing transitions into the clinical space, whether people want them or not, these sorts of standards are going to arrive," he said. "It's just a matter of making them appropriate for the various different types of experiments that people want to do."
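The array-based check Dooling describes amounts to comparing sequencing genotype calls against array genotypes at overlapping sites. A minimal sketch with made-up data (the positions and genotypes below are invented for illustration):

```python
# Sketch of an array-based completeness/accuracy check: at sites genotyped
# by a SNP array, ask what fraction were also called by sequencing, and how
# often the two genotypes agree. Positions and genotypes are toy data.

def array_concordance(array_gts: dict, seq_gts: dict):
    """Return (completeness: fraction of array sites called by sequencing,
               concordance: fraction of those calls agreeing with the array)."""
    called = {pos: gt for pos, gt in array_gts.items() if pos in seq_gts}
    if not array_gts:
        return 0.0, 0.0
    agree = sum(1 for pos, gt in called.items() if seq_gts[pos] == gt)
    completeness = len(called) / len(array_gts)
    concordance = agree / len(called) if called else 0.0
    return completeness, concordance

array = {1: "AA", 2: "AG", 3: "GG", 4: "AG"}
seq   = {1: "AA", 2: "AG", 4: "AA"}   # site 3 uncalled; site 4 discordant
print(array_concordance(array, seq))  # (0.75, 0.666...)
```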
For example, he said, demands for coverage might be different if the aim is to characterize structural variation, compared to single-nucleotide variants, and may differ between germline analyses of a genome and analyzing tumors.
Rade Drmanac, CSO of Complete Genomics, said he agrees with Margulies' finding that "high-depth sequencing certainly improves accuracy, both for germline DNA samples and even more so in tumor DNA samples." Thus, Complete Genomics sequences samples to "well over" 55x average mapped coverage and offers customers even higher-depth sequencing as an option.
Such higher depth, he said, is "particularly useful in tumors," where relevant alleles may only be present in a small fraction of the reads, and is "even more critical for more complex variants not examined in this paper."
He also said he agrees it is important to report what portion of the genome could not be called. "A call of 'homozygous reference' is clearly a different thing than a statement of 'I don't know,' although many other papers … do not quantitatively make this distinction."
According to Drmanac, what is needed to evaluate the performance of sequencing methods is a well-characterized set of reference genomes. "We need up to 100 genomes sequenced with multiple methods … with extreme coverage that have a 'perfectly' sequenced set of genomes that can be used to evaluate performance of routinely using sequencing methods," he said.
Not everything can be solved with sequencing to higher depth, however, because of technological limits. "The Holy Grail of genome sequencing is to be able to start at one end of a chromosome, and read all the way to the other end in one long read that is perfectly accurate," Margulies said. "And until we get to the day that we can do that, we are always going to be making compromises." What's important, he said, is to know what portion of the genome is "accessible" with a certain technology.
Using short-read technologies, for example, even with paired-end reads that are perfectly accurate, "there are going to be places in the genome where you can't unambiguously align these things," he said.
Long reads, for example from Pacific Biosciences, will help increase the "callable" portion of the genome, but at the moment, their error rate is too high to use them on their own.
In the meantime, combinations of technologies that generate different types of data — for example of PacBio and Illumina — could provide more accurate and complete genomes, he said.
Complete Genomics, for its part, is exploring ways to prepare multiple libraries from the same sample in order to increase the completeness and accuracy of a genome, including haplotyping, Drmanac said.
Have topics you'd like to see covered in Clinical Sequencing News? Contact the editor at jkarow [at] genomeweb [.] com.