NEW YORK (GenomeWeb) – Many regions of the genome cannot be confidently sequenced and accurately variant called with current whole-genome and exome sequencing methods, according to researchers from the National Institute of Standards and Technology and Stanford University.
Reporting this week in Genome Medicine, the group used the reference genome released by the Genome in a Bottle Consortium with benchmark SNV, indel, and homozygous reference genotypes based on the NA12878 genome to assess one whole-genome pipeline and an exome sequencing pipeline for specific classes of variations in medically relevant genomic regions. They evaluated sensitivity and positive predictive value of the pipelines as well as the relationship between accuracy and genomic complexity.
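As a rough illustration of the two metrics, sensitivity is the share of benchmark variants that a pipeline recovers, and positive predictive value is the share of a pipeline's calls that the benchmark confirms. The sketch below is not the study's evaluation code: the function name, the tuple representation of a variant, and the toy data are all assumptions made for illustration, and it assumes both call sets have already been restricted to the same regions and normalized so that matching variants compare equal.

```python
def sensitivity_and_ppv(benchmark_variants, pipeline_calls):
    """Compare two sets of (chrom, pos, ref, alt) tuples.

    Hypothetical helper: assumes both sets cover the same regions and
    that variants are represented identically in each set.
    """
    true_pos = len(benchmark_variants & pipeline_calls)   # called and in benchmark
    false_neg = len(benchmark_variants - pipeline_calls)  # in benchmark, not called
    false_pos = len(pipeline_calls - benchmark_variants)  # called, not in benchmark

    called_total = true_pos + false_pos
    truth_total = true_pos + false_neg
    sensitivity = true_pos / truth_total if truth_total else float("nan")
    ppv = true_pos / called_total if called_total else float("nan")
    return sensitivity, ppv


# Toy data, invented for illustration only.
benchmark = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "GA"),
    ("chr2", 75, "T", "C"),
}
calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 75, "T", "C"),
    ("chr3", 10, "A", "T"),
}

sens, ppv = sensitivity_and_ppv(benchmark, calls)
print(f"sensitivity={sens:.2f} ppv={ppv:.2f}")
```

In this toy case the pipeline misses one benchmark indel and makes one unconfirmed call, so both metrics come out at three in four.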
The group found that the pipelines' accuracy varied depending on the genomic region where the variant is located, the type of variant, and how well the genomic region is covered.
In addition, when they assessed the reference dataset, they found that only 990 genes are located in regions that the Genome in a Bottle Consortium genotyped with high confidence.
"This means that our benchmark genome cannot currently be used to assess performance for more challenging genes and other difficult regions of the genome that already are being tested or for which new sequencing methods are being developed," NIST biomedical engineer Justin Zook said in a statement. "The harder-to-characterize regions that we can't yet sequence with confidence include regions known to be clinically important," he added.
To compare sequencing pipelines with the Genome in a Bottle reference sample, researchers at the Garvan Institute of Medical Research performed exome sequencing using the Nextera exome kit and Illumina HiSeq 2000 instrument, while the NIST team performed whole-genome sequencing using Nextera PCR-free v2 chemistry and the HiSeq 2500 to 50X and 30X coverage, respectively. Each team used different variant-calling pipelines.
They then compared their variant calls with the reference.
The team focused their analysis specifically on two medically relevant gene sets: the ACMG 56 genes and the 3,300 genes that are in either the ClinVar or OMIM databases and have known relevance to human disease.
In order to benchmark the accuracy of the sequencing protocols, the researchers focused the analysis only on "high-confidence" regions — areas of the reference material that the Genome in a Bottle Consortium felt it had genotyped confidently. Areas that were not considered high-confidence regions were areas of low coverage; areas that contained paralogous sequences, repetitive elements, structural variants, or segmental duplications; and areas where all sequencing chemistries were prone to systematic errors, the authors wrote. Then they looked at the proportion of the whole-genome and exome sequencing data that fell into those regions.
In the high-confidence regions, the whole-genome sequencing pipeline had equal or higher sensitivity compared to exome sequencing for both SNVs and indels.
For exome sequencing, the primary reason for low sensitivity was insufficient coverage, while for whole-genome sequencing, the primary reason for false negative calls was that the calls were filtered out because they fell in difficult-to-sequence or difficult-to-call regions of the genome.
The researchers also identified 39,301 loci as sites with systematic errors: loci where the benchmark data contained a high-confidence reference call but at least one sequencing technology incorrectly called a variant. In addition, 7,467 of those variants are in a variant database, where they could be either errors caused by systematic bias or real variants that are not present in the reference genome, which highlights the difficulty of distinguishing between real calls and errors.
For example, the researchers noted that one site in particular, a truncating variant in the BRCA2 gene, was likely to be a real, disease-associated variant. However, it is an indel in a homopolymer region that was flagged as being a site of "systematic error."
Finally, the researchers assessed their ability to confidently call variants in medically relevant genes. They calculated that 82.1 percent of the bases in coding regions of the 56 ACMG genes were in high-confidence regions. In some cases, the entire gene was outside of a high-confidence area, while in other cases the gene was completely within the high-confidence area.
In addition, only 74.6 percent of the ClinVar and OMIM genes' exonic bases and only 72.7 percent of all exonic bases in protein coding genes were in high-confidence regions.
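Percentages like these come down to interval arithmetic: count the exonic bases that also fall inside a high-confidence region and divide by the total number of exonic bases. The following is a minimal sketch of that calculation, not the authors' method. It assumes half-open (start, end) coordinates on a single chromosome, and the function names and toy intervals are hypothetical.

```python
def merge_intervals(intervals):
    """Sort half-open (start, end) intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_bases(a, b):
    """Count bases shared by two merged, sorted interval lists (two-pointer sweep)."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def fraction_high_confidence(exons, high_conf):
    """Fraction of exonic bases that lie inside high-confidence regions."""
    exons = merge_intervals(exons)
    high_conf = merge_intervals(high_conf)
    exon_bases = sum(end - start for start, end in exons)
    if exon_bases == 0:
        return float("nan")
    return overlap_bases(exons, high_conf) / exon_bases


# Toy coordinates, invented for illustration only.
exons = [(100, 200), (300, 400), (150, 250)]
high_conf = [(0, 180), (320, 500)]

frac = fraction_high_confidence(exons, high_conf)
print(f"{frac:.1%} of exonic bases are in high-confidence regions")
```

Merging the exon intervals first keeps overlapping annotations from being counted twice. A real analysis would repeat the calculation per chromosome, typically from BED files.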
"The challenge now is to focus our efforts on the other 23 percent — namely, on regions of the genome that remain elusive," lead author of the study, Rachel Goldfeder, said in a statement.