NEW YORK (GenomeWeb) – As genomic sequencing continues to accelerate in the research sphere and the clinic, the inadequacies of current systems for collating, annotating, and interpreting discovered variants have become more apparent, and the need to standardize or harmonize existing tools has become more pressing.
In a report in Genome Medicine last week, researchers from human genome sequencing firm Personalis shared results from their comparison of different tools for translating genomic variant data into a syntax that can be referenced across legacy transcriptomic or proteomic reports. Overall, they found there was "significant inconsistency," both from tool to tool and in comparison to existing databases.
"While some of these syntax differences may be clear to a clinician, they can confound variant matching, an important step in variant classification," the authors wrote. This highlights an "urgent need for the adoption and adherence to uniform standards," they added.
Heidi Rehm, director of the Laboratory for Molecular Medicine at Partners Healthcare Personalized Medicine, who was not involved in the project, agreed that pointing out the major issues with nomenclature and annotation is an important effort. The challenges that the Personalis study highlights are indeed a major issue, she wrote in an email.
Ensuring that calls on a variant’s pathogenicity don't vary from lab to lab or clinician to clinician is drawing increasing attention in the clinical sequencing community. Variant annotation, basically the description of a variant's location and the prediction of its functional effect, may seem like a small piece of the larger classification picture, but it is actually a fundamental part of a clinical assessment, first author Jennifer Yen and her colleagues wrote.
To match variants called by genomic position against resources that rely on transcript- or protein-based descriptions, researchers use a variety of automated syntax generation tools, for which there is increasing demand, according to the Personalis team.
"Before the availability of a reference genome, all the variant nomenclature, and people reporting on variants, was done originally [based] on transcript and protein sequencing [data] because that was all they had," Yen said in an interview.
"Now, all the data analysis is genomic, so what people are doing is translating from the primary source, the genomics, to secondary sources, which is the transcript or protein sequences so that we can cross-match with these legacy databases and with the literature, which is always protein-based."
Especially when cross-referencing reports in the scientific literature, a researcher might never find a particular clinical reference without knowing that the annotation needs to be tweaked, she said.
In their study, Yen and colleagues compared three of the automated tools developed for this purpose and found that their results can vary, creating discordance that hampers efforts to build understanding or consensus on genetic changes and their links to disease.
For their comparison, the group curated a test set of 126 variants to establish a "ground truth" that they could then use to evaluate the accuracy of three engines that generate transcript- and protein-based variant nomenclature from genomic coordinates — SnpEff, Variant Effect Predictor, and Variation Reporter.
Standards and guidelines for describing variants at the genomic, transcript, and protein levels are provided by the Human Genome Variation Society (HGVS), which developed and published initial recommendations in 2000, when testing was still largely transcript- rather than genome-based. The society then updated these guidelines last year to reflect changes in various nomenclature descriptions. SnpEff, VEP, and VR all produce HGVS syntax.
For its ground truth set, the Personalis team chose 50 variants from public repositories, including ClinVar, dbSNP, and COSMIC. They also added 76 synthetic variants to make sure the set represented variants of different types and with different genomic features.
Evaluating annotations according to the HGVS guidelines, the researchers labelled annotations as either "exact" or "equivalent" matches, both of which were considered correct for the purposes of the study.
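As a rough illustration of how such a three-way labelling might work, the Python sketch below compares a tool's protein-level expression with a reference expression. It is not the study's method: the rule that "equivalent" means differing only in amino acid code length or optional parentheses is an assumption made here, and the function names and example strings are hypothetical.

```python
import re

# Standard three-letter to one-letter amino acid codes ("Ter" = stop).
AA3_TO_1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}
AA3_PATTERN = re.compile("|".join(AA3_TO_1))


def canonical(p_hgvs):
    """Reduce a protein-level expression to a crude canonical form.

    Assumption for illustration only: drop parentheses and convert
    three-letter amino acid codes to one-letter codes.
    """
    stripped = p_hgvs.replace("(", "").replace(")", "")
    return AA3_PATTERN.sub(lambda m: AA3_TO_1[m.group(0)], stripped)


def classify(tool_expr, truth_expr):
    """Label a tool's annotation against the ground truth expression."""
    if tool_expr == truth_expr:
        return "exact"
    if canonical(tool_expr) == canonical(truth_expr):
        return "equivalent"
    return "discordant"


def concordance(pairs):
    """Fraction of (tool, truth) pairs counted as correct.

    Both "exact" and "equivalent" count as correct, as in the article.
    """
    correct = sum(classify(t, g) != "discordant" for t, g in pairs)
    return correct / len(pairs)


# Hypothetical expressions, not taken from the study's test set.
examples = [
    ("p.Arg97Gly", "p.Arg97Gly"),    # identical strings
    ("p.R97G", "p.(Arg97Gly)"),      # same change, different syntax
    ("p.Arg97Gly", "p.Arg97Ser"),    # different change
]
labels = [classify(tool, truth) for tool, truth in examples]
```

A plain string comparison would treat the second pair as a mismatch, which is the kind of syntax difference that can confound variant matching.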
They further evaluated the concordance between annotations generated by SnpEff and VEP with those in two major databases: ClinVar for germline variants and COSMIC for cancer mutations.
Overall, the team found that there was imperfect concordance between the tools at the coding level and, even more so, at the protein level.
Looking at SnpEff and VEP in relation to ClinVar, the team found that the rate of "exact" concordance for SNVs was remarkably high — over 99.5 percent between ClinVar and the tools. This dropped to less than 90 percent for non-SNV variants.
At the protein level, concordance was high for SNVs but for other alterations, up to 70 percent of the annotations by the two syntax tools did not perfectly match the ClinVar HGVS, and up to 20 percent were completely discordant.
For COSMIC, the team found that the rate of exactly concordant calls at the coding level was about 86 percent for VEP, compared to 77 percent for SnpEff.
For deletions, the team reported that concordance between the tools' and COSMIC's nomenclature was less than 58 percent. Because SnpEff and VEP syntax agreed with each other in the vast majority of these cases, the results suggest that it is the COSMIC syntax that is incorrect.
"Based on the agreement of VEP and SnpEff alone, our results suggest that between at least 5 and 10 percent of COSMIC variant annotations are incorrect. This is concerning, given its transition from a research repository to a major clinical resource, although efforts to comply with genomic and HGVS standards are apparently underway," the authors wrote.
According to Liying Zhang, director of the diagnostic molecular genetics laboratory at Memorial Sloan Kettering Cancer Center, some of the study findings, such as the fact that existing annotation tools generally do better on SNVs than on indels, are fairly well known in the field.
"We are fully aware of the limitations on indels given the complexity of the genomic sequence in some regions and HGVS rules are set by the people in the field. It is not uncommon to see the discordant annotations in existing databases as some variants were annotated before HGVS rules and the free annotations tools were available," she said in an email.
Zhang also questioned whether the Personalis team's selection of variants for their ground truth set may overestimate discordance, because of the choice of indel variants and variants in difficult regions like splice sites.
"These variants are relatively less common than SNVs which have high concordance rate," she explained.
Yen said that she and her colleagues hoped their challenging ground truth methodology would draw attention to the issue that there is not as much uniformity in how these tools and databases describe a variant, even at the genomic level, as clinical diagnostics should demand.
"We had these cases we were missing, where you have to do a lot of manual interpretation because the tools are incorrect," Yen said.
Illustrating this, the authors highlighted one particular frameshift variant in the PROK2 gene, which was classified differently by two curators in Personalis' own lab. One person classified it as likely pathogenic and the other as pathogenic for Kallmann syndrome.
Looking at the annotation, it became clear that this difference in classification stemmed from the use of different syntax. "Because of alternative transcripts and HGVS representations, this variant could be searched by multiple expressions," the authors wrote. Using one expression, someone could immediately pick up the relevant literature to classify the variant. But using others, including the correct HGVS syntax, would not return any relevant results.
Tools and databases used for clinical diagnostic purposes should be subject to rigorous scrutiny, according to Yen and her colleagues. Given that their results showed that some of these resources — in this case COSMIC and Variation Reporter — don't always conform to HGVS nomenclature, it should be a signal to labs that they should be paying close attention.
Sharing experiences and findings of similar errors or issues will be important for improving concordance across laboratories as a community, they added.
One thing that could help with this is the Personalis team's ground truth variant set, which they believe they have constructed to stringently test the limits of HGVS annotation tools. The authors wrote that they encourage other labs to use the dataset as a quality assurance check to evaluate their own in-house annotations.
"With this report, we really wanted to try to quantify this and to point out real cases of this happening. If you are not 100 percent, it's not sufficient for a clinical diagnostic lab, and we found the tools were 10 percent, sometimes 20 percent wrong, so we wanted to get that number to the community," Yen said.
"The presence of duplicates in over one third of the COSMIC VCF highlights the importance of using tools for normalization to reconcile the multiple possible positions representing a single variant. At the level of HGVS, we found that the syntax produced by the tools was far more reliable than the syntax in the ClinVar and COSMIC databases … However, none of the non-SNV variant types were annotated with near 100 percent accuracy or compliance with HGVS conventions by either tool or database," the team wrote.
"Given the meticulous reporting requirements of a clinical genetics lab, this is concerning and suggests that it remains critical to manually review the syntax when reporting non-SNVs," they added.
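The normalization the authors refer to, reconciling multiple possible positions that represent a single variant, can be sketched in Python as follows. This is a minimal illustration of the widely used left-align-and-trim approach for VCF-style alleles, not the procedure used in the study; the function name, the toy reference sequence, and the example variants are all hypothetical.

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one VCF-style variant.

    ref_seq is the reference sequence as a string, pos is 1-based,
    and ref/alt are the allele strings. Returns (pos, ref, alt).
    """
    if ref == alt:
        raise ValueError("ref and alt alleles must differ")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past sequence start")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt


# Toy reference with a CA repeat; each record below deletes one "CA" unit
# and yields the same resulting sequence.
REFERENCE = "TGCACACAGT"
records = [
    (2, "GCA", "G"),
    (5, "CAC", "C"),
    (6, "ACA", "A"),
]
normalized = {normalize(REFERENCE, *record) for record in records}
```

After normalization, the three records collapse to a single representation, so they can be recognized as duplicates. Note that this VCF-style convention shifts variants to the left, whereas HGVS nomenclature places such variants at the most 3' position, so the two conventions differ by design.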
"The take home message is really that if you are sharing data to the public domain, you have to be aware of these HGVS standards so your finding has the greatest reach. Because if it’s inaccurate, people are going to propagate that inaccurate information, and for clinicians, you have to be aware of this variation so you don't miss anything," Yen said.
The impact of improper translation from genomic to protein syntax is not necessarily only a problem for clinical diagnostics, she added.
In the growing immunotherapy space, there are multiple efforts to test methods for developing personalized cancer vaccines based on identification of tumor neoantigens.
"Because that also relies on translation from genomic to protein [data], if you do that wrong, you are going to have an ineffective peptide vaccine," Yen said.
While Yen and her coauthors focused on open source annotation tools, and implications for community consensus, many companies — Myriad Genetics for example — have resisted efforts toward more open sharing of genomic variant data.
On its own website, Personalis highlights proprietary aspects of its own analysis pipeline that play an important role in providing accurate variant classification. This includes private or exclusively licensed manually-curated databases that the company says provide "high-quality structured data to power our downstream analytics."
But Yen said that the purpose of her and her colleagues' study was not to blast open-source tools or public databases like ClinVar and COSMIC.
In an email, Yen said that by presenting its findings to the community, Personalis hopes "to raise awareness of the issues and demonstrate that we are actively trying to solve these problems."
"We agree that data sharing is important not only for consistency but also for better interpretation and diagnosis. Many companies, including our own, are moving towards mechanisms to share this data without compromising patient confidentiality and privacy," she wrote.
"We are only at the tip of the iceberg of sequencing data that will be available in the next few years — if we don't begin to address these annotation issues now, it will become a pain point for everyone."