Variant Interpretation Algorithm Discordance Highlights Need for Caution in Using Results

NEW YORK (GenomeWeb) – Researchers from Baylor College of Medicine have assessed the applicability of the current American College of Medical Genetics and Genomics and Association for Molecular Pathology (ACMG/AMP) guidelines for algorithm use in characterizing variant pathogenicity.

In a study published last week in Genome Biology, the researchers sought to systematically explore the performance of current variant interpretation algorithms, particularly for missense variants.

Specifically, the researchers used more than 14,000 benign or pathogenic missense variants from ClinVar to compare the predictive power of 25 commonly used algorithms. Although they identified several algorithms that they say couple high predictive power with robust performance across variables such as disease mechanism, level of constraint, and mode of inheritance, their results largely showed significant disagreement among algorithms in characterizing variants as pathogenic or benign.

These results highlight the limitations of implementing the current ACMG/AMP guidelines for in silico algorithms, according to Sharon Plon, a professor in Baylor's departments of pediatrics/hematology-oncology and molecular and human genetics and one of the authors on the paper.

"One particularly interesting but somewhat troubling finding from the analysis is that the algorithms overall are more likely to call variants pathogenic than to call them benign," she said. "There are a significant number of variants that are benign as listed in ClinVar that every algorithm called pathogenic."

As noted in the paper, generating evidence by in silico methods is a routine part of assessing novel variants found in whole-exome or whole-genome sequencing experiments. In clinical settings, predictions from algorithms are included as one of eight evidence criteria recommended by the ACMG/AMP guidelines for variant interpretation.

However, many algorithms typically used for variant interpretation are applied to data "without additional calibration," the authors wrote. Moreover, different testing laboratories use different combinations of algorithms to interpret variants, which further contributes to discordant interpretations.

In addition, there is little consensus among clinical labs on which, and how many, algorithms to use for interpreting missense variants, which are a major source of variants of uncertain significance (VUS) in ClinVar. One exome sequencing study that the authors highlight, which sought to classify variants in 180 medically relevant hereditary cancer genes, reported a higher VUS rate when requiring full concordance rather than majority agreement among the 13 algorithms used in its pipeline.
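As a rough illustration of why the consensus rule matters, the sketch below contrasts a full-concordance rule with a majority rule on a handful of hypothetical per-variant calls; the variants, calls, and function names are invented for illustration and do not come from the study.

```python
# Minimal sketch: how the consensus rule changes how many variants remain
# unresolved. All calls below are hypothetical ("P" = pathogenic, "B" = benign).
calls = {
    "var1": ["P", "P", "P", "B"],  # majority pathogenic, not unanimous
    "var2": ["B", "B", "B", "B"],  # unanimous benign
    "var3": ["P", "B", "P", "B"],  # evenly split
}

def classify(preds, rule):
    """Return a classification only when the algorithms meet the consensus rule."""
    n_path = preds.count("P")
    if rule == "full":  # all algorithms must agree
        if n_path == len(preds):
            return "pathogenic"
        if n_path == 0:
            return "benign"
    elif rule == "majority":  # a simple majority suffices
        if n_path > len(preds) / 2:
            return "pathogenic"
        if n_path < len(preds) / 2:
            return "benign"
    return "VUS"  # no consensus, so the variant stays uncertain

for rule in ("full", "majority"):
    print(rule, {v: classify(p, rule) for v, p in calls.items()})
# The "full" rule leaves var1 and var3 uncertain; "majority" resolves var1,
# illustrating why stricter concordance requirements inflate the VUS rate.
```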

"We focused on this issue of concordance because the ACMG guidelines … says basically to use these algorithms if they agree when you are classifying variants," Plon said. "[We were] interested in this question 'well how much agreement is there and do different types of variants have more disagreement or not?'"

To determine the level of concordance among algorithms for known pathogenic and benign variants, the researchers took 14,819 missense variants from ClinVar that had been tagged as pathogenic or benign by at least one submitter and then annotated these variants with scores and predictions using the 25 algorithms. These algorithms were selected based on a search of the biomedical literature for articles related to clinical variant classification that stated which algorithms they used in their abstracts.

In one analysis, the researchers used the scores from 18 of the algorithms to classify the variants as pathogenic or benign based on publicly available thresholds of pathogenicity. According to the results, only 5.2 percent of benign variants and 39.2 percent of pathogenic variants were called concordantly across all the algorithms. Some algorithms did not return predictions for some of the variants.
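In outline, that kind of full-concordance check can be sketched as follows; the scores, thresholds, and algorithm names here are invented stand-ins for the published per-algorithm cutoffs, and missing predictions are simply skipped, one plausible way to handle algorithms that return no score.

```python
# Sketch: binarize per-algorithm scores against published-style cutoffs and
# test whether every algorithm that returned a score agrees with the expected
# class. All scores, thresholds, and names below are hypothetical.
scores = {
    # variant: {algorithm: raw score, or None if no prediction was returned}
    "chr1:g.100A>G": {"algA": 0.91, "algB": 0.88, "algC": None},
    "chr2:g.200C>T": {"algA": 0.12, "algB": 0.75, "algC": 0.30},
}
thresholds = {"algA": 0.5, "algB": 0.5, "algC": 0.5}  # score >= cutoff => pathogenic

def fully_concordant(variant_scores, expect_pathogenic):
    """True when all available predictions match the expected class."""
    preds = [variant_scores[alg] >= thresholds[alg]
             for alg in variant_scores if variant_scores[alg] is not None]
    return bool(preds) and all(p == expect_pathogenic for p in preds)

for variant, s in scores.items():
    print(variant, "concordant-pathogenic:", fully_concordant(s, True))
```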

When they repeated the experiment on a smaller dataset of about 8,300 variants, they saw a similar trend: the algorithms showed concordance for only 3.2 percent of benign and 41.5 percent of pathogenic variants. They also saw similar results when they restricted their analysis to variants labeled benign or pathogenic in ClinVar by at least two independent laboratories.

The researchers also found that, on average, pairs of algorithms differed from each other significantly more in their interpretation of benign than of pathogenic variants. They concluded from this that "while interpreting large numbers of variants, full concordance as suggested by the ACMG/AMP guidelines is less likely to be achieved even when using only two algorithms particularly for benign variants."
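A pairwise disagreement tally of this sort might look like the sketch below, with hypothetical boolean calls (True = pathogenic) and ClinVar labels standing in for the annotated dataset.

```python
from itertools import combinations

# Sketch: count pairwise disagreements between algorithms, split by ClinVar
# label. The calls and labels below are hypothetical placeholders.
preds = {
    "algA": {"v1": True, "v2": False, "v3": True},
    "algB": {"v1": True, "v2": True, "v3": True},
    "algC": {"v1": False, "v2": False, "v3": True},
}
labels = {"v1": "pathogenic", "v2": "benign", "v3": "pathogenic"}

for a, b in combinations(preds, 2):
    for label in ("benign", "pathogenic"):
        variants = [v for v, lab in labels.items() if lab == label]
        n_disagree = sum(preds[a][v] != preds[b][v] for v in variants)
        print(f"{a} vs {b}, {label}: {n_disagree}/{len(variants)} disagree")
```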

Plon et al. also explored the level of concordance among 18 of the most commonly used algorithms, including PolyPhen, SIFT, and CADD, using about 7,300 variants. They found that predictions from five of these algorithms resulted in 79 percent concordance for pathogenic variants and 33 percent for benign variants. Moreover, about 10 percent of the variants in the dataset that were classified as benign were mischaracterized as pathogenic by five of the commonly used algorithms, and just under 1 percent of pathogenic variants were mischaracterized as benign by the five algorithms.

Overall, about 23 percent of the benign variants in the dataset were labeled pathogenic by about half of the 18 algorithms tested, including about 87 variants with three-star review status in ClinVar. Of the known pathogenic variants in the dataset, just over 5 percent were characterized as benign by around half of the algorithms. No combination of the 18 algorithms tested in this study achieved a false concordance of zero and a true concordance of 100 percent, according to the researchers.

"No one assumes that algorithms should be perfect … [but] I think the proportion of variants from multiple different algorithms that were miscalled … was a little surprising," Plon said. "And I think that does highlight that laboratories need to use this data with caution."

The researchers also generated all possible combinations of three, four, and five algorithms and computed their true and false concordance rates across the larger dataset of 14,819 variants. In practice, labs use anywhere from two to 10 algorithms in their variant interpretation pipelines and apply varying criteria to accept or dismiss the results. For example, some labs take an all-or-nothing approach while others require only that a majority of algorithms agree.
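A brute-force search over algorithm combinations along these lines could be sketched as follows, where "true concordance" means all members of a combination agree with the ClinVar label and "false concordance" means they agree with each other but contradict it; the predictions and labels are invented placeholders.

```python
from itertools import combinations

# Sketch: enumerate combinations of algorithms and score each by true and
# false concordance rates. preds and labels are hypothetical stand-ins.
preds = {
    "algA": {"v1": True, "v2": False, "v3": False},
    "algB": {"v1": True, "v2": False, "v3": True},
    "algC": {"v1": True, "v2": True, "v3": True},
    "algD": {"v1": True, "v2": False, "v3": False},
}
labels = {"v1": True, "v2": False, "v3": True}  # True = pathogenic in ClinVar

def concordance_rates(combo):
    true_c = false_c = 0
    for variant, label in labels.items():
        votes = {preds[alg][variant] for alg in combo}
        if len(votes) == 1:  # the combination is unanimous on this variant
            if votes == {label}:
                true_c += 1
            else:
                false_c += 1
    n = len(labels)
    return true_c / n, false_c / n

for k in (3, 4):  # the study also went up to combinations of five
    for combo in combinations(preds, k):
        tc, fc = concordance_rates(combo)
        print(combo, f"true={tc:.2f} false={fc:.2f}")
```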

According to the researchers' results, the best-performing combinations of algorithms differed for benign and pathogenic variants. For example, combining the VEST3, REVEL, and MetaSVM algorithms resulted in the most accurate characterization of benign variants, with a true concordance rate of 81.3 percent and a false concordance rate of 2.8 percent. For pathogenic variants, combining the MutationTaster, M-CAP, and CADD algorithms returned the best results. However, these combinations "are relevant only in the context of the particular dataset we used and may not be optimal across other designs," the authors caution.

They also grouped the algorithms into clusters based on criteria such as whether they relied on evolutionary conservation, and found that combining algorithms drawn from three different clusters, for example, resulted in far more discordant predictions of pathogenicity.
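The cross-cluster idea can be pictured with a short sketch: group algorithms by methodology and draw one from each group. The cluster names and memberships below are invented for illustration, not the paper's actual clustering.

```python
from itertools import product

# Sketch: assign algorithms to methodology clusters (assignments invented
# for illustration) and enumerate combinations that mix all three clusters.
clusters = {
    "conservation": ["algA", "algB"],
    "meta": ["algC"],
    "structural": ["algD", "algE"],
}

# One algorithm from each cluster yields a methodologically mixed trio.
for combo in product(*clusters.values()):
    print(combo)
# The study found that such cross-cluster combinations produced far more
# discordant pathogenicity predictions than picks from a single cluster.
```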

These are some of the reasons why Plon and her colleagues do not recommend particular combinations of the algorithms that they discuss in the paper, opting instead to let labs make that decision themselves based on the questions that they are trying to answer.

Also, "we are a team of three investigators not a professional organization," Plon said. "We wanted to put the data out there so that it's publicly available for folks that are making those determinations." That includes, for example, investigators involved in developing and maintaining the ClinGen resource. Plon, who is one of the investigators on the project, said she expects that the group will issue some recommendations governing the use of algorithms in characterizing variants in the near future.

"The ClinGen sequence variant interpretation group, which I am a member of, is hard at work looking at each of the ACMG evidence codes and making recommendations to help labs that are using them," she told GenomeWeb. The group is relying not just on data from this paper but also from others such as this one published last year in the American Journal of Human Genetics that looked at the performance of the ACMG-AMP variant interpretation guidelines in nine laboratories. That study, which was done by the Clinical Sequencing Exploratory Research (CSER) consortium, similarly reported that algorithm use was a "major source" of discordance among different clinical laboratories and called for further recommendations from ACMG/AMP. 

"I think there is an effort within the genetics community to take this long list of evidence codes and do some exploratory work to try to find insights on what's the best way to apply them," she said. Ultimately, "any given laboratory has to decide what's the best method for the clinical test that they are offering."

Though no algorithm performed perfectly, the researchers noted that some seem to perform better than others. For this analysis, they examined the aforementioned 14,819 ClinVar variants, which were assigned at least one star, and a second batch of 2,966 ClinVar variants with at least a two-star rating. Their results, which are described in more detail in the paper, suggest that some of the newer algorithms, which incorporate multiple different predictions to make their final assessment, perform better than older algorithms that rely on individual predictions, Plon said.

"But our analysis of the literature, and my personal experience reading lab reports, suggested that many laboratories still use the older algorithms," she said. "So one of our other recommendations was for laboratories to look at their clinical pipelines and to consider whether updating the algorithms that they are using or using different combinations might be more effective." It's also important for labs not to overuse the predictions of these algorithms unless they've done some additional work to show that the algorithm works particularly well with that gene in question, Plon added.

The authors further encourage clinicians reviewing diagnostic reports that include results from algorithms to be aware of the algorithms' variable performance as well as the problem of false concordance. This is especially important for clinicians who may not understand the limitations of these algorithms when they receive patient reports that include computational characterizations of variants.

"I think the paper makes an important point that there are high-performing algorithms but there is a chance for error in particular in calling a variant pathogenic that is actually benign," Plon said. "When clinicians talk about reports, they sometimes weigh those predictions in their minds quite heavily. I certainly will incorporate that more into my teaching and my own evaluation of patient reports."