Separate research teams, one from the Wellcome Trust Sanger Institute and the European Bioinformatics Institute and the other from the University of Washington and the HudsonAlpha Institute for Biotechnology, have published papers in Nature Methods and Nature Genetics, respectively, describing complementary approaches for predicting the pathogenicity of variants in coding and non-coding parts of the genome.
Both teams were motivated by a shared need for informatics tools that can analyze much more than just protein-altering mutations, and that can draw on the plethora of annotations produced by projects such as ENCODE to make decisions about variants' functions. The methods are similar in that both rely on machine learning: the Genome Wide Annotation of Variants (GWAVA) algorithm, described in Nature Methods, is a modified random forest, while the Combined Annotation-Dependent Depletion (CADD) framework, described in Nature Genetics, incorporates a support vector machine in its workflow. But they are designed for slightly different purposes. GWAVA scores regulatory variants, which are found in non-coding portions of the genome, while CADD is designed to score variants in both coding and non-coding genomic regions.
According to the Nature Methods paper, in developing GWAVA the researchers aimed to "use a wide range of variant-specific annotations of different classes and at a range of genomic scales to investigate if a combination of regulatory annotations, genic context, and genome-wide properties can be used to identify variants likely to be functional." The reason for combining the annotations, the paper explains, is that although functional regulatory variants have a different distribution of these annotations than control variants do, "on their own these differences are insufficient to allow us to discriminate functional variants from controls with reasonable precision."
To identify which annotations would work best, Graham Ritchie, a postdoctoral research fellow at the EBI and WTSI and one of the authors on the Nature Methods paper, explained that the team first pulled non-coding variants from the public release of the Human Gene Mutation Database and then annotated them with information from ENCODE, GENCODE, and other sources. They also selected three control datasets from the 1000 Genomes Project: the first was a set of variants selected at random from across the genome; the second comprised variants "matched for distance" to the nearest gene, though not necessarily genes close to HGMD variants; and the third comprised variants actually located near the HGMD variants.
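To make the construction of those control sets concrete, here is a minimal Python sketch of how such datasets might be assembled. The file names, column layout, matching tolerance, and distance window are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of assembling the three control sets; input files,
# column names, and matching windows are illustrative assumptions.
import pandas as pd

hgmd = pd.read_csv("hgmd_noncoding.tsv", sep="\t")  # chrom, pos, tss_dist
pool = pd.read_csv("1kg_variants.tsv", sep="\t")    # chrom, pos, tss_dist

n = len(hgmd)

# Control set 1: variants drawn at random from across the genome.
random_controls = pool.sample(n=n, random_state=1)

# Control set 2: variants matched to the HGMD set on distance to the
# nearest transcription start site, regardless of which gene that is.
matched = []
for dist in hgmd["tss_dist"]:
    candidates = pool[(pool["tss_dist"] - dist).abs() <= 500]
    if len(candidates):
        matched.append(candidates.sample(n=1, random_state=1))
matched_controls = pd.concat(matched, ignore_index=True)

# Control set 3: variants located near the HGMD variants themselves,
# here taken as within 1 kb on the same chromosome.
def near_hgmd(row, window=1_000):
    same_chrom = hgmd[hgmd["chrom"] == row["chrom"]]
    return ((same_chrom["pos"] - row["pos"]).abs() <= window).any()

nearby_controls = pool[pool.apply(near_hgmd, axis=1)]
```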
The next step was to take the GWAVA algorithm through three training rounds, each of which compared the same list of HGMD variants to one of the three control datasets. At the end of training, GWAVA identified annotations such as GC content, evolutionary conservation, and DNase I hypersensitivity as the most informative for distinguishing between pathogenic and benign variants, the researchers wrote. To make sure GWAVA's results weren't unique to HGMD data, the researchers ran multiple tests using new sets of non-coding variants, including one that they obtained from the National Center for Biotechnology Information's ClinVar database. Results from the ClinVar test, reported in the supplementary section of the paper, indicate that GWAVA performed well on the new data.
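The shape of those training rounds can be pictured with a short scikit-learn sketch: a random forest is trained on the HGMD variants against each control set in turn, and the forest's feature importances indicate which annotations discriminate best. The input files, feature columns, and forest settings below are assumptions for illustration; GWAVA's actual modified random forest is described in the paper itself.

```python
# Illustrative sketch of the three training rounds: the same HGMD set is
# compared against each control set, and feature importances are read out.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

FEATURES = ["gc_content", "conservation", "dnase1_hs"]  # assumed columns
hgmd = pd.read_csv("hgmd_annotated.tsv", sep="\t")

for name in ("random", "tss_matched", "nearby"):
    controls = pd.read_csv(f"controls_{name}.tsv", sep="\t")
    X = pd.concat([hgmd, controls], ignore_index=True)[FEATURES]
    y = [1] * len(hgmd) + [0] * len(controls)  # 1 = HGMD (pathogenic)

    clf = RandomForestClassifier(n_estimators=100,
                                 class_weight="balanced", random_state=1)
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    clf.fit(X, y)
    ranked = sorted(zip(FEATURES, clf.feature_importances_),
                    key=lambda kv: -kv[1])
    print(f"{name}: cross-validated AUC={auc:.2f}; importances={ranked}")
```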
To make things easier for prospective users, the EBI and WTSI researchers have pre-computed scores for all known non-coding variants available in Ensembl (the software scores variants on a scale of 0 to 1, where a score near 1 indicates that the variant in question is likely pathogenic) and they make these available via GWAVA's website. Alternatively, it's possible to download the free code and run the software locally, a helpful option for those who have novel variants not yet included in the non-coding canon.
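For a sense of how those pre-computed scores might be used downstream, here is a hypothetical filtering sketch. The file name, column layout, and 0.5 cutoff are assumptions for illustration, not GWAVA's actual distribution format or a recommended threshold.

```python
# Sketch of filtering pre-computed scores, assuming they have been
# downloaded as a headerless tab-delimited table (an assumed format).
import pandas as pd

scores = pd.read_csv("gwava_scores.tsv", sep="\t",
                     names=["chrom", "pos", "rsid", "score"])

# Scores run from 0 to 1, with values near 1 suggesting pathogenicity;
# 0.5 is an arbitrary illustrative cutoff.
likely_functional = scores[scores["score"] > 0.5]
print(likely_functional.sort_values("score", ascending=False).head())
```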
Ritchie also said that his team is considering adding capabilities to GWAVA that would make it possible to score coding variants. Right now, researchers using GWAVA have to use a separate tool, such as PolyPhen, to score coding variants, so a unified framework that handles both tasks is appealing. However, it's not clear that this is the best approach, according to Ritchie. Part of the problem, he said, is that some information might be lost when a single framework scores both classes of variants, and if that's the case, it might be better to keep using separate software for each task. For now, he said, that's still "kind of an open question."
The CADD framework, meanwhile, provides a method of integrating diverse annotations into a single metric, a so-called C score, for each variant. That score measures "deleteriousness, a property that strongly correlates with both molecular functionality and pathogenicity," while avoiding the disadvantages and limitations of approaches that focus on either one alone, according to the Nature Genetics paper, which describes the framework's development and testing.
At its core, CADD is based on the evolutionary principle that harmful mutations are edged out of the gene pool over time via natural selection and, therefore, that variation that has not been selected against is less likely to be deleterious. Working off that idea, "we can … look at patterns of mutation events and ask what is it about a mutation event in terms of its annotation that might tell us whether it was selected against," Gregory Cooper, a HudsonAlpha faculty investigator and co-author on the Nature Genetics paper, explained to BioInform. To do that, he and his colleagues designed a simulator that they used to generate roughly 15 million mutation events based on known rules about how these events occur in real life. They annotated the simulated variants with 63 types of annotations, including conservation metrics, regulatory and transcript information, and protein-level scores, and then compared them to roughly 15 million variants that studies have shown are fixed in the human genome, annotated in the same way, looking specifically for mutations that appear in the simulated set but not in the observed variant dataset.
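A toy version of such a simulator helps illustrate the idea: mutation events are drawn along a reference sequence with weights that reflect known mutational biases, such as the elevated transition rate at CpG sites. The weights below are illustrative stand-ins, not the parameters of the paper's actual simulator.

```python
# Toy mutation simulator: sample positions along a reference sequence and
# draw alternate alleles, boosting transitions at CpG sites. The rates
# here are illustrative, not those used in the CADD paper.
import random

TRANSITIONS = {"A": "G", "G": "A", "C": "T", "T": "C"}
BASES = "ACGT"

def simulate(reference, n_events, cpg_boost=10.0, seed=1):
    rng = random.Random(seed)
    events = []
    while len(events) < n_events:
        i = rng.randrange(1, len(reference) - 1)
        ref = reference[i]
        is_cpg = (reference[i:i + 2] == "CG"
                  or reference[i - 1:i + 1] == "CG")
        alts = [b for b in BASES if b != ref]
        weights = []
        for alt in alts:
            w = 1.0
            if alt == TRANSITIONS[ref]:
                w *= 2.0          # transitions outnumber transversions
                if is_cpg:
                    w *= cpg_boost  # CpG deamination hotspot
            weights.append(w)
        alt = rng.choices(alts, weights=weights, k=1)[0]
        events.append((i, ref, alt))
    return events

print(simulate("ACGTTGCACGGA", n_events=5))
```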
The simulated dataset, Cooper explained, represents all the variants that would likely be present in the human genome — whether they had an ill effect or not — had the genome not been subjected to natural selection. For each annotation, CADD compares both sets of variants and asks how that particular annotation feature differs between the two datasets. So, for example, "in our set of 15 million simulated variants there is something like 8,000 stop codons, [but] when you look at the observed variants there's something like 100 real ones," he said. "That … deficit is likely the effect of natural selection," because "if there was no natural selection in humans we would expect to have observed about 8,000 stop codons." That difference "is the property we are going after," he said.
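That stop-codon deficit is easy to express as a worked example using the round numbers Cooper quoted:

```python
# The stop-codon deficit quoted above, as an observed/expected ratio:
# ~100 observed stop-gains against ~8,000 expected under the
# no-selection (simulated) model.
expected_stop_gains = 8_000   # in ~15M simulated variants
observed_stop_gains = 100     # in ~15M observed variants

ratio = observed_stop_gains / expected_stop_gains
print(f"observed/expected = {ratio:.4f}")                  # ~0.0125
print(f"deficit: ~{1 - ratio:.1%} of expected stop-gains are missing")
# Roughly 98.8% of the stop-gains expected without selection are absent;
# that depletion signal is the property CADD is built to capture.
```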
The researchers ran this comparison for each of the 63 annotations and then trained a support vector machine on the approximately 30 million simulated and observed variants, using features derived from the annotations, according to the paper. The SVM quantifies for each annotation "how strongly it separates the observed and simulated variants," Cooper said, and it computes for each variant a single score, based on all the relevant annotations, that represents how likely the variant is to be deleterious, though not necessarily pathogenic.
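A compact sketch of that training setup, using scikit-learn's linear SVM as a stand-in, might look like the following. The input files and feature columns are assumed, and at the paper's roughly 30-million-variant scale an out-of-core training strategy would be needed in practice.

```python
# Illustrative sketch: a linear SVM learns to separate simulated
# (proxy-deleterious) from observed (proxy-neutral) variants using
# annotation-derived features. Files and columns are assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

FEATURES = ["phastcons", "gerp", "dnase", "sift", "stop_gain"]

simulated = pd.read_csv("simulated_annotated.tsv", sep="\t")
observed = pd.read_csv("observed_annotated.tsv", sep="\t")

X = pd.concat([simulated, observed], ignore_index=True)[FEATURES]
y = [1] * len(simulated) + [0] * len(observed)  # 1 = proxy-deleterious

X_scaled = StandardScaler().fit_transform(X)
svm = LinearSVC(C=1.0).fit(X_scaled, y)

# The decision value plays the role of a raw deleteriousness score:
# larger values mean a variant looks more like the simulated class.
raw_scores = svm.decision_function(X_scaled)
print(raw_scores[:5])
```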
One of the benefits of this approach, he said, is that there is no limit to the number of new annotations that could be added to the system to improve the CADD scores, an important factor in its favor as projects like ENCODE continue to generate useful data for annotation. He and his colleagues are already on the lookout for more annotations to incorporate into the system, including ones that pertain to regulatory variants, where the current "paucity" of data "limits the development of better annotations, as well as our ability to validate predictions," the researchers wrote. Furthermore, "the one-stop nature of CADD is likely to be of great practical and conceptual value to future sequencing studies" because "it will minimize the scope and diversity of annotations that have to be generated, tracked, and evaluated by a laboratory or project and will reduce the need for ad hoc combinations of filters, scores, and parameters as is now routinely carried out."
Like their colleagues at EBI and WTSI, the HudsonAlpha and UW researchers provide pre-computed scores for all possible single-nucleotide mutations that could occur at every position in the genome, about 9 billion in total, so users don't have to repeat the development process described in the paper; for those who want to, however, the researchers have made their code freely available for download. There are no pre-computed scores for indels, since the list of possibilities would be far too large to generate up front. For those who need to score indels, the team has put together some code that they offer for that purpose, Cooper said.
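The roughly 9 billion figure follows directly from the fact that each of the genome's roughly 3 billion reference bases has three possible alternate alleles; a tiny generator makes the enumeration concrete:

```python
# Where ~9 billion comes from: every reference base (A, C, G, or T) has
# three possible alternate alleles. This enumerates all SNVs for a
# given reference sequence.
def all_possible_snvs(reference):
    for pos, ref in enumerate(reference):
        if ref not in "ACGT":   # skip Ns and other ambiguity codes
            continue
        for alt in "ACGT":
            if alt != ref:
                yield pos, ref, alt

print(list(all_possible_snvs("ACGT")))  # 4 positions x 3 alts = 12 SNVs
print(3_000_000_000 * 3)                # ~9 billion genome-wide
```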