Researchers at Cincinnati Children’s Hospital Medical Center and the University of Cincinnati Academic Health Center have developed a new database to help researchers keep pace with the growing number of SNPs in the public domain.
The database, called PolyDoms, integrates information from a number of SNP, protein, and pathway resources and includes precalculated predictions of coding SNPs that are likely to have an impact on human disease. The researchers expect that PolyDoms will be useful for prioritizing candidate SNPs for testing in large-scale association studies.
The database was originally developed to support the Comparative Mouse Genomics Centers Consortium, an effort funded by the National Institute of Environmental Health Sciences to develop knockout mouse models to help researchers study the effect of human polymorphisms on environmentally induced disease.
“We had to make decisions about what polymorphisms might be placed into mice to develop mouse models, and when you commit a certain polymorphism to making a mouse model, you’re going to invest a lot of money in it, so we really needed something that could integrate all the information available in a fair way,” Bruce Aronow, co-director of the Center for Computational Medicine at Cincinnati Children’s Hospital and a co-author on the paper, told BioInform this week.
“We started out with a need to prioritize SNPs that we knew existed in human populations that should be considered as candidate cancer risk genes,” he said.
But winnowing down the entire set of known polymorphisms — the current version of dbSNP, build 126, contains nearly 12 million reference SNPs — to a workable number for an experiment is a complex task for most researchers, Aronow said.
As an example, he noted that a researcher looking for polymorphisms associated with DNA damage repair would have to scan through multiple databases on genetic associations, and would then need to retrieve pathway information, disease association data, and other information such as mouse phenotypic data from numerous other resources.
“What we’ve done is kind of stitched together a data structure where you can navigate across all these concepts simultaneously and go shopping, and then prioritize everything that comes back in your shopping basket based on structural impact analyses,” he said.
Running the same analysis workflow that the Cincinnati team developed for PolyDoms might be possible for a single protein or a few polymorphisms, Aronow said, “but doing it for the whole genome — that’s prohibitively expensive computationally.”
The PolyDoms team ran its analyses on a 10-processor Itanium 2 cluster at the Ohio Supercomputer Center, and even then the computation took “weeks,” he said.
The key to the resource is its use of two homology-based algorithms — SIFT (Sort Intolerant from Tolerant) and PolyPhen — to predict the impact of non-synonymous SNPs on protein function. After running around 45,000 non-synonymous SNPs from dbSNP through the two algorithms, SIFT predicted around 14,800 to be “deleterious,” while PolyPhen predicted around 14,600 to be “damaging.” Around 9,000 were predicted to be both damaging and deleterious, indicating a concordance of around 60 percent between the two programs.
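The overlap arithmetic behind that figure can be sketched as follows. This is only an illustration using the rounded counts quoted above; the article does not say how the researchers defined concordance, so the sketch assumes it means the share of each program's flagged SNPs that the other program also flagged. The variable names are illustrative and not taken from PolyDoms.

```python
# Rounded counts as reported in the article (approximate values).
total_nssnps = 45_000        # non-synonymous SNPs run through both programs
sift_deleterious = 14_800    # flagged "deleterious" by SIFT
polyphen_damaging = 14_600   # flagged "damaging" by PolyPhen
both = 9_000                 # flagged by both programs

# Assumption: concordance = overlap as a share of each program's flagged set.
frac_of_sift = both / sift_deleterious
frac_of_polyphen = both / polyphen_damaging

# SNPs flagged by at least one program (inclusion-exclusion).
flagged_by_either = sift_deleterious + polyphen_damaging - both

# A stricter alternative measure: overlap divided by the union of both sets.
jaccard = both / flagged_by_either

print(f"Share of SIFT calls confirmed by PolyPhen: {frac_of_sift:.1%}")
print(f"Share of PolyPhen calls confirmed by SIFT: {frac_of_polyphen:.1%}")
print(f"Flagged by at least one program: {flagged_by_either}")
print(f"Overlap relative to the union of calls: {jaccard:.1%}")
```

Under this assumption both shares come out slightly above 60 percent, which is consistent with the rounded figure quoted. Measured against the union of calls, the agreement would look lower, so the definition matters when comparing predictors.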
“I think that if you have multiple indications of likely harmful impact you can be very sure that that amino acid is going to be debilitating to a particular area of the protein,” Aronow said, though he noted that “whether that area of the protein has a net effect of really changing the function of the protein in the context of its interactions and pathway participation, that’s another step that’s not quite handled by PolyDoms.”
PolyDoms provides a visualization tool that maps coding SNPs onto 3D protein structures and highlights non-synonymous SNPs that are potentially damaging or that have been previously reported as disease alleles. Users can query the database by protein, pathway, gene ontology term, disease term, or gene family.
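The idea of querying by one concept and then ranking the results by predicted impact can be sketched in a few lines. This is a hypothetical illustration, not PolyDoms' actual schema or interface: the `SnpRecord` type, its fields, the `prioritize` function, and all identifiers and values in the sample data are invented placeholders. The ranking rule (count how many of the two predictors flag a SNP) is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpRecord:
    """Hypothetical record; fields are placeholders, not the PolyDoms schema."""
    snp_id: str
    gene: str
    pathway: str
    sift_deleterious: bool
    polyphen_damaging: bool

    @property
    def support(self) -> int:
        # Number of predictors (0, 1, or 2) that flag this SNP as harmful.
        return int(self.sift_deleterious) + int(self.polyphen_damaging)


def prioritize(records, pathway):
    """Keep SNPs in the given pathway, most predictor support first."""
    hits = [r for r in records if r.pathway == pathway]
    # sorted() is stable, so ties keep their original input order.
    return sorted(hits, key=lambda r: r.support, reverse=True)


# Invented sample data for illustration only.
records = [
    SnpRecord("snp_example_1", "GENE_A", "DNA repair", False, True),
    SnpRecord("snp_example_2", "GENE_B", "DNA repair", True, True),
    SnpRecord("snp_example_3", "GENE_C", "Apoptosis", True, True),
    SnpRecord("snp_example_4", "GENE_D", "DNA repair", False, False),
]

ranked = prioritize(records, "DNA repair")
for r in ranked:
    print(r.snp_id, r.gene, r.support)
```

Ranking by the number of agreeing predictors mirrors Aronow's point that multiple indications of harmful impact give more confidence than a single one. A real resource would also weigh disease associations, structural context, and mouse phenotype data.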
Aronow described the database as an important, but early, step in integrating genotype with phenotype. “This is one of the first applications that’s ever incorporated mouse phenotype data from Jackson Labs back over into human orthologs to help predict the disease impact for humans,” he said. However, he noted, “phenotype is very, very hard. Disease is very difficult, so where you draw the line on what’s a normal phenotype variation versus true disease just gets harder and harder the more you look into medicine.”
Nevertheless, he noted that “many people in the field are coming up with creative ways of coming to terms” with the challenges of merging genetics and human health, “and I think the informatics is just getting to be more and more powerful to help to do this.”
Ultimately, Aronow said, PolyDoms represents one step toward his group’s long-term goal of modeling the effect of genetic perturbations on disease phenotypes. He described the database as enabling “knowledge representation and abstraction and functional annotation in an integrated way, which is a necessary precursor to useful dynamic systems modeling applied to disease. So this is a big step for us.”
A paper describing PolyDoms was published in this month’s Nucleic Acids Research database issue.