During last week's Intelligent Systems for Molecular Biology conference in Vienna, Austria, organizers of the Critical Assessment of Genome Interpretation project attempted to drum up interest in the community experiment by highlighting one of the CAGI challenges, which focused on predicting human phenotypes from genomic data.
As part of a special session at ISMB focused on the project, CAGI organizers discussed one of several CAGI 2010 challenges, in which participants were provided with genomic data from 10 individuals from the Personal Genome Project and had to predict each individual's probability of having any of 40 phenotypes. The list included binary traits such as asthma, Crohn's disease, absolute pitch, and tongue rolling, as well as numerical or quantitative traits such as birth weight and triglyceride levels.
However, since there was only a single entry for the PGP challenge in CAGI 2010 — from a research lab at Johns Hopkins University — CAGI's organizers are planning to include the PGP data in the CAGI 2011 challenge as well. Data for CAGI 2011, comprising seven separate challenges, was released in late June and submissions are due in September.
Researchers at the University of California, Berkeley, and the University of Maryland launched CAGI last year as a community experiment that aims to evaluate the effectiveness of computational methods used to make predictions about the impact of genomic variants on phenotypes (BI 11/12/2010).
Participants were expected to use a variety of methods to make computational predictions of phenotypes based on genotype data. A group of assessors then compared the predictions to the correct results for each of the datasets.
In addition to the PGP challenge, participants were expected to predict how amino acid mutations affect the function of an enzyme, estimate the probability of an individual with a given mutation being in a cancer or control cohort based on variations in two genes associated with breast cancer, and predict the effect of cancer rescue mutants (mutations that reactivate suppressed cancer-causing genes).
Sean Mooney, an associate professor and director of bioinformatics at the Buck Institute for Research on Aging who was tapped to assess submissions for the PGP phenotype prediction challenge, said during his ISMB presentation that human phenotype prediction is "one of the grand challenges of what we are trying to do with genome interpretation."
Although phenotypes are affected by both genetic and environmental factors, CAGI participants are exploring only the genetic component.
Traditionally, geneticists have used markers identified in genetic studies to predict phenotypes or to estimate the probability that a given individual has a particular phenotype.
However, Mooney said it isn't clear how accurate these methods are because the outcome depends on several factors, such as how much impact genetics has on a given trait relative to environmental factors.
"Here is the first opportunity we have ever had to actually take real genomes and assess how well [prediction] methods work and what methods work to try to identify what we might observe about the genetics of a person," he told BioInform.
He added that although the PGP data set is "relatively modest" — it comprises only 10 genomes and about 40 phenotypes and traits — "I think there is an opportunity here to grow as we start getting more and more genome sequence information."
Because the PGP datasets will be used in CAGI 2011, Mooney did not reveal the results of the CAGI 2010 assessment.
"I think now after it has been done successfully for one year ... we have done the right thing, which is to not release the data and hold the assessment again and see whether other teams submit, and I think there will be a lot more submissions," he said.
The single submission for the PGP challenge in CAGI 2010 came from the lab of Rachel Karchin, an assistant professor in JHU's biomedical engineering department.
Karchin's lab provided some details about its approach via video during an ISMB session.
To predict the probability of each individual having one or more of the phenotypes in the list, the team used a two-layered Bayesian network to build a probabilistic model.
Hannah Carter, a PhD student in Karchin's lab, explained that the first layer of the model contains the genes and the second layer holds the phenotypes or outcomes, with each edge between a gene and an outcome representing a causal relationship between the two variables.
Carter said that, due to time constraints, the team focused on two types of genetic variations in its final model: significant variants, which are associated with phenotypes on the list; and effective genes, which have variations that are "predicted to impair the activity of the gene's protein product."
The team then used this information to create models for each of the 10 genomes in the dataset to predict their individual phenotype probabilities.
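As a rough illustration of the two-layer structure Carter described, the Python sketch below links genes to phenotypes and computes a per-genome probability for each phenotype. It is not the Karchin lab's model: the noisy-OR rule used to combine gene effects, the function name `predict_phenotypes`, the gene labels, and all numeric values are assumptions chosen only to make the example concrete.

```python
def predict_phenotypes(prior, edges, affected_genes):
    """Return P(phenotype) for one genome in a two-layer gene -> phenotype network.

    prior: phenotype -> baseline probability with no affected gene (hypothetical).
    edges: (gene, phenotype) -> assumed probability that the affected gene alone
           produces the phenotype.
    affected_genes: genes carrying a relevant variant in this genome.

    Gene effects are combined with a noisy-OR rule, which is an assumption of
    this sketch and is not stated in the article.
    """
    result = {}
    for phenotype, baseline in prior.items():
        # Probability that no cause (baseline or any linked gene) is active.
        p_absent = 1.0 - baseline
        for (gene, target), strength in edges.items():
            if target == phenotype and gene in affected_genes:
                p_absent *= 1.0 - strength
        result[phenotype] = 1.0 - p_absent
    return result


# Hypothetical network: every name and number below is illustrative only.
prior = {"asthma": 0.08, "tongue rolling": 0.70}
edges = {
    ("GENE_A", "asthma"): 0.30,
    ("GENE_B", "asthma"): 0.20,
    ("GENE_C", "tongue rolling"): 0.10,
}

# One model per genome: here, a genome with variants in GENE_A and GENE_B.
genome_1 = {"GENE_A", "GENE_B"}
probabilities = predict_phenotypes(prior, edges, genome_1)
for phenotype, p in probabilities.items():
    print(f"{phenotype}: {p:.4f}")
```

In this sketch each additional affected gene linked to a phenotype can only raise its predicted probability above the baseline. A full Bayesian network would instead specify a conditional probability table for each phenotype.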
Although the group participated in other CAGI 2010 challenges, "this one is particularly interesting," Karchin told BioInform.
"There is a huge bottleneck right now between the ability to sequence thousands of genomes and exomes and the ability to interpret it biologically ... and make use of it clinically," she said.
She pointed out that although direct-to-consumer genomics companies such as Navigenics and 23andMe offer proprietary methods for predicting disease risk based largely on SNP genotypes, there are currently no computational methods for predicting human phenotypes from genomic data that are reliable enough to be used in the clinic.
"It's such a new thing to be able to look at human variation on the whole exome level. Whereas a few years ago human variation had only been studied in a limited number of genes ... this is a whole new world with a whole exome," she said. "Nobody knows yet how to interpret whole exome variation and it's really challenging so we [were] motivated to try to jump in and do something."