NEW YORK (GenomeWeb News) – Researchers from the University of Southern California and the University of California, Berkeley, have come up with a computational approach for translating publicly available gene expression data into a database for diagnosing disease.
The researchers verified the approach, described in a paper scheduled to appear online this week in the Proceedings of the National Academy of Sciences, using data from the NCBI's Gene Expression Omnibus, or GEO, representing more than 100 disease classes. And by developing a map of drug-disease interactions, they illustrated how a similar method can be used to tackle other complex phenotypes.
"As the first attempt to turn a public expression repository into an automated disease diagnosis database, this study provides an important application for the growing mass of costly yet freely available gene expression data," senior author Xianghong Jasmine Zhou, a molecular and computational biology researcher at the University of Southern California at Los Angeles, and her colleagues wrote.
The team was co-led by Haiyan Huang, a statistician at the University of California, Berkeley.
Gene expression studies are yielding a wealth of information on expression patterns in various cell types under a range of conditions, including disease states. But while much of this information is housed in public databases such as GEO, the researchers explained, gaps remain in routinely applying this information clinically.
"The majority of the vast amount of gene expression data in the GEO is related to disease studies. However, the scale of expression-based disease diagnosis is so far limited," Zhou told GenomeWeb Daily News in an e-mail message.
"[W]e think if we could overcome the expression/phenotype data heterogeneity to scale up the diagnosis, we can then transform the GEO database into a diagnosis database," she added, calling this a "significant and natural use of GEO."
In an attempt to do just that, the researchers came up with a two-stage probabilistic framework based on expression and phenotypic data. They then tested this approach using GEO expression data generated through 9,169 human microarray experiments related to 110 disease classes.
To standardize the data, the team compared disease and normal profiles to create dimensionless vectors that weren't tied to a particular platform. Because of this platform independence, they explained, the vectors allow the use of expression data generated by various approaches, including RNA-sequencing.
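The article does not say which statistic the team used to compare disease and normal profiles. As a rough illustration only, the sketch below assumes a per-gene Welch-style t-score, which is unitless because the measurement scale cancels between numerator and denominator. The function name and the toy expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_profile(disease, control):
    """Return a per-gene, unitless score comparing disease and control samples.

    disease, control: dict mapping gene name -> list of expression values
    (at least two values per gene, since stdev needs two data points).
    Assumption: a Welch-style t-score stands in for whatever statistic
    the published method actually uses.
    """
    profile = {}
    for gene in disease.keys() & control.keys():
        d, c = disease[gene], control[gene]
        # Standard error of the difference between the two group means.
        se = sqrt(stdev(d) ** 2 / len(d) + stdev(c) ** 2 / len(c))
        profile[gene] = (mean(d) - mean(c)) / se if se > 0 else 0.0
    return profile


# Toy, made-up values for two genes.
disease = {"DMD": [2.0, 2.5, 3.0], "ACTB": [10.0, 10.5, 9.5]}
control = {"DMD": [8.0, 8.5, 9.0], "ACTB": [10.0, 9.5, 10.5]}
profile = standardized_profile(disease, control)
```

Multiplying every measurement by a constant, as a change of platform units might, leaves the scores unchanged up to floating-point rounding, which is the sense in which such a vector is not tied to one platform's scale.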
"We can directly plug in the expression value provided by RNA-Seq into our system," Zhou noted. In the future, the researchers plan to expand the current model "to maximally benefit from RNA-Seq's high quality data," she added.
By posing the disease diagnosis question as a hierarchical multi-label classification problem, Zhou explained, they were able to bring together standardized gene expression and hierarchical disease information. Meanwhile, the team's two-stage learning approach involved independent Bayesian disease classifiers and a Bayesian network model.
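To make the "hierarchical multi-label" framing concrete, here is a minimal sketch under stated assumptions. A sample annotated with one disease class also belongs to every ancestor of that class, and scores from independent per-class classifiers may need to be reconciled with the hierarchy. The reconciliation shown, capping each class at its parent's score, is a deliberately simple stand-in and not the Bayesian network model the researchers describe. The three-level hierarchy and the scores are made up.

```python
def ancestors(label, parent):
    """Return every ancestor of `label`; `parent` maps child -> parent."""
    out = []
    while label in parent:
        label = parent[label]
        out.append(label)
    return out


def expand_labels(labels, parent):
    """A sample annotated with a class also carries all ancestor classes."""
    full = set(labels)
    for label in labels:
        full.update(ancestors(label, parent))
    return full


def make_consistent(scores, parent):
    """Cap each class score at its parent's (already capped) score.

    Assumes every class named in `parent` also has an entry in `scores`.
    This is a simple stand-in for a second-stage model, not the
    Bayesian network used in the study.
    """
    fixed = {}

    def resolve(label):
        if label not in fixed:
            score = scores[label]
            if label in parent:
                score = min(score, resolve(parent[label]))
            fixed[label] = score
        return fixed[label]

    for label in scores:
        resolve(label)
    return fixed


# Hypothetical hierarchy and first-stage scores.
parent = {
    "duchenne_muscular_dystrophy": "muscular_dystrophy",
    "muscular_dystrophy": "muscular_disease",
}
stage_one = {
    "muscular_disease": 0.9,
    "muscular_dystrophy": 0.6,
    "duchenne_muscular_dystrophy": 0.8,
}
stage_two = make_consistent(stage_one, parent)
```

In this toy case the specific class initially scores higher than its parent, which is incoherent under the hierarchy, so the second step lowers it to the parent's score.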
When the researchers tested the approach using expression data for specific conditions, including the muscle disease Duchenne muscular dystrophy, they found fairly good predictive power overall, though the precision and recall of the tool varied by disease class.
In general, the predictive value improved for disease classes with more gene expression data available, leading the team to suggest that "the predictive power of our system is expected to increase significantly as public gene expression repositories continue to grow."
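For readers unfamiliar with the two measures, precision is the share of a class's predictions that were correct, and recall is the share of a class's true cases that were found. The sketch below computes both per class for multi-label predictions. The function name and the toy label sets are illustrative and are not data from the study.

```python
def per_class_precision_recall(true_labels, predicted_labels):
    """Return {class: (precision, recall)} for multi-label predictions.

    true_labels, predicted_labels: equal-length lists of sets, one set
    of class names per sample. A measure is None when its denominator
    is zero (no predictions, or no true cases, for that class).
    """
    classes = set().union(*true_labels, *predicted_labels)
    results = {}
    for cls in classes:
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(1 for t, p in pairs if cls in t and cls in p)
        fp = sum(1 for t, p in pairs if cls not in t and cls in p)
        fn = sum(1 for t, p in pairs if cls in t and cls not in p)
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        results[cls] = (precision, recall)
    return results


# Toy example with two made-up classes and three samples.
truth = [{"a"}, {"a", "b"}, {"b"}]
predicted = [{"a"}, {"b"}, {"a", "b"}]
metrics = per_class_precision_recall(truth, predicted)
```

Here class "a" is predicted correctly once, predicted wrongly once, and missed once, giving precision and recall of 0.5 each, while class "b" is recovered perfectly.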
The researchers also integrated data from 1,248 drug-related queries to create a drug-disease connectivity map, uncovering 234 significant disease-drug connections. These included both new and previously reported interactions. For instance, the team detected an apparent interaction between the cancer drug doxorubicin and rheumatoid arthritis, consistent with a previously proposed role for the drug in treating arthritis.
Such connectivity maps may eventually inform studies looking for shared molecular features underlying disease phenotypes, the researchers noted. A similar approach, they added, may also prove useful for exploring the underpinnings of non-disease processes, such as stress response and differentiation.
Zhou said the team is currently working on improving the predictive power of their framework and testing whether it's possible to use absolute expression profiles, such as those generated by RNA-Seq, rather than relying on comparisons between disease and control profiles. They also plan to build an online disease diagnosis database that can be used by other researchers.