Several new tools and methods for phenotype analysis are emerging to facilitate systems biology ventures seeking to add an analytical quality to the centuries-old tradition of describing biology and disease by its physical traits.
Research teams in academia and industry in the US and Europe are building resources to mine and analyze phenotype data to better frame them in the context of genomics data and high-throughput experimentation. One group, led by Bayer Schering Pharma in Berlin, Germany, is creating an open access comparative phenomics computational tool for the publicly available PhenomicDB phenotype database. The new functionality will let scientists connect genotype and phenotype data across seven animal, eukaryotic microbial, and plant species.
Another group, a collaboration between scientists at the University of Chicago, Columbia University, and the Jackson Laboratory, is expanding a database called PhenoGO from two species to eleven. The database links phenotypes, gene products, and Gene Ontology terms. A paper describing the database is currently in press at BMC Bioinformatics.
In addition, a European project called Gen2Phen kicked off earlier this year with a budget of close to €12 million ($17 million) with the goal of developing technology and methods to standardize and integrate genomic and phenotypic knowledge on human and model organisms, with a particular focus on genetic variation databases [BioInform 02-01-08].
That effort includes, for example, creating data search and data presentation options to better link genes to phenotypes and diseases.
“We are watching how Gen2Phen will develop and to see whether it will help us in our venture,” Bayer Schering’s Bertram Weiss told BioInform. Weiss is a bioinformatician and senior scientist in the company’s global drug discovery team whose work focuses on target identification and validation.
Phenomics, he explained, is not just about returning to the days of Gregor Mendel’s observations, but coupling observational techniques, data, and computational tools to enable, for example, drug discovery.
“We have the impression that phenotype data are a neglected resource,” Weiss said, noting that this has not been a focus for bioinformatics. “Little or nothing has been done with this data and there was no good resource or database where you could look up all phenotypes for all genes.”
“Most people are doing high throughput at the molecular level, not at the phenotype level,” Yves Lussier, associate director for informatics at the University of Chicago Cancer Research Center and director of the Center for Biomedical Informatics, told BioInform. Comparative phenomics is a way of doing “forward genomics” to help with the evaluation of high-throughput experiments and “looking at all phenotypes and imputing the underlying mechanism.”
Originally, Lussier and his colleagues developed PhenoGO as a resource for human and mouse phenotypes. With their upcoming publication, the researchers will describe an expanded version of the database that now covers a total of eleven species and includes gene-disease specific annotations. The database contains over 600,000 phenotypic annotations, which have been derived through natural language processing and other computational methods from five GO annotation databases. It uses an expert system built in Prolog.
As a resource it is the first automated mapping of phenotypes to GO annotations, Lussier said. The researchers use BioMedLEE, an NLP engine co-developed by Lussier, to extract and encode genotype-phenotype relations from text. The software is an adaptation of a medical language processor called MedLEE used to extract and encode patient data.
The database also uses a system developed by Lussier called PhenOS, or Phenotype Organizer System, a computational terminology system that maps ontologies of different species and organizes phenotypes across heterogeneous datasets, bridging the gap between them.
“What our system does in high throughput, what it creates, is a specialized network for gene ontology and provides the phenotypic context,” Lussier said. Phenotype mining allows scientists to grasp the cellular, tissue-based, or organ-based context, as well as genomic context of a given phenotype.
As part of their study, the scientists evaluated 300 phenotypes and found a precision value of 85 percent, which Lussier called “near-human accuracy,” and a recall or sensitivity value of 75 percent. The recall rate means “that if something is mentioned we have three out of four chances to bring it back, which is pretty good, because we are doing all of GO and there could be many phenotypes for each gene ontology annotation of a gene,” he said.
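For readers less familiar with the two measures, the short Python sketch below shows how precision and recall are computed from counts of correct, spurious, and missed annotations. The counts are hypothetical and were chosen only so that the arithmetic reproduces the quoted percentages; they are not taken from the PhenoGO evaluation.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw annotation counts."""
    # Precision: share of the annotations the system produced that are correct.
    precision = true_pos / (true_pos + false_pos)
    # Recall (sensitivity): share of the annotations that exist that were found.
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts, picked only to match the 85 and 75 percent figures.
precision, recall = precision_recall(true_pos=255, false_pos=45, false_neg=85)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these illustrative counts, 255 of 300 returned annotations are correct (85 percent) and 255 of the 340 that should have been found are recovered (75 percent, or three out of four).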
PhenomicDB is taking a different approach to gathering and presenting phenotype data.
The development team, led by Bayer Schering’s Weiss and Philip Groth, a PhD student at Humboldt University’s Knowledge Management in Bioinformatics department, is currently creating a new tool for the database that will help researchers link phenotypes and genes across species.
“We have just built a prototype that is still pretty rudimentary but which can cluster [phenotype data] in this way,” said Weiss.
“We are thinking about placing it back in the public domain so that it can be included as a functionality for PhenomicDB,” he said.
Once the module is available, users will be able to enter a gene or a phenotype and with one click gain access to all similar phenotypes. “That kind of tool does not really exist yet, certainly not in a cross-species implementation,” Weiss said.
The team has published its method and validated it but does not have a publicly implemented version yet.
What it ultimately could lead to, said Weiss, is “you have a phenotype you are interested in, you click on a button that will probably be called PhenoSim, for similar phenotypes, and that yields all phenotypes that are similar to the one you entered.”
Another possibility, he said, is that the software would deliver “a graphic visualization of the results showing how close the phenotypes are.”
With PhenomicDB, Bayer Schering Pharma decided to create a public resource rather than just a proprietary one for the company’s own use. “We decided to do that, because these are public data and if we take data from the public domain, we wanted to return it to the public domain, perhaps in better shape,” Weiss said.
Weiss explained that he, Groth, and colleagues at bioinformatics company Metalife created PhenomicDB by gathering and semantically integrating phenotype data into a single database schema.
Michael Schönemann, Metalife’s CEO, is an entrepreneur on extended leave from his biomedical informatics post in the radiology division of the University of Freiburg’s medical school in Germany. He told BioInform he has founded, run, and sold a total of fifteen IT companies. “My advantage may be that I saw developments a bit earlier than others and was able to fill software needs in specialized market areas,” he said.
Schönemann founded Metalife in 2000 as a bioinformatics services shop. The company received start-up support from Unisys, Intel, and Microsoft, and venture capital groups he did not identify.
Metalife owns and runs PhenomicDB as its only open access public resource.
The database is financed by Metalife but was built on an idea from Weiss, who remains a scientific advisor and mentor. “We did all the software programming,” Schönemann said. His company continues to update the system and is implementing the new cross-species comparative phenomics functionality.
Metalife has been downloading all publicly available biomedical databases to create an integrated database, and PhenomicDB is essentially a small computational extract of that database, Schönemann explained.
PhenomicDB data is drawn from a wide variety of sources such as OMIM, FlyBase, the Mouse Genome Database, the Zebrafish Information Network, WormBase, and the Comprehensive Yeast Genome Database. The database is restricted to phenotypes for which there is a clear genotype-phenotype link with a clearly implicated gene locus. Although the database includes human data from OMIM, it does not include genome-wide association studies or other patient data.
The team mined textual descriptions of phenotypes leaving out any reference to genes but retained the reference to the gene associated with that phenotype. They stemmed the phenotype descriptions and created what they called an “adjusted” kind of phenotype description or phenodoc, said Groth. Using the vcluster algorithm from CLUTO v2.1.1, “the phenodocs were vectorized according to a method called TFIDF [term frequency, inverse document frequency], statistically analyzed, and clustered,” he explained.
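The general idea of vectorizing short text descriptions with TF-IDF and grouping the similar ones can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the phenodocs and gene names are invented, the TF-IDF weighting is one common variant rather than the team's exact formula, and a simple greedy threshold grouping stands in for CLUTO's vcluster program, which is not reproduced here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_clusters(vectors, threshold=0.2):
    """Put each vector in the first cluster whose seed is similar enough.

    A deliberately simple stand-in, not CLUTO's algorithm; the threshold
    is an arbitrary illustrative value.
    """
    clusters = []  # lists of document indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical, already-stemmed phenodocs keyed by made-up gene names.
phenodocs = {
    "geneA": "small eye retina degener",
    "geneB": "eye retina degener absent len",
    "geneC": "short tail kink vertebra",
    "geneD": "tail kink fuse vertebra",
}
genes = list(phenodocs)
vectors = tfidf_vectors([text.split() for text in phenodocs.values()])
clusters = [[genes[i] for i in group] for group in greedy_clusters(vectors)]
print(clusters)
```

On this toy input the two eye-related descriptions land in one group and the two tail-related descriptions in another, while each phenodoc stays tied to its gene, mirroring the way the team kept the gene reference alongside the text.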
Text clustering leads to groups of similar phenotypes. The scientists studied the clusters via protein-protein interactions and Gene Ontology terms, and looked at the co-occurrence of genes known to produce identical phenotypes, so-called phenocopies.
“Finding phenocopies in a cluster is a hint that the clustering method has led to a grouping of biological importance,” Groth said. A cluster of phenotypes may have similar genomic features or may just be phenocopies, similar phenotypes with differing underlying genetic mechanisms.
To figure out to what degree the computationally generated phenoclusters show biological relevance, the scientists analyzed the results and found, as they said in their paper, that their method is “a novel and promising way of finding relationships between genes with high biological coherence.”
“Applying other measures of similarity such as GO terms, which are not connected to phenotypes, we showed in a statistically significant fashion that a large proportion of the cluster is connected to genes that are linked biologically as part of the same pathway or protein-protein interaction networks,” Weiss said.
“That is our goal — to use clustering to bring together similar phenotypes in a group, to identify the associated genes, and to then find out which genes are shared, to see if they for example are part of a pathway which may have led to a similar phenotype,” said Weiss.
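One simple way to picture that last step is to take the genes behind a phenotype cluster and ask how much annotation they share. The Python sketch below does this with set operations. It is a minimal illustration, not the statistical test the team reported: the gene names and the term labels T1 to T3 are placeholders, and mean pairwise Jaccard overlap is an assumed stand-in for whatever coherence measure the published method uses.

```python
from itertools import combinations


def shared_terms(cluster, annotations):
    """Terms annotated to every gene in the cluster."""
    term_sets = [annotations[gene] for gene in cluster]
    return set.intersection(*term_sets) if term_sets else set()


def mean_pairwise_jaccard(cluster, annotations):
    """Average Jaccard overlap of annotation sets over all gene pairs."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for gene_a, gene_b in pairs:
        union = annotations[gene_a] | annotations[gene_b]
        if union:
            total += len(annotations[gene_a] & annotations[gene_b]) / len(union)
    return total / len(pairs)


# Placeholder annotations: T1..T3 stand for GO terms or pathway labels.
annotations = {
    "geneA": {"T1", "T2"},
    "geneB": {"T1", "T3"},
    "geneC": {"T1", "T2", "T3"},
}
cluster = ["geneA", "geneB", "geneC"]

common = shared_terms(cluster, annotations)
score = mean_pairwise_jaccard(cluster, annotations)
print(sorted(common), round(score, 3))
```

A cluster whose genes share many terms scores close to 1, while a cluster of unrelated genes scores close to 0. Judging whether a given score is better than chance would still require a comparison against randomly assembled clusters, which this sketch leaves out.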
Is it Druggable?
In drug discovery, comparative phenomics carries practical importance. “If you have a pathway with a molecular target of pharmacologic interest and in the known sections of the pathway, let’s say five proteins, none are druggable, then you might be interested to find other members of the pathway that no one has explored yet,” Weiss said.
This method has a scientific discovery aspect to it, allowing scientists to generate novel hypotheses, he said. “It isn’t perfect, it leads to false positives, and only around one fifth of all phenotype clusters show biological coherence,” he said. The other four fifths fall through the text mining gaps because the textual descriptions are too short for the text mining system to grab them.
The method is a bit like finding related articles in a PubMed search, said Weiss.
“You are looking for abstracts that more or less have the same words and that is pretty much what our method does.” The new PhenoSim functionality will expand the value of the database for users, Weiss said.
It will continue to use the same algorithm. “We were looking for a good, stable algorithm in public domain that is seen as valid.” CLUTO is “well-documented and validated and is in the public domain,” said Groth.
In terms of practical applications, Weiss outlined a simplified example. Knocking out individual players in a pathway may very well deliver the same phenotype in experiments. With text mining combined with phenotype data analysis, a drug discovery researcher could explore a pathway via similar phenotypes. “For some of those phenotypes the functions of the genes are still unknown and that might well be a great druggable target,” he said.
It was this type of quest that led him to work on PhenomicDB. For example, in contraception research where cell-based assays are lacking, he and his colleagues were seeking ways to mine phenotypic knowledge about model organisms.
RNAi technology has delivered the high-throughput method that scientists have been waiting for, he said, letting them knock out genes one at a time and study phenotypes. “It is a method that lets us create phenotypes at a much larger scale,” he said, and will necessitate more computational tools for comparative phenomics.
“In analyzing RNAi screens there is a tendency to pick out just one target of interest that is associated with a phenotype and the remaining thousands of phenotypes fall by the wayside and may not even be published,” Groth said. These data can be mined and put to good use, for example, in cross-species comparisons.
Weiss explained that he and his colleagues are still working on text-mining methods to better mine RNAi data, for example. Part of that challenge lies in the need for more ontologies, he said. “That is not to criticize the ones that already exist, but we need a better phenotype ontology and disease ontology and easier access to the data,” he said. A phenotype can be a numeric value, not only a text description, he noted.
Phenotypes can be the results from blood tests, or they can be behavioral descriptions, such as from open field experiments in mice, which leads to very heterogeneous data. “That makes it very difficult to integrate and that also limits the way methods can comb through these data types together.”
Lussier explained that he and his colleagues, too, are working on methods to allow PhenoGO to find RNAi data and miRNA data. “MicroRNAs are interfering with the translation of many genes at once and thus switching the characteristics of different cell types on and off.”
Framing the Facts
“Having a more phenotypically characterized context for genomic databases such as PhenoGO will drive a new era of systems biology, having more phenotypic hooks on which to hang the facts,” Lussier said.
PhenomicDB and PhenoGO are complementary, he said, though PhenomicDB is perhaps more geared for an audience of biologists. “We are targeting systems biologists and more computational people,” he said.
He admits that PhenoGO does not have the breadth of PhenomicDB. “But it has the depth in that one area that [PhenomicDB] doesn’t have.” The depth, he said, comes from using gene ontologies. “The advantage of PhenoGO over PhenomicDB is that you have cell type annotation over a large number of Gene Ontology annotations,” he said.
For example, the Foxn1 gene, which is used in cancer research, has 24 GO annotations, including impaired morphogenesis, limited T lymphocyte function, and lack of keratinization. “All of that in one cell type makes no sense, [but] at the gene level that is what we get. What I am doing, and what PhenomicDB cannot do, is to get more depth at the gene ontology level,” said Lussier.
However, he acknowledged, “the number of users [of PhenoGO] is probably a handful as compared with PhenomicDB.”