The European Molecular Biology Laboratory's European Bioinformatics Institute this week launched a new database designed to enable scientists to search and compare gene-expression data according to cell type, tissue, and disease conditions.
The database, called Gene Expression Atlas, evolved from EBI's ArrayExpress gene-expression data archive. Users of the Atlas will be able to search curated and annotated expression data covering about 5,000 experimental conditions in nine species, drawn from 1,000 datasets comprising a total of 30,000 assays collected since the late 1990s, according to Misha Kapushesky, Gene Expression Atlas project leader.
EBI, based in Hinxton, UK, has been collecting gene-expression data generated by microarray experiments since 1998, and in 2001, the institution decided to make the resulting archive publicly available as ArrayExpress. The Gene Expression Atlas is the culmination of a project to make that data more useful to researchers, Kapushesky said.
"We have a wealth of publicly available data and we realized about five years ago that we weren’t doing anything with it," Kapushesky told BioArray News this week. "We decided to repurpose this data for other goals. For researchers, you need to know where in the body genes are expressed and in which context," he said. "This is important in areas like drug targeting or fundamental research about regulation."
ArrayExpress, said Kapushesky, is "just an archive" that users can query on the type of experiment or the laboratory where it was performed or the platform that was used. "You can get these experiment-centric searches and download uncurated data in zip files," Kapushesky said. Such data is useful for experts, he said, but not for most biologists.
To create Gene Expression Atlas, EBI took a curated subset of the ArrayExpress data, reannotated it, sometimes by contacting those who had conducted the experiments, and ran data quality controls to enable users to ask "gene-centric questions."
The data was also mapped to an ontology, so that data generated from, say, brain samples can now be searched in the context of related studies based on ontological classification. Users can now query gene expression under a range of biological conditions, including different cell types, developmental stages, physiological states, phenotypes, and disease states.
"If you wanted, Gene Expression Atlas could show you the top five experiments where a particular gene was shown to be most active in liver," Kapushesky said. "The fact that the tool is gene-centric and can query on specific conditions, genes, pathways, and ontologies allows researchers to use any number of attributes to slice through the data."
While all of the data currently in the Atlas was generated on microarrays, he said that EBI is now preparing digital gene-expression datasets generated by second-generation sequencers that will become available later this year.
In addition to DGE data, EBI is considering adding microRNA expression profiling data. Efforts are also underway to bring proteomics and metabolomics data into the mix. According to Kapushesky, EBI is in discussions with the Human Proteome Atlas to integrate data into Gene Expression Atlas.
That project, hosted by the Swedish Royal Institute for Technology and Uppsala University, aims to have a complete map of the human proteome generated by 2014 (see BAN 5/20/2008).
Kapushesky said that EBI will update the data in the Gene Expression Atlas on a monthly or bimonthly basis. While DGE data should be included in the next release of the resource, he provided no date for when the metabolomics or proteomics data could become available. "It will happen, but they are doing a lot of data validation studies at the moment," Kapushesky said of EBI's dialog with the Human Proteome Atlas project. "There is an exchange of data going on and we are talking actively with them."
Kapushesky's group developed the Gene Expression Atlas using R and the Bioconductor statistical package. A curation team dealt with the curation and reannotation of the data, while a software-development team created a web interface for the resource using binary indexing and the Oracle database. While most users can access the database through this interface, Kapushesky said that bioinformaticists may be able to access the data via their own interfaces if necessary.
Aside from ArrayExpress, there are a number of databases that researchers currently use to survey archival gene-expression data. For example, Compendia Bioscience-hosted Oncomine enables users to query cancer-related expression data from nearly 30,000 array experiments; the Swiss Federal Institute of Technology Zurich hosts Genevestigator, which offers annotated but uncurated data from around 30,000 experiments; and the US National Center for Biotechnology Information's Gene Expression Omnibus catalogs array data by platform, sample, and study. Kapushesky said that Gene Expression Atlas' ability to provide ontological data might attract users who would otherwise use these existing tools.
'Roll Your Own'
Even before the Atlas officially launched this week, pharmaceutical companies had an opportunity to use it. Kapushesky said that the resource has been in alpha development for "some time," and that there have been several meetings and workshops at EBI where the institute's industry partners were introduced to the project.
Pfizer recently asked to use the Gene Expression Atlas in-house, and Kapushesky said that the pharmaceutical firm has been funding the creation of a standalone, downloadable version of the tool that would allow companies with large repositories of their own data to integrate that data into the current offering.
"You'll be able to take the software and roll your own," said Kapushesky of the Pfizer-funded work. "When that work is complete, the results … will also be available to the community. If you have your own collection of data, right now it is impossible to look at it in Gene Expression Atlas."
Kapushesky said the standalone version of Gene Expression Atlas should be available by the end of the year.
One reason why pharma researchers have taken an interest in the resource is that they can search the EBI data archives by biochemical compound responses. "You might want to know what genes are affected or what the response is to certain compounds," said Kapushesky. While the tool is not restricted for research in any particular area, those studying oncology will certainly find it useful, he added.
"Cancer certainly does come to mind, but anybody who is interested in transcription or knowing where part gene of interest is active transcriptionally should use this," he said. "Whether you are a bench biologist who gets new gene candidate and wants to know where it is active, or a bioinformatician working in an R&D department at a pharma company. You can have set of leads and use the atlas to find out if there are common transcriptional effects."