By Vivien Marx
This story was originally posted April 14.
A team led by researchers at the European Bioinformatics Institute has integrated and analyzed a massive microarray data set in order to map for the first time the global "expression space" of human gene activity.
The EBI researchers and colleagues from the University of Helsinki integrated array data from more than 5,300 human samples from 163 different labs, representing more than 350 cell and tissue types, disease states, and cell lines. Their analysis resulted in "a small number of distinct expression profile classes," the scientists wrote in a correspondence published last week in Nature Biotechnology.
Specifically, the researchers identified six distinct major groups, or "continents," of gene expression: brain; muscle; blood-related; healthy and tumor solid tissues; cell lines derived from solid tissues; and partially differentiated cells.
The results have been compiled in an online resource hosted at EBI that allows users to either search for a gene of interest and find the conditions in which it is over- or underexpressed, or find which genes are over- or underexpressed in a particular condition.
Alvis Brazma, senior team leader for microarray informatics at EBI, told BioInform this week that just as a continental map of the world will not help a user navigate the streets of a particular city, the global map of gene expression "will not help you very much to find the genes correlating with the survival rate of a particular disease."
However, he added, "it does show you how genes are expressed in various tissues affected by this disease, and how gene expression in various cell lines derived from these tissues are different from that in the primary tissues."
The map stands to help researchers trace "the main developmental expression states leading from an embryonic stem cell to each particular tissue," he said.
The main motivation for the study was to see what sort of insights might come out of the large-scale integration of array data. For example, the researchers were interested in determining whether gene expression in breast cancer is closer to that of normal breast tissue or to that of other cancers, and in seeing whether the global gene expression space is a continuum or contains "distinct pattern classes," Brazma said.
Brazma and his colleagues also wanted to provide a resource "where it would be easy to query for gene expression in various biological conditions," particularly if the "biological signal in the compiled dataset is sufficiently strong, which turned out to be the case."
Find Your Favorite
The resource currently holds raw data files from the National Center for Biotechnology Information's Gene Expression Omnibus and the EBI's ArrayExpress. Scientists can enter a gene name of interest or a biological state such as a disease, and the resource visualizes "how your 'favorite' gene expression varies in different conditions, or which genes are expressed in the selected condition," Brazma said. A scientist who discovers a gene expressed in a disease of interest might want to use the resource to see if it is also expressed in other diseases, for example.
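The two query directions described above can be illustrated with a minimal Python sketch. It assumes a small table of over- and underexpression calls; the gene names, conditions, and function names are hypothetical placeholders and do not reflect the actual EBI resource or its data.

```python
from collections import defaultdict

# Hypothetical (gene, condition, direction) calls, for illustration only.
# These are not values taken from the Human Gene Expression Map.
CALLS = [
    ("GENE_A", "leukemia", "over"),
    ("GENE_A", "muscle", "under"),
    ("GENE_B", "leukemia", "over"),
    ("GENE_B", "brain", "under"),
]


def build_indexes(calls):
    """Index the same calls twice so either query direction is a direct lookup."""
    by_gene = defaultdict(dict)
    by_condition = defaultdict(dict)
    for gene, condition, direction in calls:
        by_gene[gene][condition] = direction
        by_condition[condition][gene] = direction
    return by_gene, by_condition


BY_GENE, BY_CONDITION = build_indexes(CALLS)


def conditions_for_gene(gene):
    """Conditions in which a gene is over- or underexpressed."""
    return dict(BY_GENE.get(gene, {}))


def genes_for_condition(condition):
    """Genes that are over- or underexpressed in a condition."""
    return dict(BY_CONDITION.get(condition, {}))
```

Calling `conditions_for_gene("GENE_A")` corresponds to the "favorite gene" query, while `genes_for_condition("leukemia")` corresponds to starting from a biological state.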
According to the paper, the study shows that this kind of analysis of a large microarray data set compiled from many laboratories "can reveal the overall structure of gene expression space, which could not be observed in any of the contributing studies individually." For example, cell lines were found to cluster together rather than with their tissue of origin.
Since the data was generated by many different labs, the scientists sought to determine whether there was any impact from laboratory effects in the final results. By measuring the average similarity between assays in different labs within the same biological group, as well as between assays from the same lab on different biological groups, they found that the biological effects were "significantly" stronger than the effects from individual laboratories.
For example, Brazma said, all the blood-related cells are more similar to each other than to a muscle cell regardless of which lab generated the data. "However, the more we zoom in, the more role lab effects are apparently beginning to play," he said.
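The comparison of biological and laboratory effects can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' method: the story does not say which similarity measure was used, so Pearson correlation is assumed here, and the function name and the simulated data are hypothetical.

```python
import numpy as np


def mean_pairwise_similarity(expr, groups, labs, same_group, same_lab):
    """Average Pearson correlation over assay pairs with the given
    group/lab relationship. Pearson correlation is an assumption."""
    corr = np.corrcoef(expr)  # assays x assays correlation matrix
    groups = np.asarray(groups)
    labs = np.asarray(labs)
    group_equal = groups[:, None] == groups[None, :]
    lab_equal = labs[:, None] == labs[None, :]
    mask = (group_equal == same_group) & (lab_equal == same_lab)
    # Count each unordered pair once and exclude self-comparisons.
    mask &= np.triu(np.ones_like(mask, dtype=bool), k=1)
    return float(corr[mask].mean())


# Simulated data: a strong per-group profile plus a weaker per-lab offset.
rng = np.random.default_rng(0)
n_genes = 200
group_profiles = {g: rng.normal(size=n_genes) for g in ("blood", "muscle")}
lab_offsets = {l: 0.3 * rng.normal(size=n_genes) for l in ("lab_A", "lab_B")}

rows, groups, labs = [], [], []
for group, profile in group_profiles.items():
    for lab, offset in lab_offsets.items():
        for _ in range(3):  # three assays per group/lab combination
            rows.append(profile + offset + 0.2 * rng.normal(size=n_genes))
            groups.append(group)
            labs.append(lab)
expr = np.vstack(rows)

# Same biological group, different labs: reflects the biological effect.
biological = mean_pairwise_similarity(expr, groups, labs, True, False)
# Different biological groups, same lab: reflects the laboratory effect.
laboratory = mean_pairwise_similarity(expr, groups, labs, False, True)
print(f"biological: {biological:.2f}, laboratory: {laboratory:.2f}")
```

In this simulation the biological signal is deliberately made stronger than the lab offset, so the first average comes out higher; with real data, the size of the gap is what indicates which effect dominates.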
The authors note in the paper that it's possible that laboratory effects "are too strong to achieve resolution beyond the six major classes" ― particularly since hierarchical clustering did not conclusively reveal "finer structures," even though some specific groups, such as leukemias, did cluster together.
"What exactly is the limit to our resolution is difficult to say, since most laboratories deal with only particular classes of biological samples," Brazma said. The example of the leukemias is probably "close to the limit of our resolution," he said of the results.
Brazma likened the six classes of gene expression to the five major protein structure classes, a classification that has become "one of the most fundamental results of structural biology," he said.
"Some of the continents that we found were no surprise, even though, when I ask people to guess what the continents are during my talks, rarely anybody can guess this accurately," he said. For instance, it was no surprise that the largest divide is between the hematopoietic or blood-related system and the rest, and that the nervous system is rather different from the rest. "However it was not obvious that incompletely differentiated cells and connective tissues, such as bone marrow cells and fibroblasts, would have a 'continent' of their own, and what cells exactly would form that continent."
"I think that the particularly striking continent is how similar most cell lines are to each other and how different from their tissues of origin," Brazma said.
He explained that most of the data are from diseased tissues, since few people donate healthy tissue, which means "that the data may be somewhat skewed" toward disease states. "It's a bit like looking at the planet Earth from a distance in a telescope during a solar eclipse ― we can see only a part of it," he said.
While the global map of human gene expression revealed six major "continents," the researchers cautioned that more continents might emerge since the dataset is not complete and additional tissue types might reveal other transcriptional classes. In addition, they noted that "finer structures" likely exist within the six groups that they identified.
Groups and Meta-Groups
According to the study's supplementary material, the data in ArrayExpress had been annotated with the MGED Ontology (MO) categories cell line, cell type, organism part, disease state, and developmental stage, and these annotations were retained in the analysis. The GEO text annotations were converted to MO with the text-mining tool Whatizit and a custom dictionary, followed by manual curation.
Brazma emphasized the importance of metadata for this work ― "in particular the description of experimental variables and sample properties," for example the particular tissue type or disease state. These issues explain "why we limited our data source to ArrayExpress and GEO repositories," he said. These repositories have adopted Minimum Information About a Microarray Experiment requirements, "which made all the difference."
Brazma added that in doing this data integration and visualization, it was "a bit easier" for the team to use ArrayExpress "as we try to enforce more structured description of samples, while to use GEO we need to use text mining."
The data analysis, in fact, provided "valuable feedback on the metadata." When he and his team came across metadata that "did not make sense" or looked like a mis-annotation, they turned to the papers associated with the respective datasets and found that sometimes "metadata provided in ArrayExpress or GEO were not accurate and we were able to correct it."
In addition, the researchers needed to develop a new query interface for the Human Gene Expression Map.
The interface developed for the Gene Expression Atlas "wouldn't work for this," he said, because the dataset is "simply too big." For now, he said, the team had to develop a separate implementation for the Human Gene Expression Map, "but eventually it will simply be one, though rather distinct, page in the EBI's Gene Expression Atlas."
"At the moment we have only a beta version [of the query interface] for this dataset," Brazma said.
Although the gene expression map resource does not currently let scientists travel around their own datasets, Brazma said that all the annotated data are available for download from ArrayExpress.
Scientists "with sufficient expertise can do such integration with their own data in their labs," he said. Brazma added that he and his colleagues are exploring how "one might place one's own experiment into this expression space," as well as a standard way to provide such a service.