Molecular and Computational Biology, University of Southern California
Name: Xianghong Jasmine Zhou
Title: Assistant Professor, Molecular and Computational Biology, University of Southern California
Professional Background: 2003-present, assistant professor, molecular and computational biology, University of Southern California; 2001-2003 post-doc, Harvard University; 2000 post-doc, University of California, Los Angeles
Education: 2000 PhD, Swiss Federal Institute of Technology; 1995 Diploma, University of Tuebingen, Germany
Earlier this month, Xianghong Jasmine Zhou, an assistant professor of molecular and computational biology at the University of Southern California, received a $200,000 grant from the National Institutes of Health to continue work on software that will enable the cross-platform integration of microarray data.
The funding is the first installment of a five-year, $1 million commitment from NIH, and Zhou will use the cash injection to explore ways of integrating not only microarray data, regardless of platform, but also other kinds of high-throughput genomic data. It is all part of Zhou's effort to give researchers tools to get more out of existing data than they do at the moment.
To do that, Zhou's lab at USC has developed a free software package called iArray. Though it has been available on the lab's website since June 2005, Zhou said that a newer version is being prepared for launch later this year. Curious about the new funding and what iArray can do, BioArray News spoke with Xianghong Jasmine Zhou last week.
How long has your lab been in existence?
Since September 2003. I came here to USC in fall 2003 to take an assistant professorship. It's been almost two and a half years.
What have been your primary objectives?
Our primary objective is to develop computational methods for systems biology. In particular we develop algorithms and statistical methods to integrate the diverse types of high-throughput genomic data. For example, microarray data, sequence data, protein-protein interaction data, and protein-DNA interaction data (most high-throughput genomic data).
What kinds of challenges does your lab face in that regard?
When we integrate data, we want to ask particular biological questions. I think in this regard, identifying interesting biological questions is actually the most important and probably the most challenging part.
You recently received a $200,000 grant from the NIH to work on cross-platform microarray data integration. Is this the first time they've funded you? How much has NIH given you for this work in total?
Yes, this [is the first time they've funded us] and it is a five-year grant worth $1 million total.
What is your objective for that NIH grant?
My objective is to develop computational methods to integrate cross-platform microarray data. And actually that is currently the main focus of my lab. Although we are working on different types of high-throughput genomic data, the microarray data is at the center right now.
Where are you getting the microarray data?
We are working on public data … for example, [data from] the National Center for Biotechnology Information's GEO database, the Stanford microarray database, and the ArrayExpress database.
What are the main platforms that the data is coming from in those databases?
Most of them are Affymetrix data. Some of them, for example in the Stanford database, are cDNA microarray data. But using our approach we really don't care which platform we use; that's all fine.
But why is there a need for cross-platform data integration?
Right now, microarray gene expression profiling has been conducted in thousands of labs worldwide, which has resulted in rapid accumulation of microarray data in the public repositories. However, the reuse of the accumulated data is very low. Although the data is generated at high cost, mostly the data is only utilized once and only in one publication. And then people forget it. So this is a huge waste.
There is a great potential if we can integrate this data to perform system-level studies. Because each data set captures a series of snapshots of biological systems under a set of coherent perturbations, hundreds of data sets will provide you hundreds of sets of snapshots under different perturbations. If you can integrate those data sets together, it will give you the possibility to get deeper insight as to how the system works.
Also, we know that microarray data is very noisy. So if you get a signal from one data set, you may want to ask yourself, 'Is this signal or noise?' But if you integrate several data sets, and this signal repeatedly occurs, you know that this is a real signal. That is, by integrating different data sets we can enhance the signal-noise separation.
Why haven't commercial and academic entities been able to meet that need?
Actually, besides our lab, there have recently been quite a number of efforts to integrate cross-platform microarray data, mostly in academia. I think the reason why commercial companies haven't developed software yet is that the method has not been standardized. This probably creates difficulties for people if they only want to develop software but not methods.
But if a lot of this data is coming from certain platforms then wouldn't it encourage companies such as Affymetrix to create something that they can use to compare across labs?
We have developed our Integrative Array Analyzer, iArray in short, which we demoed at the Intelligent Systems for Molecular Biology meeting in June 2005, and it has gotten a lot of positive feedback. I have not contacted Affymetrix yet, but I think they are probably interested in using such a public resource, or telling their customers that such a resource exists.
When will the 'official' version of iArray be released?
The first beta version was released at ISMB last year in June, but I have not advertised it in the last half year just because we want to make it more robust before we really release it. A few days ago we submitted a paper about the software. The current software is relatively robust and can be freely downloaded from our web page [http://zhoulab.usc.edu].
What kind of tools in iArray will be available to users?
iArray is a data-mining and visualization software platform for the integrative analysis of multiple cross-platform microarray datasets. We employ a meta-analysis approach to first derive the expression pattern from each individual microarray dataset, then search for patterns frequently occurring across multiple datasets. Typical analyses include co-expression analysis, differential expression analysis, functional annotation, and transcription factor prediction.
How does iArray work?
As input, iArray can accept microarray datasets from any platform, as long as the data have been summarized into a matrix of normalized expression values.
Typical analyses in iArray include co-expression analysis and differential expression analysis. In co-expression analysis, we model each data set as a correlation graph, where one vertex represents a gene and two co-expressed genes are connected with an edge. Given k microarray data sets, we will derive k graphs, on which we identify recurrent subnetwork patterns. In differential expression analysis, we first identify genes differentially expressed in each microarray data set, and then use the frequent itemset mining algorithm to identify sets of genes simultaneously differentially expressed across multiple data sets. In addition, iArray can also be used to identify conserved expression patterns across different species.
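The two analyses described above can be illustrated with a minimal Python sketch. It is not iArray's implementation: the function names, the 0.9 correlation cutoff, and the support threshold are assumptions chosen for illustration, and counting recurrent edges and recurrent gene pairs is a much simpler stand-in for the recurrent subnetwork mining and frequent itemset mining that Zhou describes.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(vx * vy)


def coexpression_graph(dataset, threshold=0.9):
    """Edges of one correlation graph: gene pairs with |r| >= threshold.

    `dataset` maps a gene name to its normalized expression values.
    The 0.9 cutoff is an illustrative assumption.
    """
    edges = set()
    for g1, g2 in combinations(sorted(dataset), 2):
        if abs(pearson(dataset[g1], dataset[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


def recurrent_edges(datasets, min_support=2, threshold=0.9):
    """Edges that appear in at least `min_support` of the k graphs."""
    counts = Counter()
    for dataset in datasets:
        counts.update(coexpression_graph(dataset, threshold))
    return {edge for edge, c in counts.items() if c >= min_support}


def frequent_gene_sets(de_lists, min_support=2, size=2):
    """Gene sets of a fixed size that are differentially expressed
    together in at least `min_support` data sets (naive counting)."""
    counts = Counter()
    for genes in de_lists:
        counts.update(combinations(sorted(genes), size))
    return {s for s, c in counts.items() if c >= min_support}


# Three toy data sets with different numbers of arrays (made-up values).
datasets = [
    {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 1, 3, 2]},
    {"A": [5, 3, 1], "B": [10, 6, 2], "C": [1, 5, 2]},
    {"A": [1, 2, 3], "B": [3, 1, 2], "C": [1, 2, 3]},
]
# Differentially expressed genes per data set (made-up values).
de_lists = [{"A", "B", "D"}, {"A", "B"}, {"B", "D"}]

print(recurrent_edges(datasets))
print(frequent_gene_sets(de_lists))
```

Each data set is reduced to its own graph or gene list before anything is compared, so raw expression values from different platforms are never mixed directly. That is the property that lets the approach ignore which platform produced the data.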
Furthermore, we also implemented functional analysis and transcriptional analysis modules. The functional analysis module can be used to identify over-represented GO functional categories in any derived gene sets. Moreover, we integrated [the] Biocarta and KEGG pathway databases, so that users can identify biological pathways present in the given gene set. The transcriptional regulator annotation module is used to predict potential transcriptional regulators for any given gene cluster or any differentially expressed gene set.
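One common way to test whether a functional category is over-represented in a gene set is a one-sided hypergeometric test. The interview does not say which statistic iArray uses, so the sketch below is an assumption, and the function and parameter names are hypothetical. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes when drawing
    `sample` genes without replacement from `population` genes, of which
    `annotated` belong to the category of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 3 annotated to a category,
# and a derived gene set of 3 genes that all carry the annotation.
p = hypergeom_pvalue(population=10, annotated=3, sample=3, hits=3)
print(p)
```

In practice the test would be repeated for every category, and the resulting p-values would need a multiple-testing correction, which this sketch leaves out.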
All of this work is done in conjunction with our research work in the lab. iArray should serve as a vehicle for us to deliver our methods to biologists.
What do you think about recent efforts to standardize microarray experiments like those being undertaken by NIST or MAQC?
I think these are very useful efforts, and they should definitely reduce the systematic variation across different platforms. But it is not possible to completely avoid systematic variation between data sets. And actually, even within the same platform, within the same lab, and with the same protocol, different people do the experiments, and we find that the data is representative of the different people. It's not the biological nature of the data that creates these differences; it's who generated the data. So that means that systematic variation is beyond the platform.
In a paper published in Nature Biotechnology last year you used the term 'second-order gene expression analysis.' Can you give me a definition of what that means?
[In the paper] we define the first order of analysis as the extraction of patterns from one microarray dataset (that's what people normally do), and we propose that second-order expression analysis is a study of the correlated occurrences of those expression patterns across multiple data sets measured under different types of conditions.
For example, from one microarray data set you have identified expression cluster one, expression cluster two, and expression cluster three. And now you have 10 data sets. You want to know whether some expression clusters always co-occur in the same data sets. If the clusters one and two exhibit a correlated occurrence across the 10 data sets, then we would say that these two clusters have second-order correlation.
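The 10-data-set example can be sketched in a few lines of Python. The scores below are invented, and several things are assumptions made for illustration rather than details of the published method: summarizing each cluster's presence in a data set as a single number between 0 and 1, using Pearson correlation on those profiles, and the 0.9 cutoff.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def second_order_pairs(profiles, threshold=0.9):
    """Cluster pairs whose occurrence profiles across data sets correlate.

    `profiles` maps a cluster name to one score per data set, where a
    high score means the cluster is clearly present in that data set.
    """
    pairs = []
    for c1, c2 in combinations(sorted(profiles), 2):
        if pearson(profiles[c1], profiles[c2]) >= threshold:
            pairs.append((c1, c2))
    return pairs


# Hypothetical occurrence scores for three clusters over 10 data sets.
profiles = {
    "cluster1": [0.9, 0.8, 0.1, 0.2, 0.9, 0.1, 0.8, 0.2, 0.9, 0.1],
    "cluster2": [0.8, 0.9, 0.2, 0.1, 0.8, 0.2, 0.9, 0.1, 0.8, 0.2],
    "cluster3": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6, 0.5, 0.5, 0.4, 0.6],
}

print(second_order_pairs(profiles))
```

Clusters one and two rise and fall together across the 10 data sets, so they are reported as a second-order correlated pair even though nothing here compares their expression values directly.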
This method allows us to identify something we cannot identify using the standard first-order analysis methods. For example, we can identify genes of the same function, but without a co-expression pattern, and we can also study the cooperativity among transcription factors to reconstruct transcriptional regulatory networks.
Is this idea behind the software you have developed?
Actually it's not. The software so far has not incorporated second-order analysis, but we will do that in the future. Second-order analysis is, so far, very time consuming. But we will include a simple version that is not as time consuming [in the future].