NEW YORK (GenomeWeb) – Researchers from Harvard University and the Technical University of Munich recently published a paper in BMC Bioinformatics that describes a method for combining and jointly visualizing multiple omics datasets drawn from a variety of sources, and for identifying correlations among them.
According to the paper, the method, called multiple co-inertia analysis (MCIA), is based on a "covariance optimization criterion" that simultaneously projects several datasets such as gene expression and proteomics data into the same dimensional space, then transforms the diverse sets of features in the data onto the same scale.
Researchers can then analyze the disparate data types and extract information on trends that occur across the various datasets. The paper also includes a description of two studies that demonstrate the efficacy of the method. In the first, the researchers used the technique to analyze transcriptome and proteome data from the NCI-60 cancer cell line panel, and in the second they compared ovarian cancer transcriptome data generated on two different microarray platforms and a next-generation sequencer.
MCIA is an extension of co-inertia analysis, which, as described in a 2003 BMC Bioinformatics paper, is a "multivariate method that identifies trends or co-relationships in multiple datasets which contain the same samples." That paper was co-authored by Aedín Culhane, a research scientist in the Harvard School of Public Health's department of biostatistics and one of the authors on the current BMC Bioinformatics paper.
Co-inertia analysis is related to canonical correlation analysis (CCA), and both are essentially extensions of principal component analysis (PCA), Culhane explained to BioInform this week. In PCA, the goal is to find the linear combinations (principal components) that capture the maximum amount of variance in one dataset. In both co-inertia analysis and CCA, "what you are trying to do is project two datasets into the same space" and to do that, "the principal components should capture the variance that drives the co-structure between the datasets," she said.
The difference between the two is that with CCA "what you are trying to do is maximize the correlation … between the principal components, so you are asking what the most correlated trends are in these two datasets," Culhane said. With co-inertia analysis, she added, "you are actually trying to find what's the most co-variant trend between the two datasets." Multiple co-inertia analysis extends the abilities of the initial co-inertia analysis method to allow researchers to couple and identify correlations in more than two datasets at a time.
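To make the covariance criterion concrete, the following is a minimal Python/NumPy sketch and not the authors' implementation. It assumes two matrices with the same samples in rows, and it finds the pairs of axes with maximal covariance through a singular value decomposition of the cross-covariance matrix. The function names and the synthetic "genes" and "proteins" matrices are hypothetical, and the row and column weighting used in full co-inertia analysis is left out. CCA would differ by whitening each dataset first, so that correlation rather than covariance is maximized.

```python
import numpy as np


def center(a):
    """Column-center a samples-by-features matrix."""
    return a - a.mean(axis=0, keepdims=True)


def coinertia_axes(x, y, n_axes=2):
    """Unweighted sketch of co-inertia axes for two datasets.

    x is (n_samples, p) and y is (n_samples, q); rows must be the same
    samples in the same order, but p and q may differ.
    """
    xc, yc = center(x), center(y)
    n = xc.shape[0]
    cross_cov = xc.T @ yc / (n - 1)  # p x q cross-covariance matrix
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    u, v = u[:, :n_axes], vt[:n_axes].T
    # Sample scores of each dataset on its own axes, plus the
    # covariance captured by each pair of axes.
    return u, v, xc @ u, yc @ v, s[:n_axes]


# Hypothetical data: one hidden trend shared by 50 "genes" and 20 "proteins".
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 1))
genes = latent @ rng.normal(size=(1, 50)) + 0.5 * rng.normal(size=(30, 50))
proteins = latent @ rng.normal(size=(1, 20)) + 0.5 * rng.normal(size=(30, 20))

u, v, gene_scores, protein_scores, covs = coinertia_axes(genes, proteins)
print("covariance captured by axis 1:", covs[0])
print("correlation of axis-1 scores:",
      np.corrcoef(gene_scores[:, 0], protein_scores[:, 0])[0, 1])
```

The covariance between the two sets of first-axis scores equals the leading singular value, which is the quantity this simplified version maximizes. Note that the two matrices share only their samples; no gene-to-protein mapping is needed.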
The underlying co-inertia analysis methodology was first used in ecological studies where researchers used it to look for links between environmental variables and species characteristics. Drawing from the applications in the ecological domain, Culhane and other colleagues, including Amin Moghaddas Gholami, a bioinformatics group leader at TUM and one of the co-authors on the MCIA paper, set about adapting co-inertia analysis to work for genetic data, and they have successfully applied it in a range of studies across different data types.
In the aforementioned 2003 paper, the first to demonstrate the application of co-inertia analysis to genomics data, the researchers used the method to analyze gene expression datasets from the NCI-60 panel that were generated on two separate microarray platforms. In two other earlier studies that Culhane was involved in, she and colleagues worked on combining proteomics data and gene expression information, and on integrating transcription factor binding site information with gene expression datasets.
Studies done by Gholami that use the method include one that was published in 2010 in Bioinformatics and used several techniques to analyze microarray data. Specifically, the authors used co-inertia analysis, back-transformation, and Hungarian matching to find co-structure in datasets where the samples are not matched. Gholami's team also used co-inertia analysis to integrate and explore proteome and transcriptome data from the NCI-60 panel. Details of that study were published in a Cell Reports paper last year.
MCIA, meanwhile, has been used in at least one other study focused on identifying outlier genes and species in phylogenomics data. One of the benefits of the method compared to traditional approaches is that "you don't have to pre-filter or pre-map all of the variables onto the genome," before trying to find correlations in the data, Culhane said. This helps researchers avoid difficulties associated with merging annotations from different technologies, which may have different probes and detection methods. It also means that "you can basically take any number of variables and they don’t have to match, … you can analyze all the variables and don't have to restrict yourself to just analyzing the intersection of variables from the datasets," she said.
Furthermore, since the data ultimately inhabits the same space and is changed to the same scale, "you can now take the coordinates for all the genes [and] proteins or other variables that have been integrated and transformed onto the same scale, and put them into another analysis" revealing new biological insights that might not have been obvious from analyzing the data in isolation, she said. For example, the third figure in the paper shows how an integrated analysis of proteomics and microarray data from cancer cells using MCIA highlighted an important signaling pathway involved in the cancer, an observation that might not have been made if the datasets had been analyzed separately, according to Culhane.
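The idea of placing unmatched variables from several datasets in one shared space, on one scale, can be illustrated with a short Python/NumPy sketch. This is a deliberately simpler stand-in and not MCIA itself: each dataset is centered, scaled to unit total inertia so that no block dominates, concatenated along the feature axis, and decomposed with a single SVD (an approach in the spirit of consensus PCA or multiple factor analysis). The function name and the three synthetic blocks are hypothetical.

```python
import numpy as np


def joint_projection(blocks, n_axes=2):
    """Project several samples-by-features blocks into one shared space.

    Illustrative stand-in only, not the MCIA algorithm. All blocks must
    have the same samples in rows; feature counts may differ.
    """
    scaled = []
    for block in blocks:
        centered = block - block.mean(axis=0, keepdims=True)
        # Divide by the Frobenius norm so each block has total inertia 1.
        scaled.append(centered / np.linalg.norm(centered))
    stacked = np.hstack(scaled)  # samples x (sum of feature counts)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    sample_scores = u[:, :n_axes] * s[:n_axes]
    feature_coords = vt[:n_axes].T * s[:n_axes]
    # Split the feature coordinates back into one array per block.
    split_points = np.cumsum([b.shape[1] for b in blocks])[:-1]
    return sample_scores, np.split(feature_coords, split_points, axis=0)


# Hypothetical data: three blocks measured on the same 30 samples.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 1))
blocks = [
    latent @ rng.normal(size=(1, p)) + 0.5 * rng.normal(size=(30, p))
    for p in (50, 20, 35)
]

sample_scores, feature_coords = joint_projection(blocks)
for name, coords in zip(("block A", "block B", "block C"), feature_coords):
    print(name, "feature coordinates:", coords.shape)
```

Because every feature from every block receives coordinates on the same axes, those coordinates can be passed together to a downstream step such as clustering or pathway scoring, which is the kind of follow-on analysis Culhane describes.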
In emailed comments, Gholami highlighted the challenge of making sense of the large quantities of omic data currently available to researchers. "We need to make big data look little and for that we need common features and ways to present data to scientists with different skills and backgrounds," he told BioInform. "MCIA does that all. It shrinks data all in one place, [highlights] interesting correlations … and [lets users] visualize the data in a simple 2-3 dimensional space that facilitates data interpretation."
For their next steps, Culhane said, the researchers will work on new ways to both integrate data and better identify relevant pathways and gene expression signatures. The current incarnation of the MCIA method is available in the omicade4 package on the R/Bioconductor website.