“Garbage in, garbage out.” Every IT guy and gal knows exactly what this means: feed bad data into your program, and you’ll get bad answers out.
Systems biology turns this maxim on its head: “Garbage trucks in, gourmet dinner out!” If you feed truckloads of bad data into your program, from multiple kinds of goofy experiments, good answers will come out. Huh? The intuition is that errors act like noise and cancel out, while true observations reinforce each other. No law of nature demands this to be true, but amazingly, it seems to work. At least a little.
In this article, I’ll cook up an example using two of the smelliest data sources out there: microarrays and protein-protein interactions.
A couple of early papers (published in 2001) do simple analyses in yeast that make the point succinctly.
One, by Andrei Grigoriev, starts by calculating the similarity of all pairs of expression profiles in a microarray dataset and then asks whether the profiles are more similar for genes whose protein products interact. As I’m sure you’ve guessed, the answer is yes (or else there’d be no story to tell). Non-interacting pairs have an average correlation of .03; the value for interactors ranges from .07 to .20 depending on the source of the data. Looks good, but bear in mind that even the best correlation (.20) is not exactly earth-shaking.
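The core of this test is tiny. Here's a sketch in Python (the column's own analysis was done differently, and the numbers below are toy data, not Grigoriev's): compute the gene-by-gene correlation matrix, then average the entries for whichever set of pairs you care about.

```python
import numpy as np

def mean_pair_correlation(profiles, pairs):
    """Average Pearson correlation over the given (row, row) gene pairs."""
    r = np.corrcoef(profiles)  # gene-by-gene correlation matrix
    return float(np.mean([r[i, j] for i, j in pairs]))

# Toy data: 4 genes x 5 conditions (invented numbers)
profiles = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # gene 0
    [2.0, 3.0, 4.0, 5.0, 6.0],   # gene 1: tracks gene 0 exactly
    [5.0, 4.0, 3.0, 2.0, 1.0],   # gene 2: mirror image of gene 0
    [1.0, 3.0, 2.0, 5.0, 4.0],   # gene 3
])
interacting = [(0, 1)]
non_interacting = [(0, 2), (2, 3)]
```

With real data you'd feed in the full microarray matrix and the observed interaction pairs, and compare the two averages.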
The second paper, by Hui Ge et al (senior authors George Church and Marc Vidal), starts by doing k-means clustering of the expression profiles and then asks whether interacting pairs are more likely to land in the same cluster. Again, the answer is yes. And again, the ratio is impressive, but the absolute numbers are not. Interacting pairs are about six times more likely to land in the same cluster than would be expected by chance, but most pairs (553/670 ≈ 83%) still span multiple clusters.
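The enrichment calculation is simple once you have cluster labels and an interaction list. A Python sketch with made-up labels (the clustering step itself is omitted): divide the observed same-cluster fraction by what you'd expect if pairs fell into clusters at random.

```python
from collections import Counter

def same_cluster_enrichment(labels, pairs):
    """Observed fraction of pairs sharing a cluster, divided by the chance expectation."""
    observed = sum(labels[a] == labels[b] for a, b in pairs) / len(pairs)
    n = len(labels)
    # chance that two distinct genes drawn at random land in the same cluster
    expected = sum(s * (s - 1) for s in Counter(labels).values()) / (n * (n - 1))
    return observed / expected

# Toy example: 6 genes in 2 clusters, 3 "interacting" pairs (labels are invented)
labels = [0, 0, 0, 1, 1, 1]
pairs = [(0, 1), (3, 4), (0, 3)]
ratio = same_cluster_enrichment(labels, pairs)
```

A ratio well above 1 is the "six times more likely than chance" kind of result the paper reports.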
The basic idea is so simple that I decided to whip up some quick software and try it out on human data. I got my microarray data from the Genomics Institute of the Novartis Research Foundation's SymAtlas, a new, improved version of the website I discussed in May. For protein-protein interactions, I used HPRD (discussed in March); more specifically, a parsed version of HPRD kindly provided by my colleague Eric Deutsch.
I filtered both datasets to include only elements that could be linked to a LocusLink entry and merged expression profiles that mapped to the same LocusLink. This left about 13,000 expression profiles and 5,000 interactions involving about 2,500 proteins.
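The filter-and-merge step might look like this in Python (probe and locus IDs below are invented, and averaging merged profiles is my assumption; the column doesn't say how it combined them):

```python
import numpy as np

def merge_by_locus(profile_ids, profiles, probe_to_locus):
    """Drop probes with no LocusLink mapping; average profiles that share a locus.
    (Averaging is an assumption -- other merge rules are possible.)"""
    by_locus = {}
    for pid, prof in zip(profile_ids, profiles):
        locus = probe_to_locus.get(pid)
        if locus is None:
            continue  # unmapped probe: filtered out
        by_locus.setdefault(locus, []).append(prof)
    return {locus: np.mean(profs, axis=0) for locus, profs in by_locus.items()}

# Invented probe-to-locus mapping; probe "p4" has no LocusLink entry
probe_to_locus = {"p1": "L1", "p2": "L1", "p3": "L2"}
merged = merge_by_locus(
    ["p1", "p2", "p3", "p4"],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
    probe_to_locus,
)
```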
Next I used R to compute pairwise correlations over the expression profiles. This was simple (it's a one-liner in R) and quick. But be forewarned: you need enough memory to store the entire correlation matrix. For 13,000 profiles, this comes to 169 million cells, consuming about a gig of memory.
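In R the whole computation is cor(t(m)); the NumPy equivalent is just as short, and the memory arithmetic is easy to check (the small random matrix here is a stand-in, not real expression data):

```python
import numpy as np

# Memory for the full correlation matrix: n^2 cells of 8-byte doubles
n = 13_000
bytes_needed = n * n * 8          # 169 million cells -> ~1.35 GB
print(f"{bytes_needed / 1e9:.2f} GB")

# The computation itself is one call: cor(t(m)) in R, np.corrcoef here.
# Toy matrix: 100 genes x 20 conditions of random stand-in data
rng = np.random.default_rng(1)
m = rng.normal(size=(100, 20))
r = np.corrcoef(m)                # 100 x 100, symmetric, ones on the diagonal
```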
Not wanting to limit myself to direct interactions, I converted the interaction data into a graph and computed the distances between each pair of proteins (see April). Distance calculation is a standard, but computationally expensive, graph method. A good graph package will offer several different algorithms so you can pick the one that works best for your particular graph.
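For an unweighted interaction graph, plain breadth-first search from each node is one reasonable choice of algorithm (I'm not claiming it's the one used here); a self-contained sketch with invented protein names:

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist  # unreachable nodes are simply absent

# Toy interaction graph: A-B-C-D chain, E unconnected
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "E": []}
```

Running this from every node gives all pairwise distances; for 2,500 proteins that's 2,500 BFS passes, which is cheap for a graph this sparse.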
Generalizing the Grigoriev paper slightly, I compared the median expression correlations for each distance. For directly connected nodes, the median is .17. This drops to .08 at distance 2, almost 0 at distance 6, then climbs to .02 at distance 12 and .03 for unconnected nodes. As in the published work, the correlations are strongest for nodes that are close, but are not terribly high anywhere. As for the dip and climb, I have no clue.
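Tabulating medians by distance is a one-pass grouping job; a Python sketch with toy numbers (not the real results):

```python
import statistics

def median_correlation_by_distance(distances, correlations):
    """Group pair correlations by graph distance; return the median of each group."""
    groups = {}
    for d, r in zip(distances, correlations):
        groups.setdefault(d, []).append(r)
    return {d: statistics.median(rs) for d, rs in sorted(groups.items())}

# Toy inputs: one correlation per protein pair, keyed by that pair's graph distance
summary = median_correlation_by_distance(
    [1, 1, 2, 2, 2],
    [0.1, 0.3, 0.0, 0.1, 0.2],
)
```

Feeding in the real distance and correlation lists yields the .17 / .08 / ... table described above.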
I’m convinced there’s something here. Using even simple methods, you can cook the garbage into edible gruel. New methods are being published all the time. With better methods and more data, it won’t be long before the gruel turns into dinner.
Nat Goodman, PhD, is a senior research scientist at the Institute for Systems Biology and is co-founder of HD Drug Works, which tests treatments for Huntington’s Disease. Send your comments to Nat at [email protected]