Hairballs: How to Deal with Dataset Detritus
What do cats and biologists have in common? They purr sweetly when they want something? C’mon, get real! They demand to be heard and refuse to be herded? Getting warm. Hint: it has something to do with hacking up hairballs.
Biologists have gotten very good at spitting out datasets that show connections between proteins, genes, phenotypes, and anything else they’ve licked up. These datasets are like giant hairballs — jumbles of cross connections made by experimental methods that are very good at finding false positives.
A new generation of software tools is being developed to clean up these messes, find the good connections, and sweep the hairballs into the dustbin.
Graphs are great
A graph, as I’ll use the term here, is a mathematical diagram consisting of dots and lines connecting them. Think of a child’s connect-the-dots sketch book. In the jargon of the field, the dots are nodes, and the lines are edges.
A graph is a natural way to represent protein-protein interaction data. Nodes represent proteins and edges indicate which proteins interact — simple and sweet.
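For readers who think better in code, here is a minimal Python sketch of that idea. It stores the graph as a map from each protein to the set of proteins it interacts with. The protein names and the function name are made up for illustration; neither comes from any real dataset or package.

```python
from collections import defaultdict


def build_undirected_graph(interactions):
    """Return an adjacency map: protein -> set of interacting proteins."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)  # undirected: record the edge from both ends
    return dict(graph)


# Hypothetical protein names, for illustration only.
interactions = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
ppi = build_undirected_graph(interactions)
print(sorted(ppi["P3"]))  # the proteins that interact with P3
```

Using sets means a duplicated interaction in the input collapses into a single edge, which is usually what you want for this kind of data.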
Graphs can represent other kinds of interaction data, too. Suppose you have data telling which proteins regulate which genes. You can make a graph where each node represents both a gene and its protein product (this is biologically sloppy — but close enough), and an edge from node A to node B means that protein A regulates gene B. In this case, the edges have a direction (from A to B) and are drawn as arrows.
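A directed graph can be sketched in Python by recording each edge at its source only. This is an illustration under assumed names (the gene labels and the function are hypothetical), not the format any particular tool uses.

```python
from collections import defaultdict


def build_regulatory_graph(regulations):
    """Map each regulator to the set of genes it regulates (directed edges)."""
    targets = defaultdict(set)
    for regulator, gene in regulations:
        targets[regulator].add(gene)  # the arrow points regulator -> gene only
    return dict(targets)


# Hypothetical regulator/target pairs, for illustration only.
regulations = [("A", "B"), ("A", "C"), ("B", "C")]
reg = build_regulatory_graph(regulations)
print(sorted(reg["A"]))  # genes regulated by A
print("C" in reg)        # C regulates nothing here, so it has no outgoing arrows
```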
Suppose you have gene expression data from knockout experiments that tell you which genes’ expression profiles are affected by each knockout. You can make a graph where the nodes are the genes and an edge from gene A to gene B means that knocking out A affects the expression of B.
Suppose you have gene expression data and have calculated correlations between expression profiles for a bunch of genes. The nodes can be the genes and the edges can tell which genes have highly correlated profiles.
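As a rough sketch of how such a graph might be built, the Python below computes Pearson correlations between every pair of profiles and keeps an edge when the correlation reaches a cutoff. The gene names, the profile values, and the 0.9 threshold are all assumptions chosen for illustration, and the code assumes no profile is constant (a constant profile would cause a division by zero).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_edges(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            edges.append((g1, g2))
    return edges


# Made-up expression profiles, for illustration only.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = correlation_edges(profiles)
print(edges)
```

This version keeps only positively correlated pairs. Comparing the absolute value of the correlation to the threshold instead would also link strongly anticorrelated genes.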
Suppose you’ve calculated sequence similarity for a group of sequences. The nodes can be the sequences and the edges can tell which are highly similar.
The beauty of graphs is that they can handle all kinds of pairwise data: physical interaction, genetic interaction, correlation, similarity, you name it. And, not surprisingly, you can combine multiple kinds of data in a single graph and get an overarching view of the many ways your genes and proteins are related.
The hairball effect
Drawing a graph is a hairy problem, especially when a computer is the artist. Suppose you have 100 proteins with 10 interactions each. That’s a diagram with 100 nodes and 500 edges. The poor computer has to figure out where to place each node on the screen so that the edges between them don’t get too tangled. Unless the interactions are very orderly (which doesn’t happen in biology), the drawing will be a jumble of dots and lines.
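The edge count comes out at 500 rather than 1,000 because each interaction is shared by two proteins. A tiny Python sketch of the arithmetic, using a hypothetical helper name:

```python
def edge_count(num_nodes, interactions_per_node):
    """Number of edges when every node has the same number of interactions."""
    total_endpoints = num_nodes * interactions_per_node
    return total_endpoints // 2  # each edge has two endpoints


print(edge_count(100, 10))  # 500
```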
The more data you pour into the graph, the worse the picture gets. It’s a dilemma worthy of Garfield: graphs are great at representing multiple kinds of data; but if you take advantage of this and put lots of data into your graph, you’ll end up with a mess that’s too complicated to draw — an in silico hairball.
To have any hope of working with biological graphs, you need software that can automatically draw them. This is a well-studied, hard problem in computer science, called graph layout.
Two full-featured academic packages for working with biological graphs are Osprey from Mount Sinai Hospital at the University of Toronto and Cytoscape from the Institute for Systems Biology, UC San Diego, and Memorial Sloan-Kettering Cancer Center. [Disclosure statement: Cytoscape is a major effort at ISB where I work. I’m not personally involved in Cytoscape development.]
Cytoscape and Osprey operate in a similar manner. You provide files that define your graph. The program computes a layout and draws the graph on your screen. You then interact with the graph by mouse or dialog box. You can select nodes by clicking on them or dragging over them in the usual way, or by querying based on name or properties such as GO annotations. You can also select nodes based on graph properties — for example, selecting nodes that have at least two neighbors (which has the effect of pruning dangling branches from the graph and leaving just the well-connected core). Both programs are Web savvy and comfortably grab data from external databases as needed.
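The neighbor-count selection is easy to sketch in Python. This is only an illustration of the idea on a made-up graph; the function names are hypothetical and it is not how either program implements the feature.

```python
def nodes_with_min_neighbors(graph, min_neighbors=2):
    """Select nodes that have at least `min_neighbors` neighbors."""
    return {node for node, nbrs in graph.items() if len(nbrs) >= min_neighbors}


def induced_subgraph(graph, keep):
    """Keep only the selected nodes and the edges that run between them."""
    return {node: graph[node] & keep for node in keep}


# A triangle (A, B, C) with one dangling node (D) hanging off C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
core = nodes_with_min_neighbors(graph)
sub = induced_subgraph(graph, core)
print(sorted(core))
```

One pass removes only the tips of dangling branches. A branch several nodes long would need the selection repeated on the resulting subgraph until nothing more drops out.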
Cytoscape features a plug-in architecture that allows programmers to extend the base program with specialized analysis methods. The tutorial provided on the Cytoscape website includes a simple gene expression plug-in that does standard profile searching. This is, at best, a cute demo. Better gene expression software is available from many sources. I am told that more sophisticated plug-ins are used internally at ISB and will be made public soon.
Get the picture
Osprey’s default layout algorithm is a circular method. It places nodes on the rim of an imaginary circle, grouping nodes based on their GO process annotations so that genes involved in the same biological process are near each other on the circle. Edges cut across the circle, but edges between nodes involved in the same process hug the rim, since these nodes are close to each other.
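The general idea of a grouped circular layout can be sketched in a few lines of Python. This is not Osprey's actual algorithm, just a minimal version under my own assumptions: each node has exactly one category label, nodes are sorted by category, and they are spaced evenly around the rim. The gene names and categories are invented.

```python
from math import cos, pi, sin


def circular_layout(category_of, radius=1.0):
    """Place nodes evenly on a circle, with same-category nodes adjacent.

    category_of maps node -> category label; returns node -> (x, y).
    """
    ordered = sorted(category_of, key=lambda node: (category_of[node], node))
    step = 2 * pi / len(ordered)
    return {
        node: (radius * cos(i * step), radius * sin(i * step))
        for i, node in enumerate(ordered)
    }


# Hypothetical genes and process labels, for illustration only.
category_of = {
    "g1": "transport",
    "g2": "cell cycle",
    "g3": "transport",
    "g4": "cell cycle",
}
positions = circular_layout(category_of)
print(list(positions))  # nodes in rim order, grouped by category
```

Because neighbors on the rim share a category, an edge within a category is a short chord near the rim, while an edge between distant categories cuts across the middle.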
This kind of layout is good for showing cross talk between different GO processes so long as the graph isn’t too complex. If there are too many edges between categories, you get a worthless picture with nodes along the rim, and a mass of edges filling the body of the circle.
Cytoscape’s default layout algorithm is a force-directed method based on a physical analogy. Imagine that nodes are balls and edges are springs. If a spring is stretched, it pulls the balls together. But if the balls get too close, the spring is compressed and pushes them apart. The layout algorithm works by solving the system of equations implied by these springs.
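Here is a bare-bones Python sketch of the ball-and-spring analogy. It is not Cytoscape's implementation. Instead of solving the equations directly, it nudges each node a little along the net spring force and repeats, and it models only the springs on edges (practical layout programs typically also push unconnected nodes apart). The rest length, stiffness, iteration count, and random seed are arbitrary assumptions.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def spring_layout(nodes, edges, rest_length=1.0, stiffness=0.1,
                  iterations=500, seed=0):
    """Relax a system of springs (one per edge) from random start positions."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9  # avoid dividing by zero
            # Hooke's law: positive when stretched (pull together),
            # negative when compressed (push apart).
            pull = stiffness * (dist - rest_length)
            fx, fy = pull * dx / dist, pull * dy / dist
            force[a][0] += fx
            force[a][1] += fy
            force[b][0] -= fx
            force[b][1] -= fy
        for n in nodes:
            pos[n][0] += force[n][0]
            pos[n][1] += force[n][1]
    return {n: tuple(p) for n, p in pos.items()}


# Three mutually connected nodes should settle into a triangle
# whose sides are close to the springs' rest length.
triangle_edges = [("A", "B"), ("B", "C"), ("A", "C")]
layout = spring_layout(["A", "B", "C"], triangle_edges)
for a, b in triangle_edges:
    print(a, b, round(distance(layout[a], layout[b]), 3))
```

On a tidy little graph like this every spring can relax completely. In a hairball the springs fight each other, no arrangement satisfies them all, and the result is the familiar tangle.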
Force-directed layout is considered the cream of the cat food, but it, too, can be easily stymied by complex graphs.
I tried Osprey and Cytoscape on a series of examples drawn mostly from the sample datasets distributed with Osprey. I judged the layouts subjectively by eye. The results were clear-cut, I think, but I urge readers to try it for themselves. I’ll discuss the results in order of graph size.
The first example was Osprey’s Harnpicharnchai et al. dataset. This is a small tree with 23 nodes and 22 edges. Both programs did fine.
The second example was Osprey’s Purified Complex dataset consisting of 158 nodes and 121 edges (0.8 edges/node). This proved to be a good illustration of the differences between the layout methods. Osprey’s default circular layout focused attention on the interactions between GO processes, while Cytoscape’s layout focused on graph structure. The Cytoscape layout showed that the dataset comprised numerous independent components, the largest of which had just 14 nodes. This means that the graph is much simpler than its raw size would indicate. Osprey’s Forced Spokes layout produced a similarly informative display.
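You don't need a drawing to discover that a graph falls apart into independent components. A breadth-first search will tell you, along with the edges-per-node figure quoted for each dataset. The Python below is a minimal sketch on a made-up graph; it does not read Osprey's files, and the function names are my own.

```python
from collections import deque


def connected_components(graph):
    """Return the components of an undirected graph, largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def edges_per_node(graph):
    """Edges divided by nodes; each edge appears in two adjacency sets."""
    num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
    return num_edges / len(graph)


# Hypothetical graph: a chain A-B-C, a pair D-E, and a loner F.
graph = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}
components = connected_components(graph)
print([len(c) for c in components])
print(edges_per_node(graph))
```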
The third example was Osprey’s Tong et al. dataset (203 nodes, 289 edges, 1.4 edges/node). Cytoscape’s layout revealed that the graph was a starburst consisting of eight or nine large trees with a few connections between them. Osprey’s circular layout showed none of this, focusing as usual on interactions between GO processes, but its Global Spoked Dual Ring layout gave the same general picture as Cytoscape.
The fourth example was the interaction dataset from Cytoscape’s tutorial #2 (330 nodes, 362 edges, 1.1 edges/node). Naturally, Cytoscape did fine on its own example. The display showed one core component containing most of the nodes plus several smaller pieces. The core component consisted of one large cycle, with several small cycles and dangling branches hanging off the side, and a small mass of nodes in the middle. All in all, it was a reasonable picture. Osprey’s default display showed a web of disorganized connections, but its Forced Spokes layout produced a nice picture that revealed islands of strong connectivity with relatively few links between them.
The final two examples were Osprey’s Synthetic Lethality (755 nodes, 982 edges, 1.3 edges/node) and Affinity Precipitation (2,365 nodes, 6,882 edges, 2.9 edges/node). Neither program could do much with these.
I also looked at two more specialized programs: VCN (Visualization of Clustered Networks) by Nizar Batada at Stanford University, and VxInsight from VisWave.
VCN is much simpler than Osprey or Cytoscape and can only do one thing, namely, display interactions between categories (e.g., GO processes). You give the program a file that defines your graph and another that tells which nodes are in each category. The program derives a new graph whose nodes represent the categories and whose edges have a weight that tells how many edges in the original graph connected nodes in the endpoint categories. It does a circular layout and draws edges whose widths reflect their weight. This gives a quick visual impression of which categories interact a lot.
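The derived category graph is simple enough to sketch in Python. This is my own minimal reading of the description above, not VCN's code or file format. It assumes each node belongs to exactly one category, and it counts an edge between two nodes of the same category as a self-loop on that category. The protein names and categories are invented.

```python
from collections import Counter


def category_graph(edges, category_of):
    """Collapse a graph onto categories.

    Returns a Counter mapping (category, category) pairs to the number of
    original edges joining nodes in those two categories.
    """
    weights = Counter()
    for a, b in edges:
        # Sort the pair so (x, y) and (y, x) count as the same edge.
        pair = tuple(sorted((category_of[a], category_of[b])))
        weights[pair] += 1
    return weights


# Hypothetical proteins and process labels, for illustration only.
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p4"), ("p3", "p4")]
category_of = {
    "p1": "transport",
    "p2": "transport",
    "p3": "cell cycle",
    "p4": "cell cycle",
}
weights = category_graph(edges, category_of)
for pair, weight in sorted(weights.items()):
    print(pair, weight)
```

The weights are what a drawing would turn into edge widths: the heavier the count, the fatter the line between the two categories.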
I tried VCN on the two examples that stymied Osprey and Cytoscape. It produced a useful display for the smaller one.
VxInsight is sold as a general-purpose data visualization tool. It does a force-directed graph layout and then constructs a “terrain view” of the result showing mountains in regions where the layout is dense. I learned that it’s possible to feed VxInsight layouts from other programs, and tried it out on a few examples. When the layout is good, VxInsight provides a useful overview of the structure.
Neither of these programs can stand alone, but they are useful adjuncts to the full-featured packages.
No magic elixir
I tried a handful of general-purpose (i.e., non-biological) layout programs, listed in the table on p. 36, to see if a magic potion exists outside our field. Some programs provide many more options and parameters for fine-tuning the layout, but they didn’t do much better on the hard cases. Pajek (Slovene for spider) gets the prize for best animation. The yEd package uses the same layout software as Cytoscape but offers more options.
Graphs are a great way to represent protein interactions and other pairwise data, but a horrible way to display the results. In an ideal world, we’d use graphs for analysis and something else for display. Sadly, biologists are convinced otherwise.
None of the programs I tried are yet up to the task of cleaning up the hairball graphs being spit at us. They’re fine for small graphs, and I recommend that everyone get one or two to look at small datasets. But, for big, complicated messes, you’re on your own.
So, crack open those graph theory books. With a little study, you too can be an expert in the emerging field of bio-hairball removal.
Nat Goodman, PhD, is a senior research scientist at the Institute for Systems Biology and an affiliate professor of bioinformatics at University of Alaska-Fairbanks. Send your comments to Nat at [email protected]
GET GRAPHIC: biological graph layout and display packages
SEE FOR YOURSELF: general purpose graph layout and display packages