Protein interactions, the basic building blocks of biological systems, are the elementary particles of systems biology. So you’ll have to get good at working with protein interaction data if you’re planning to do systems biology. But be forewarned: Get out your data scrubbers and hip boots, because this data is full of errors. Even if you think you’re used to working with messy data, protein interaction data takes it to a whole new level.
A lot of protein interaction data comes from large-scale experiments — a notoriously unreliable source. Two seminal papers published in 2001 produced large datasets of yeast interactions, 4,000 and 1,000 interactions respectively, but the overlap between the two datasets was a whopping 146 interactions! Either yeast has a ton of interactions yet to be discovered, or the big datasets are full of junk. Or both. An analysis of these datasets in combination with other yeast data was published about a year later. Bottom line of that report: about half the large-scale data is wrong.
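Comparing interaction datasets like this boils down to set intersection, with one practical wrinkle: an interaction between A and B is the same as one between B and A, so pairs have to be normalized before comparing. A minimal sketch (the yeast ORF names here are hypothetical stand-ins, not the actual dataset contents):

```python
# Illustrative only: comparing two interaction datasets as sets.
# Storing each pair as a frozenset makes A-B and B-A compare equal.
def normalize(pairs):
    return {frozenset(p) for p in pairs}

# Hypothetical interaction lists; the real 2001 datasets have ~4,000 and ~1,000 entries.
dataset_a = normalize([("YFL039C", "YDR382W"), ("YBR160W", "YPR120C")])
dataset_b = normalize([("YPR120C", "YBR160W"), ("YGR218W", "YDR002W")])

overlap = dataset_a & dataset_b
print(len(overlap))  # 1 -- the YBR160W/YPR120C pair, despite reversed order
```

Note that without the normalization step, the reversed pair in the second dataset would be missed and the overlap would look even smaller than it is.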
A considerable amount of protein interaction data also comes from small-scale experiments reported in the literature. This data is probably more reliable, although I’ve never seen a careful analysis that proves it. Regardless, as I’ll show below, more errors are introduced when the data is transcribed into the databases.
There are four major public protein interaction databases: BIND by Chris Hogue, University of Toronto; DIP by David Eisenberg, University of California, Los Angeles; MINT by Gianni Cesareni, University of Rome; and HPRD by Akhilesh Pandey, Johns Hopkins University. BIND and DIP contain both large- and small-scale datasets (with an emphasis on the large), while MINT and HPRD only contain data curated from the literature (mostly small).
BIND contains 47,449 interactions. I was able to estimate a breakdown by species by running queries and analyzing downloaded data. Most of the data is from yeast, worm, and fly, with 1,062 human interactions, 543 mouse, and 54 rat. Of DIP’s 44,124 mostly yeast, worm, and fly interactions, the human, mouse, and rat counts are 1,136, 284, and 105. MINT’s 15,624 mostly yeast interactions include 1,875 human, 902 mouse, and 250 rat interactions. HPRD focuses exclusively on human data and contains 14,545 interactions, an order of magnitude more human data than any other database.
I spot checked the databases with my favorite gene, the gene associated with Huntington’s Disease (HD, LocusLink 3064). As a baseline, I took interactions reported in surveys by Elena Cattaneo and colleagues published in 2001, and by Marcy MacDonald in 2003. The papers mention a total of 40 interactions, but only 15 are mentioned by both. BIND has just three HD interactions, all of which are in the baseline. DIP seems to have no HD interactions. MINT has 26, of which 20 are in the baseline. HPRD coincidentally also has 26, of which 19 are in the baseline.
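A spot check like this is just set arithmetic against the baseline: intersect to find confirmed interactions, subtract to find the suspects on each side. A minimal sketch with hypothetical gene symbols (the real lists are in the surveys and databases cited above):

```python
# Hypothetical HD interactor lists, for illustration only.
baseline = {"HAP1", "HIP1", "CBP", "SP1"}          # from the literature surveys
database = {"HAP1", "HIP1", "GAPDH", "SP1X"}       # "SP1X": an old alias, a common pitfall

in_baseline = database & baseline   # interactions confirmed by the surveys
extra = database - baseline         # candidates: novel, bogus, or naming problems
missing = baseline - database       # reported interactions the database lacks

print(sorted(extra))  # ['GAPDH', 'SP1X']
```

The entries in `extra` are exactly the ones worth checking out by hand, since each is either a genuine new finding, a curation error, or a gene-naming mismatch.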
I checked out the interactions that are in the databases but not the baseline. Of the six such interactions in MINT, two look legit, three look bogus, and the last seems to be a duplicate entry with an old gene name. Of the seven extra interactions in HPRD, two look right, two seem clearly wrong, one is probably an indirect interaction, and two are naming problems.
Protein interaction data is a mess. Early analysis indicates the large-scale datasets are about half wrong. My spot check of the small-scale datasets found a 15 to 20 percent error rate in data entry, and no doubt more errors are lurking in the source data.
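Here’s the arithmetic behind that range, using the counts from the spot check (counting the duplicate and naming problems as entry errors, and the probable indirect interaction as borderline):

```python
# Error-rate arithmetic from the HD spot check above.
mint_errors = 3 + 1   # three bogus entries plus one duplicate under an old gene name
hprd_errors = 2 + 2   # two clearly wrong entries plus two naming problems

print(round(mint_errors / 26 * 100))        # 15 -- MINT: 4 errors out of 26 entries
print(round((hprd_errors + 1) / 26 * 100))  # 19 -- HPRD: 5 of 26, counting the indirect one
```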
The upshot is that you can’t trust any single piece of interaction data. You have to work with it in bulk using tools that are tolerant of errors. Join me next month for a look at these magic data cleaner-uppers.
Nat Goodman, PhD, is a senior research scientist at the Institute for Systems Biology and is co-founder of HD Drug Works, which tests treatments for Huntington’s Disease. Send your comments to Nat at [email protected]