By Aaron J. Sender
Golan Yona’s computer screen in his Cornell University office displays what looks like a picture of outer space. But the galaxies on it represent the protein universe. He zooms in and navigates through the clusters to explore the relationships between individual proteins and among various subclusters. Instead of a map that accounts for a single feature, Yona’s ProtoMap integrates sequence, structure, polarity, interaction, and expression data to offer a multidimensional picture of protein space.
Usually biologists search databases for proteins or genes with sequences similar to the one they are studying to get clues about its function. “But pairwise comparisons, or local comparisons, can take us only so far,” says Yona. “Because each time you are just comparing your protein with a single protein in the database at a time.”
In 1993, as a grad student at Hebrew University in Jerusalem, Yona began to arrange the pairwise sequence relationships into a global view of how any single protein is related to the others, much the way early cartographers charted maps of Earth.
Ancient travelers collected information about pairwise distances between cities, and from this information our ancestors were able to plot maps of Earth. Though distorted, these maps offered information not available before: instead of a simple list of distances between cities, the shapes and sizes of continents began to emerge. “That is what I’m trying to do with protein space,” says Yona. “Hopefully this map will reveal information that pairwise comparisons don’t tell you. From position in the global map you will be able to learn something about the context of your protein.”
First Yona tried to embed proteins in a Euclidean space. He calculated distances between individual proteins based on sequence homology and assigned each protein a Euclidean vector, aiming to preserve the original pairwise distances. But he soon abandoned that approach. “It’s a very hard problem, especially when you are dealing with hundreds of thousands of proteins,” says Yona. “Because you have all those pairwise distances between proteins, you’re trying to create an image that will satisfy 500,000-squared number of constraints.”
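To make the embedding idea concrete, here is a minimal sketch of classical multidimensional scaling, a standard textbook way to turn a table of pairwise distances into coordinates. It is offered only as an illustration and is not necessarily the method Yona tried; the function name `classical_mds` and the four-point toy distance table are assumptions invented for this example.

```python
import numpy as np

def classical_mds(dist, dim=2):
    """Embed n points in `dim` dimensions from an n x n distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Keep the `dim` largest eigenvalues and their eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dim]
    # Negative eigenvalues mean the distances are not exactly Euclidean; clip them.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Hypothetical toy input: four items whose distances happen to match
# the corners of a unit square (made-up numbers, not real protein data).
r2 = np.sqrt(2.0)
toy_dist = np.array([
    [0.0, 1.0, r2, 1.0],
    [1.0, 0.0, 1.0, r2],
    [r2, 1.0, 0.0, 1.0],
    [1.0, r2, 1.0, 0.0],
])
coords = classical_mds(toy_dist, dim=2)
```

The toy table works because its distances are exactly Euclidean. Distances derived from sequence comparisons generally are not, and a dense table for 500,000 proteins would hold 2.5 × 10^11 entries, which gives a sense of the scale Yona describes.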
So he turned to a graph approach as a way to represent the protein space. “This was actually very effective,” he says.
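One simple way to picture a graph representation, sketched here under stated assumptions rather than as ProtoMap's actual algorithm: treat each protein as a node, connect two proteins when their similarity score clears a cutoff, and read clusters off as connected groups. The function name `cluster_by_similarity`, the protein labels, the scores, and the cutoff are all hypothetical.

```python
from collections import defaultdict

def cluster_by_similarity(scores, threshold):
    """Group proteins into connected components of a thresholded similarity graph."""
    parent = {}

    def find(x):
        # Union-find lookup; unseen proteins start as their own cluster.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)
        # An edge exists only when the pair is similar enough.
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical pairwise scores (higher means more similar); not real data.
toy_scores = {
    ("P1", "P2"): 0.90,
    ("P2", "P3"): 0.80,
    ("P3", "P4"): 0.10,
    ("P4", "P5"): 0.95,
}
clusters = cluster_by_similarity(toy_scores, threshold=0.5)
# clusters == [["P1", "P2", "P3"], ["P4", "P5"]]
```

Unlike the Euclidean embedding, this only needs the pairs that are actually similar, so the graph can stay sparse even when the number of proteins is large.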
As a Stanford University postdoc, Yona began adding structural information to the equation. Now, in his own lab at Cornell, he has just received a $1.1 million NSF grant to continue his work and fold in more parameters, such as protein-protein interactions, expression, and pathway data. “It’s not trivial how to combine all this different information,” he says.
Some data are particularly difficult to deal with. “Protein-protein interactions are one of the most problematic data. Because there is not much of that data available and it’s also very noisy,” says Yona.
The function of about half of all known proteins is still a mystery. By positioning proteins in a high-dimensional space that takes into account everything known about them, Yona hopes that functional clusters will begin to form.
The third release of ProtoMap is available through Cornell at http://protomap.cornell.edu, and a reworked version, including the new types of data, will be ready this summer.
“We are probably still very far from a complete, very accurate map of the protein space. And this is because a lot of the information is still emerging and new technologies are still being developed to measure proteins,” says Yona. You’ll just have to wait a bit before you can hitchhike to the restaurant at the end of the protein universe.