At A Glance
Name: Edward Marcotte
Position: Assistant Professor, University of Texas, Austin
Prior Experience: Co-founder, Protein Pathways
How did you get into proteomics?
I started out as a molecular biologist, but I was frustrated by not being able to say in sufficient detail how the proteins and genes worked. So I moved to x-ray crystallography because it allowed me to get, at the atomic level, mechanistic information about what a protein was doing. But by focusing in so tightly, I was losing my perspective as to what was going on around this protein, and who it was working with. While I was a postdoc [in David Eisenberg’s lab at UCLA], the first genomes were coming out, and it was clear that something dramatic had happened in biology. My colleagues and I started to learn how we could use this new kind of information from genomics and functional genomics to learn about all of the proteins and their relationships.
I developed computational methods to discover the functions and interactions of proteins on a genome-wide scale. For example, we would look at the fusions between genes that occur differentially among organisms. The fact that two genes are fused in one organism and separate in another allowed us to infer that the separate genes may work together. When we extend that across thousands and thousands of pairs of genes, we start to reconstruct systems of proteins and pathways. We did other analyses similar in spirit, in which we used comparative genomics information to discover which proteins are working together.
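The gene-fusion inference described above can be illustrated with a minimal sketch. It assumes each gene has already been assigned to one or more protein families; the function name, the family labels, and the toy genomes are all hypothetical and are not taken from the lab's actual software.

```python
from itertools import combinations

def fusion_links(genomes):
    """Return family pairs that are fused in some genome and separate in another.

    genomes maps a genome name to {gene name: tuple of family ids}.
    A gene carrying two or more families is treated as a fusion (assumption).
    """
    # Collect every pair of families found together inside a single gene.
    fused = set()
    for genes in genomes.values():
        for families in genes.values():
            for a, b in combinations(sorted(set(families)), 2):
                fused.add((a, b))

    # Keep a fused pair only if some genome encodes both as stand-alone genes.
    links = set()
    for genes in genomes.values():
        alone = {fams[0] for fams in genes.values() if len(fams) == 1}
        for a, b in fused:
            if a in alone and b in alone:
                links.add((a, b))
    return links

# Toy data (hypothetical): famX and famY are fused in A, separate in B.
genomes = {
    "organism_A": {"geneAB": ("famX", "famY")},
    "organism_B": {"gene1": ("famX",), "gene2": ("famY",), "gene3": ("famZ",)},
}
print(fusion_links(genomes))  # {('famX', 'famY')}
```

A real analysis would derive the family assignments from sequence alignments and would have to filter out promiscuous domains that fuse with many partners; the sketch skips both steps.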
As a faculty member [at the University of Texas in Austin], it became clear to me that we were generating literally thousands of hypotheses about what proteins did, and we needed data sources to test and validate these. So I have been turning to techniques that allow us to learn something about thousands of proteins at once, and that has led to mass spectrometry.
What kinds of mass spec equipment do you have in your lab, and what do you use it for?
Right now we are using a Thermo Finnigan ProteomeX LCQ Deca XP Plus system, basically an ion trap and a multidimensional chromatography system. We use it for several things. One is to establish a quantitative phenotype for cells that have been perturbed in a certain manner. We treat the protein expression levels as telling us something about the state of those cells, and as we perturb the cells with different knockouts, we see the protein levels change, and that tells us what the knocked-out gene was doing. We are also treating it as data that allows us to learn which proteins work together in systems, so we are discovering systems of proteins by their patterns of expression and the conditions they are expressed under.
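One common way to find proteins with similar expression patterns is to correlate their levels across conditions. The sketch below uses Pearson correlation with an arbitrary threshold; the interview does not say which statistic the lab uses, so the measure, the threshold, and the toy values are all assumptions.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists of expression levels."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (spread_x * spread_y)

def coexpressed_pairs(levels, threshold=0.9):
    """Return protein pairs whose profiles correlate at or above threshold."""
    return [
        (p, q)
        for p, q in combinations(sorted(levels), 2)
        if pearson(levels[p], levels[q]) >= threshold
    ]

# Hypothetical expression levels across four conditions (e.g. four knockouts).
levels = {
    "protA": [1.0, 2.0, 3.0, 4.0],
    "protB": [2.1, 3.9, 6.2, 8.0],
    "protC": [4.0, 1.0, 3.5, 2.0],
}
print(coexpressed_pairs(levels))  # [('protA', 'protB')]
```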
We work mostly in yeast and bacteria such as E. coli and Mycobacterium, but we also look at human samples. Historically, all of our computational methods have been developed largely for yeast. Using the computational methods combined with information from DNA microarrays and other large-scale functional genomics data, we have been able to reconstruct interaction networks between yeast proteins consisting of thousands and thousands of interactions for about half of the proteins of yeast. Right now the proteomics data is being used to extend and to validate these networks. But we are looking to go into more elaborate systems and to see how much these ideas that we developed in lower eukaryotes extend into higher systems like humans.
How much do these analyses draw on experimental data, and how much is predictive?
We use the experimental data to make predictions. At the core, all of it is experimental. The input is genome sequencing data, DNA microarray data, and proteomics protein expression data. But the analysis that teaches us which proteins work together is the result of fairly complex computation. For example, one of the analyses that we do surveys a given protein across all known genomes and records which genomes contain it and which do not. If two proteins share the same distribution, it turns out that that’s almost always because they are working in the same pathway. It’s a non-obvious conclusion that comes from this very broad survey involving often millions or billions of sequence analyses.
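The distribution comparison described above can be sketched as grouping proteins by their presence/absence pattern across genomes. The sketch assumes the presence calls (1 = a homolog was found in that genome, 0 = not found) have already been made by sequence searches; the names and data are hypothetical.

```python
from collections import defaultdict

def group_by_profile(presence):
    """Group proteins that share an identical presence/absence profile.

    presence maps a protein name to a sequence of 0/1 flags, one per genome,
    always in the same genome order.
    """
    groups = defaultdict(list)
    for protein, profile in presence.items():
        groups[tuple(profile)].append(protein)
    # Only profiles shared by two or more proteins suggest a common pathway.
    return [sorted(members) for members in groups.values() if len(members) > 1]

# Hypothetical profiles across four genomes.
presence = {
    "P1": (1, 0, 1, 1),
    "P2": (1, 0, 1, 1),
    "P3": (0, 1, 1, 0),
}
print(group_by_profile(presence))  # [['P1', 'P2']]
```

Exact matching is the simplest possible rule. In practice one would probably tolerate a few mismatches and ignore proteins present in every genome, since those profiles carry no information.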
Could you give another example like that?
For lower organisms, proteins that are always adjacent on the chromosome turn out to be in the same pathways. These are methods developed by other people (Ross Overbeek, Peer Bork, and Julio Collado-Vides), and they provide a way to reconstruct operons automatically. When you integrate these pieces of information with the distribution across organisms, the presence of gene fusions, DNA microarray data about co-expression, and so on, what comes out is a very complex interaction network for the organism.
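A minimal sketch of the conserved-neighbor idea: report pairs of gene families that are adjacent in every genome where both occur. This is only an illustration of the principle, not the published methods of the researchers named above; the representation of a genome as an ordered list of family ids and the `min_genomes` cutoff are assumptions.

```python
from collections import Counter
from itertools import combinations

def conserved_neighbors(genomes, min_genomes=2):
    """Return family pairs adjacent in every genome that contains both.

    genomes maps a genome name to its gene order, as a list of family ids.
    """
    adjacent = Counter()  # genomes in which the pair sits side by side
    cooccur = Counter()   # genomes in which both families are present
    for order in genomes.values():
        neighbor_pairs = {tuple(sorted(p)) for p in zip(order, order[1:])}
        adjacent.update(neighbor_pairs)
        cooccur.update(tuple(sorted(p)) for p in combinations(set(order), 2))
    return {
        pair
        for pair, n in adjacent.items()
        if n >= min_genomes and n == cooccur[pair]
    }

# Hypothetical gene orders: only a and b stay together in all three genomes.
genomes = {
    "g1": ["a", "b", "c", "d"],
    "g2": ["c", "a", "b", "d"],
    "g3": ["b", "a", "d", "c"],
}
print(conserved_neighbors(genomes))  # {('a', 'b')}
```

The sketch ignores strand, intergenic distance, and circular chromosomes, all of which a real operon predictor would need to consider.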
How easy or difficult is it to make the jump from bacteria to yeast to higher organisms?
For bacteria it is quite easy to calculate networks. For yeast it is also quite easy to calculate such networks, and the networks are of steadily improving quality. Yeast hasn’t evolved a lot of the complicating factors of humans. For example, the number of alternative splice variants in yeast is very, very low. By the time you get up to humans, you have many different proteins associated with each gene. Part of what’s necessary in going to higher organisms is developing algorithms and ways to deal with these complications.
Your website says you are also developing protein microarrays?
We wanted protein [antibody] microarrays to measure protein expression levels. Of course that requires that you have many different antibodies. Although it’s an interesting and promising technology, we have almost entirely gone with the mass spectrometry proteomics because it got around the requirement for specific binding partners. Several of my colleagues are developing high-throughput methods to develop single-chain antibodies with in vitro evolution selection techniques and RNA and DNA aptamers specific to each protein in a genome.
What is the main challenge of mass spectrometry?
The funny thing about mass spectrometry is that the technology is fantastic. What lags is the algorithmic side in interpreting the many spectra that come out of the machine and deciding which proteins produce the peptides that produce the spectra you are looking at. It actually turns out to be a very difficult computational problem.
We are working with a computer scientist named Daniel Miranker [at the University of Texas, Austin] to develop software called MoBIoS. It is a general database management system for handling many types of biological data, including sequences and mass spectra, and is designed to speed up the searching of very large biological databases by virtue of the way data is stored and accessed. Within MoBIoS, we are building our own algorithms to do peptide mass fingerprinting and to map MS/MS spectra to peptides. We expect to be releasing versions over the next 6 to 12 months.
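To illustrate how the way data is stored can speed up a lookup, the sketch below keeps peptide masses in sorted order so that a mass-window query is a binary search rather than a scan of the whole database. This is a generic technique chosen for illustration; the interview does not describe how MoBIoS is implemented, and the class name and example entries are hypothetical.

```python
from bisect import bisect_left, bisect_right

class MassIndex:
    """Sorted index of (mass, peptide) entries for fast mass-window queries."""

    def __init__(self, entries):
        # Sort once up front; every later query is then O(log n + hits).
        self._entries = sorted(entries)
        self._masses = [mass for mass, _ in self._entries]

    def query(self, observed, tol):
        """Return peptides whose mass lies within observed +/- tol."""
        lo = bisect_left(self._masses, observed - tol)
        hi = bisect_right(self._masses, observed + tol)
        return [peptide for _, peptide in self._entries[lo:hi]]

# Hypothetical database of precomputed peptide masses (in daltons).
index = MassIndex([
    (802.40, "SAMPLER"),
    (799.36, "PEPTIDE"),
    (1045.53, "EXAMPLEK"),
])
print(index.query(799.36, 0.05))  # ['PEPTIDE']
```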
What exactly do these algorithms improve?
One area of improvement is how to handle posttranslational modifications in the database lookup. It adds a level of complexity, because obviously instead of looking for a perfect match in the database, now you are looking for a poor match, but a poor match of a certain type, which corresponds to a modification. Another area is the speed of the lookup, which is really just a matter of using different ways to code the algorithms. Currently it takes us about 12 to 15 hours of experimental time to analyze the complete proteome of a bacterium, and then about 12 to 15 hours of computer time, so the analysis is nowhere near real time. That’s just a matter of tuning and tweaking and parallelizing the algorithms until they are running faster.
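The "poor match of a certain type" can be pictured as a mass difference that equals a known modification. The sketch below matches an observed peptide mass against database peptides either exactly or shifted by one modification. The residue and modification masses are standard monoisotopic values; the function names, the tolerance, and the short modification list are assumptions, and the sketch does not check whether the peptide actually contains a residue that can carry the modification.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
# A few common modifications and their mass shifts (illustrative subset).
MODS = {"phospho": 79.96633, "oxidation": 15.99491, "acetyl": 42.01057}

def peptide_mass(seq):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER

def match(observed, peptides, tol=0.01):
    """Return (peptide, modification) hits; modification is None if exact."""
    hits = []
    for pep in peptides:
        base = peptide_mass(pep)
        if abs(observed - base) <= tol:
            hits.append((pep, None))
        for name, delta in MODS.items():
            if abs(observed - (base + delta)) <= tol:
                hits.append((pep, name))
    return hits

database = ["SAMPLER", "PEPTIDE"]
observed = peptide_mass("PEPTIDE") + MODS["phospho"]
print(match(observed, database))  # [('PEPTIDE', 'phospho')]
```

Each modification considered multiplies the number of candidate masses per peptide, which is one reason the lookup gets slower as more modifications are allowed.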
What about data storage?
This is another area that needs to be addressed, probably by the community as a whole: how to handle the data and how to remember the history of what’s been measured in the past. Other fields like protein crystallography and now DNA microarrays have universal standards for data formats, and there are certain public databases available for microarray data and protein structure data. But nothing like that exists for proteomics. The closest is probably SWISS-2DPAGE. But there are only one or two public mass spectrometry proteomics datasets available. Depositing more such datasets would really stimulate research in the field and allow people to make algorithmic improvements. Creating a history of proteins that you have previously identified is an effective way to recognize them again when you see them. Such a database would be for the raw spectral data, just as microarray databases carry the raw information about the fluorescence of the spots on the chip. People are able to go back to the original chip data and reanalyze it under different statistical models, and get much more information. That’s surely going to happen with proteomics.
What is your involvement with Protein Pathways?
During my postdoc, six of us co-founded a company called Protein Pathways, based in Los Angeles, that uses many of the computational approaches [licensed from UCLA] to reconstruct protein networks and then employs that information in drug design and drug target identification. The company focused for a couple of years on antibiotic development and has now made the transition into human gene network analysis. It started an experimental arm to validate the targets identified computationally. I am both a co-founder of and a consultant for the company.