At A Glance
Name: Andrew Emili
Position: Assistant professor of proteomics and bioinformatics, University of Toronto, since 2000.
Got CA$6.1 million ($4.4 million) in a Genome Canada competition for a heart biomarker project (see PM 4-16-04).
Background: Post-doc with Lee Hartwell, Fred Hutchinson Cancer Research Center, 1997-2000.
PhD in molecular and medical genetics, University of Toronto, 1997.
BS in microbiology and immunology, McGill University, 1990.
A couple of years ago you published a paper in Nature Biotech about an alternative to ICAT that you called MCAT (see PM 2-11-02). What is the status of that project?
At the point we were doing those studies, we weren’t a well-funded lab. And ICAT, beautiful, elegant technique that it is, is quite expensive, because it uses stable isotopes. So the idea was, can we substitute a chemical moiety for the heavy isotope labels? Perusing the literature and testing out quite a few different chemicals, we found a derivatization method that could perform at least in the same ballpark for protein quantitation.
What I’ve learned in the past few years, since putting out that paper, is that we have to do a better job in terms of data analysis. That’s really the limiting factor, I think, in accurate protein quantitation. So my lab has shifted quite heavily away from chemical labeling strategies and toward a much greater emphasis on data analysis and computation.
So you’re working on informatics now?
All informatics now. That’s the name of the game for proteomics.
The main emphasis right now is that the ion signals recorded in a mass spectrometer, that is, the intensities recorded for peptides, and presumably also proteins, are proportional to the relative abundances of those peptides. That’s not to say that each peptide doesn’t have its own unique ionization properties, but it is to say that if you have more of one peptide in one sample versus another, you’re going to see an increase in signal that’s proportional to that amount. The ICAT and MCAT approaches have a built-in internal control to evaluate peptide levels, but I don’t think that’s necessary. We’re us[ing] ion signals, or ion maps, without any labeling. [It’s] sort of the equivalent of the Liotta SELDI-TOF ion maps, but now we have a third dimension: time. So we have m/z, we have the intensity of peptides coming off our chromatography system, and we have the retention time when the peptides elute.
A lot of work is going on in this lab on developing an informatics platform to extract the ion signals of each of the peptide peaks. We can store that as an information map, or peptide survey, that we can then mine later on. One of the major problems for accurate quantitation, particularly for low-abundance peptides, is just the stochastic variation that an LC-MS system gets. It’s noisy. One of the ways to tackle noisy data is just to repeat the experiment. By definition, noise shouldn’t be reproducible, and the real signal should be.
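To make the ion-map idea concrete, here is a minimal sketch of how one such survey might be represented, assuming each detected peptide feature reduces to three numbers: m/z, retention time, and integrated intensity. The names (PeptideFeature, IonMap) and tolerances are illustrative assumptions, not Emili’s actual platform.

```python
# Hypothetical representation of an LC-MS "ion map" (one survey run).
from dataclasses import dataclass
from typing import List


@dataclass
class PeptideFeature:
    mz: float          # mass-to-charge of the peptide peak
    rt: float          # retention time (minutes) when the peptide elutes
    intensity: float   # integrated ion signal, roughly proportional to abundance


@dataclass
class IonMap:
    """One LC-MS survey: a collection of peptide features for one sample."""
    sample_id: str
    features: List[PeptideFeature]

    def features_near(self, mz: float, rt: float,
                      mz_tol: float = 0.5, rt_tol: float = 2.0) -> List[PeptideFeature]:
        """Return features falling within assumed m/z and retention-time tolerances."""
        return [f for f in self.features
                if abs(f.mz - mz) <= mz_tol and abs(f.rt - rt) <= rt_tol]
```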
So our strategy now is, forget about chemical labeling, just run a sample in LC-MS in a very quick manner, not even trying to do MS/MS to sequence the peptides. Just measure the masses of the peptides, and run each sample a few times, anywhere between five and 20 times. Then we take all those surveys, or ion maps, and, using statistical procedures, we combine the datasets to come up with an idealized version of the data. The idea is, we have noisy data, but if we combine multiple experiments, we should come up with less noisy data. And then we think we can do pretty accurate quantitation of the relative intensity, or relative abundance, of peptides in one set of samples versus another.
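A rough sketch of the replicate-combining step he describes might look like the following, under assumed tolerances (0.5 m/z, 2 minutes retention time) and a simple greedy grouping. Features seen in too few replicates are treated as noise and dropped; the rest get a median value. This is an illustration of the idea, not the lab’s code.

```python
# Merge replicate LC-MS feature lists into one de-noised "idealized" map.
from statistics import median
from typing import List, Tuple

Feature = Tuple[float, float, float]  # (mz, rt, intensity)


def consensus_map(replicates: List[List[Feature]],
                  min_runs: int = 3,
                  mz_tol: float = 0.5,
                  rt_tol: float = 2.0) -> List[Feature]:
    """Group features across replicate runs and keep only reproducible ones."""
    groups: List[List[Feature]] = []
    for run in replicates:
        for mz, rt, inten in run:
            for g in groups:
                gmz, grt, _ = g[0]
                if abs(mz - gmz) <= mz_tol and abs(rt - grt) <= rt_tol:
                    g.append((mz, rt, inten))
                    break
            else:
                groups.append([(mz, rt, inten)])
    # Keep features reproduced in at least min_runs replicates; summarize by median.
    return [(median(f[0] for f in g),
             median(f[1] for f in g),
             median(f[2] for f in g))
            for g in groups if len(g) >= min_runs]
```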
Do you have difficulties doing so many repetitions if you have small amounts of sample?
Yes and no. We’re sample-limited only in the sense that people want to sequence everything. So if I were doing a MudPIT experiment, I wouldn’t do it 10 times. But what I’m talking about now is not a MudPIT experiment; it’s just a straightforward LC-MS run on a single reverse-phase column. You don’t need a lot of material. We think we can find reproducible differences, by whatever criteria you want to call reproducible, and then we go after those peptide peaks and try to sequence them afterwards.
In the typical experiments in this lab, we’re not particularly sample-limited [anyway]. We [usually] work with some model organism, like mouse tissue, or yeast, [where it’s] not very hard for us to get a lot of material. Particularly for this profiling, we don’t need hundreds of micrograms; we maybe need 10 or 20 micrograms. So for the heart project, most of the samples are going to be mouse models, and it’s quite easy for us to get grams of mouse hearts, diseased versus healthy.
So you’ve done proof of concept at this point?
We’ve done proof of concept, [and] we’ve submitted a paper. Basically we’re going ahead with this approach, guns a-blazing. I think it’s a very powerful way of doing a couple of things. One is sample classification: you don’t really need to know the identities of all the peptides to ask, ‘what does this sample look like compared to a reference database of samples?’ And I also think it’s a much better means of pattern recognition, particularly with noisy data.
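One way to read the classification idea is that each sample reduces to a vector of feature intensities on a common (m/z, retention time) grid, which can then be compared against labeled reference profiles without ever identifying the peptides. The binning into vectors and the cosine-similarity measure below are my assumptions for illustration, not a published method from this lab.

```python
# Classify a sample's intensity profile against a reference database by similarity.
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(sample: List[float], references: Dict[str, List[float]]) -> str:
    """Return the label of the reference profile most similar to the sample."""
    return max(references, key=lambda label: cosine(sample, references[label]))


# Toy usage with made-up profiles (e.g. 'healthy' vs 'diseased' mouse heart):
refs = {"healthy": [1.0, 0.2, 0.0, 3.1], "diseased": [0.1, 2.5, 1.8, 0.3]}
print(classify([0.2, 2.2, 1.5, 0.4], refs))  # -> "diseased"
```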
What instrumentation are you using?
We’re a humble, poor lab, so we only have humble, lowly [Thermo LCQ] ion traps. The software’s been designed to work with data that is not high resolution, so we can track between 2,000 and 5,000 peptides with an ion trap. If we ever shifted over to an FT, if the granting gods are ever good to us, I can see us tracking 10,000 or 50,000 different features. And that starts to rival microarrays. [But] specialized instruments might be great performance-wise, but in practice I’d rather drive a Honda than a Ferrari. A Ferrari is nice, but the Thermo ion trap has been a stable platform for us. It’s been a workhorse, and that’s fine by me.
The fixation in our business has been [with] sequencing: let’s sequence peptide after peptide. My view is, why do we bother to sequence things again and again, especially when we’re not sure that they’ve changed between the samples that we’re interested in? By mining LC-MS ion maps, one can go through [a few] MudPIT experiments and find everything you’re going to find. And then you say, now I can just track the peptides as peaks, as mass-to-charges and retention times, [and] figure out what the sequence is without doing MS/MS; all I have to do is map it back to our MudPIT data. It’s an informatics problem, but it’s not a technical mass spec problem anymore. So then you can start pumping through sample after sample to look for biological variation across samples, and find the statistically interesting changes in protein abundance.
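The mapping-back step he describes could look roughly like the lookup below: a tracked (m/z, retention time) feature is matched against a library of peptides already sequenced in earlier MudPIT runs, so no new MS/MS is needed. The tolerances, the scoring, the library format, and the example peptides are assumptions for illustration only.

```python
# Assign a sequence to a tracked LC-MS feature by matching it to a MudPIT-derived library.
from typing import List, Optional, Tuple

# Library entries: (peptide sequence, m/z, retention time) from prior MudPIT runs.
Library = List[Tuple[str, float, float]]


def lookup_sequence(mz: float, rt: float, library: Library,
                    mz_tol: float = 0.5, rt_tol: float = 2.0) -> Optional[str]:
    """Return the library sequence whose m/z and RT best match the feature, if any."""
    candidates = [(abs(mz - lmz) + abs(rt - lrt) / 10.0, seq)
                  for seq, lmz, lrt in library
                  if abs(mz - lmz) <= mz_tol and abs(rt - lrt) <= rt_tol]
    return min(candidates)[1] if candidates else None


# Toy usage with illustrative entries:
library = [("LVNEVTEFAK", 575.31, 42.7), ("HLVDEPQNLIK", 653.36, 38.2)]
print(lookup_sequence(575.4, 43.0, library))  # -> "LVNEVTEFAK"
```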
The one thing that’s killed the microarray community is that, because it’s expensive to do a microarray experiment, people don’t tend to repeat experiments enough to get statistically valid data. And that’s cursed the field, because a lot of people are drawing conclusions from data that isn’t robust. I don’t want to follow that path in proteomics. If anything, we have a worse scenario: a MudPIT run takes a day, it’s choppy, the sequencing is stochastic and misses things, and the quantitation, no matter what technique people use, is quite variable. It’s just noisy data. So either we’re going to collectively make a mistake and pump out data that is not reliable, or the field is going to pull itself up by the suspenders and say, ‘we’ve got to bring in informatics to make sure our conclusions are reliable.’
Other than informatics and statistics, what other big obstacles are we facing in proteomics?
Fundamentally, the biggest issue is just access to the technology. Virtually every lab in molecular biology has a PCR machine and has access to running a microarray experiment. With proteomics, on the other hand, there are very few labs that can do MudPIT, very few groups that can do these kinds of things. The technology is just not there. So my view is, I love the FT-MS, the Q-TOFs, all these great instruments, but I don’t think they’re particularly enabling for biomedical research the way they’re going right now.
What I’ll say as an example of the huge need out there is that there’s a lot of interest in the Ciphergen SELDI-TOF platform. As far as I can tell, that’s a pretty poor, low-resolution instrument. But the platform concept and the ease of use have made it the choice for many biomedical centers. This worries me, because I think there’s going to be a backlash [if] the data all turns out to be useless, but it tells you that people want to get their hands on the instrumentation. I don’t think the manufacturers are going that route. Maybe in five or 10 years, most researchers will have access and it will be trivial to do a proteomics experiment. Some things are easy now. But things like MudPIT [are] no comparison to microarrays. Every lab can do a microarray now. Very few can do a protein profiling experiment.
Do you think it’s just a matter of time, or do we need to come up with new techniques entirely?
I think it has to be the techniques side. Not to criticize anyone, but ICAT was supposed to revolutionize protein profiling. That was published in 1999. It’s now 2004. If you look at the number of papers using ICAT, it’s a handful. The impact of that technology on proteomics, I would say, has been pretty minimal. Whereas microarrays, maybe they’ve had a few years’ advantage, but you shouldn’t underestimate how much influence they’ve had on the biomedical community. I get worried that if enough people don’t see some return on the investments in mass spec centers and things like that, there might be a bit of a backlash. I’m an optimist, though; I wouldn’t be in this business if I didn’t think it was worthwhile.