This story has been updated from a previous version to include contributions made by Ermir Qeli, a collaborator on the research.
By Tony Fong
Name: Christian Ahrens
Position: Scientific coordinator bioinformatics, Center for Model Organism Proteomes, 2007 to present
Background: Staff scientist proteomics, bioinformatics, Functional Genomics Center Zurich, 2005 to 2006; head of bioinformatics analysis, senior bioinformatics scientist, drug discovery, MDS, 2002 to 2004
In a study published June 22 online in Genome Research, researchers describe a new peptide classification and protein inference method that helped them identify around 3,500 proteins in pollen — around 13 times the previously reported pollen proteome.
In the study, the researchers said that pollen represents an ideal biological system for studying developmental processes, such as cell polarity, tip growth, and morphogenesis, but noted that most of what is known about pollen development and function comes from genetic and transcriptomic studies, while knowledge about the pollen proteome has been limited to 266 distinct proteins.
In addition to issues surrounding sample collection, "the significant amount of genome duplication in higher plants, combined with the expectation (based on transcriptomics data) that a large percentage of proteins can only be identified by a single peptide, poses a significant data analysis challenge," the authors of the GR study wrote.
Because shotgun proteomics makes it difficult to identify peptides that are unambiguously assigned to one protein, strategies are needed to extract unambiguous protein evidence. In the study, the researchers describe a novel, deterministic peptide classification and protein inference scheme for shotgun proteomics data, "which differs from the existing approaches such as ProteinProphet, EBP, and IDPicker."
Their approach considers the gene model/protein sequence/protein accession relationships and classifies each peptide sequence according to its information content. The method distinguishes unique peptides from those shared by several proteins, and unlike probabilistic approaches, it considers only peptides above a certain confidence threshold after the peptide-spectrum matching process, they said in their study. Because it considers the protein-gene model relationship, their classification scheme also enables a "seamless integration with transcriptomics datasets," they added.
Their method identified about 3,500 proteins, expanding the mature pollen proteome by a factor of 13, they said. Integrating their data with published transcriptomics datasets, they report more than 500 proteins that were not previously identified in mature pollen.
This week ProteoMonitor spoke with Christian Ahrens, scientific coordinator for bioinformatics at the Center for Model Organism Proteomes in Zurich, and senior co-author of the GR study. Below is an edited transcript of the conversation.
Why is the pollen proteome map so incomplete? Is it because of inherent difficulties with pollen proteins or because people just haven't tried to map it?
There is an inherent problem … and that is that the pollen, which is the male gametophyte of higher plants, is actually microscopically small. … We were lucky enough to collaborate with Dr. [Ueli] Grossniklaus who's a plant developmental biologist and he basically used a method [that] you can basically think of as a vacuum cleaner where you have different kinds of filters.
And with this vacuum cleaner, you have to basically vacuum the microscopically small pollen that you can hardly see to get enough material. So that's one problem.
But there has been a previous study using 2D gel electrophoresis … and that is a good approach. However, it is clear that for a detailed proteome study, you just do not have enough sensitivity and that shotgun proteomics is basically giving a lot more information, [and] is less biased toward membrane proteins and other protein classes. This has been known for quite a while.
So that is one of the reasons [that with] this shotgun proteomics approach, we could extend the list of identified proteins from the 266 reported, if you take together three previous studies on pollen, by a factor of 13, identifying about 3,500 proteins.
You also said that this method that you devised and used is really directed at the protein inference problem in shotgun proteomics. Can you describe what this inference problem is?
Compared to 2D gel electrophoresis where you separate in two dimensions intact proteins, the so-called shotgun proteomics method, which was devised in the late 90s by [Michael] Washburn and [John] Yates, basically has a different workflow, so you extract your proteins from your sample of interest, in our case the pollen.
And then you digest your protein with an enzyme. We used trypsin … and what you end up with is a very, very complex peptide mixture. … And that's another reason we got so far; we have a very good setup for reducing the complexity of the peptide sample, so we do a lot of fractionations. Monica Grobei, the PhD student who did the experimental work, used different protein extraction protocols to really get different types of proteins and, of course, peptides.
The key problem with the shotgun proteomics approach is that you're losing the connection [between] the peptides and the proteins they were derived from. So now … if you've done the analysis with a mass spectrometer you get a large number of spectra. You search the spectra in a protein database — and we used the reference database for Arabidopsis, TAIR 7 — but then you get all these assignments: for each spectrum that you measure, you get an assignment to the peptide it points to.
The key problem here is now you get all these lists of peptides, [and] you have to now assemble from all this peptide evidence which were the proteins these peptides were derived from.
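The workflow just described — in silico digestion followed by mapping each observed peptide back to every protein that could have produced it — can be sketched in a few lines of Python. The protein sequences below are invented for illustration, and the accession strings merely mimic the TAIR naming pattern; none of this is the authors' actual code or data.

```python
import re

def trypsin_digest(sequence):
    """In silico trypsin digest: cleave after K or R, but not before P."""
    return [p for p in re.split(r'(?<=[KR])(?!P)', sequence) if p]

# Toy protein database (sequences invented for illustration).
proteins = {
    "AT1G01010.1": "MKTAYIAKQRQISFVK",
    "AT2G02020.1": "MSLNWKQRQISFVK",   # shares two C-terminal peptides
}

# Invert the digest: which proteins could each peptide have come from?
peptide_to_proteins = {}
for acc, seq in proteins.items():
    for pep in trypsin_digest(seq):
        peptide_to_proteins.setdefault(pep, set()).add(acc)

for pep, accs in sorted(peptide_to_proteins.items()):
    status = "unique" if len(accs) == 1 else "shared"
    print(f"{pep}: {status} -> {sorted(accs)}")
```

Even in this tiny example, the peptides "QR" and "QISFVK" are shared between both accessions, so on their own they cannot say which protein was actually in the sample — the inference problem in miniature.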
My co-worker Dr. Ermir Qeli and I devised a novel, deterministic classification scheme that overcomes some of these problems that have been plaguing shotgun proteomics.
We also look at the in silico analysis of the database. We basically take all the protein sequences that are in this database, we look at those protein sequences that are distinguishable, so that we can identify them with proteomics methods and differentiate them, and then we compare to this index all the experimental results we get from all our spectra to peptide matches.
And then for each peptide we can give the information content. We can say, 'This peptide unambiguously only points to this protein.' And obviously, this is what people have not done before, and this gives us a lot more information and allows us to put the dataset into different bins.
So we have a very large set of peptides that we know unambiguously point to only one protein. Then we have another group of peptides where we know ... we cannot distinguish among several protein accessions, but they all point to the same sequence and they're all from the same gene model, so this goes in the direction of splice variants.
Some of the splice variants in the database … differ only in their untranslated region, so they don't differ in the protein sequence but [differ] in the regulatory regions and we can detect that.
So what is novel about your approach?
What is novel is the link to the gene model — to consider that one gene model can have many different protein isoforms and to look at [whether] these isoforms are really different in terms of their sequence.
What we show [is we] can distinguish class 1A, class 1B, and class 2, and class 2 would be peptides that can identify unambiguously a gene model but not which of potentially several protein isoforms we detected.
But up to this level is very important because this still can be used to seamlessly integrate with transcriptomic data. Another major thing that we did is this integration, to take this data [and] map it back to the genome.
If you look at all the genome browsers, all the information is basically linked to the genome. You can go up to nucleotide resolution and look at where your coding exons are … and we can now do this also with protein information.
Why do we do this? Because we found here in the pollen proteome … that if we look in silico at a theoretical digest of the protein databases of different organisms, really [for] a significant percentage of these peptides you cannot assign which proteins they came from, which gene models.
So they could be encoded by different gene models. This is a big problem. Some of the implications of that are that we believe it would be now the time to think about extending the guidelines that have been put forward [by the different journals] where they require two unique peptides.
We suggest that maybe one should think about extending these guidelines to really integrate the information content of the peptide. Two unique peptides are not enough. If you look at the classification that we've done, for example, if you have two peptides of class 3B, these are peptides that could point to different proteins from different gene models. There's no way you can say it is one of those proteins that you have identified.
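The classification logic described over the last few answers can be sketched as a small decision function. The class labels follow the ones named in the interview (1a, 1b, 2, and a shared-across-gene-models class), but the exact labels, the toy accessions, and the placeholder sequences here are illustrative assumptions, not the published implementation.

```python
def classify_peptide(peptide, pep_to_accs, acc_to_gene, acc_to_seq):
    """Classify a peptide by its information content: does it pin down
    one protein, one sequence, one gene model, or none of these?"""
    accs = pep_to_accs[peptide]
    genes = {acc_to_gene[a] for a in accs}
    seqs = {acc_to_seq[a] for a in accs}
    if len(accs) == 1:
        return "1a"  # unambiguously points to one protein accession
    if len(genes) == 1 and len(seqs) == 1:
        # Several accessions (e.g. splice variants differing only in
        # their untranslated regions) but one identical protein sequence.
        return "1b"
    if len(genes) == 1:
        return "2"   # identifies the gene model, but not which isoform
    return "3"       # could be encoded by different gene models

# Toy mappings (TAIR-style accession pattern; sequences are placeholders).
acc_to_gene = {"AT1G01010.1": "AT1G01010", "AT1G01010.2": "AT1G01010",
               "AT1G01010.3": "AT1G01010", "AT3G03030.1": "AT3G03030"}
acc_to_seq = {"AT1G01010.1": "SEQA", "AT1G01010.2": "SEQA",
              "AT1G01010.3": "SEQB", "AT3G03030.1": "SEQC"}
pep_to_accs = {
    "TAYIAK": {"AT1G01010.3"},                                # class 1a
    "QISFVK": {"AT1G01010.1", "AT1G01010.2"},                 # class 1b
    "ELVISK": {"AT1G01010.1", "AT1G01010.2", "AT1G01010.3"},  # class 2
    "MSLNWK": {"AT1G01010.1", "AT3G03030.1"},                 # class 3
}
for pep in pep_to_accs:
    print(pep, classify_peptide(pep, pep_to_accs, acc_to_gene, acc_to_seq))
```

On this toy data, two peptides like "MSLNWK" that map across gene models would each satisfy a naive two-peptide rule while still failing to identify any single protein, which is the gap in the guidelines the interview points to.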
And that's one thing that people have not done so far.
Would that call for a different workflow than what people have been doing?
Not necessarily. … We believe we could reduce the amount of ambiguous protein or wrong protein identifications using our approach. We've compared it to ProteinProphet. We see that we provide [fewer] identifications, but these are unambiguous.
And ProteinProphet has another feature … it goes down to select additional peptides below the probability cutoff that we use, to add information. And we believe that is something people have to be careful about. If we want to think about reference datasets, and that's clearly one of the key applications here, this is one of the things that people can start to consider: this is the way to go if they want to provide reference datasets to the community that have as [few] errors as possible.
Are people prepared to follow this new workflow? Would researchers have to acquire a new base of knowledge or be trained differently?
Not necessarily. I think what is nice is this approach is fairly simple. That is one of the advantages because this in silico analysis can be done [quickly]. We can provide the code … then [researchers] can classify their peptide evidence and then start from there to generate a protein list.
Certainly, we believe it will be important for us to put the software out. And actually, when Dr. Qeli and I presented this, other leading figures in the bioinformatics field [pointed] out that they would really like us to provide this information. … Then they could eliminate certain errors that typically are made.
What's the plan for putting out the software? Would this be open-access or are you commercializing it?
Clearly open-access, and we're just now thinking about what would be the best format because obviously we believe there are some useful approaches here that people hopefully will want to use.
How much of the pollen proteome is covered by your approach?
Because we are extending it from 266 to 3,500 [proteins], it's a difficult question [to answer]. What was good was that on one hand we had the transcriptomic data that could guide us, and there were something like 6,500 transcripts seen by transcriptomic studies.
But what was more important was that Monica Grobei and Ueli Grossniklaus did a large-scale analysis of the literature on the mutants that have been described for pollen. What they could show was that when we look at the mutants that affect this mature pollen … we have seen 70 percent of the described mutants affecting pollen that they could find in the literature.
We believe that it's certainly not complete; clearly there will be more low-abundance proteins that we did not identify where we would need more material, more preparations, [and] other approaches. But we have come pretty far for this very difficult biological starting material.
Have you been able to tell whether pollen's dynamic range will be a problem?
I think it is a problem with any proteomics approach. … The classical example is the plasma proteome where the dynamic range can span 10 orders of magnitude. People believe that in tissues, this will typically go, at the proteomics level, to something like six orders of magnitude.
From our spectral counts, we span three to four orders of magnitude in abundance. And I think this is probably realistic.
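As a back-of-the-envelope illustration of that figure: the dynamic range in orders of magnitude is just the base-10 logarithm of the ratio between the highest and lowest spectral counts. The counts below are hypothetical, not values from the study.

```python
import math

# Hypothetical spectral counts for the most and least abundant
# proteins detected in a shotgun experiment.
max_count, min_count = 4200, 1

orders = math.log10(max_count / min_count)
print(f"dynamic range: ~{orders:.1f} orders of magnitude")
```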
What do you do with this list of proteins that you've identified? What do your findings mean biologically?
We believe that doing proteomics studies gives you more information than just transcriptomics because [of] all the problems with transcriptomics — with the arrays — this hopefully will change with … less biased approaches.
But obviously the proteomics evidence gives you a lot of information so people could now go into this list and look for, for example, proteins from their favorite pathway, look [for] and take all these specific proteotypic peptides we present and order them and do quantitative targeted proteomics studies and really assess all these peptides in parallel.
And this is something that is comparable to a microarray experiment in proteomics. This is clearly what is driving the hype in systems biology, that you really can create these complete quantitative series which have been, so far, a problem with shotgun proteomics because it is of a stochastic nature.
Is this method applicable for proteomics researchers who are studying other systems such as a mammalian system?
Clearly. Actually … people have asked us whether they can use it even for prokaryotes. It's universally applicable. The higher the percent of ambiguity that you have at the peptide level, the more useful it is.