Eighty members of the HUPO Plasma Proteome Project gathered in Ann Arbor, Mich., earlier this month to sift through 58,000 protein identifications generated during the project’s pilot phase — and to prepare a manuscript that will include a set of recommended standard operating protocols for plasma proteomics studies.
The manuscript, to be published in a special edition of Proteomics, will include a list of proteins and annotations, as well as a detailed analysis of various platforms and methods from front-end to back-end, culminating in the recommended SOPs, Gil Omenn, director of HPPP, told ProteoMonitor last week. Accompanying this large manuscript will be about 25 other smaller manuscripts from individual workgroups and labs, and from 17 specially funded projects, Omenn said. The data for all of the manuscripts are due July 9. Omenn hopes that the peer review and revision processes will be finished by the end of September, with at least the main article published by the end of this year.
This plan diverges somewhat from Omenn’s original expectation that the manuscript drafts would be generated at the workshop itself (see PM 1-23-04). Although some drafts were generated at the workshop, much of the data is still going through further analysis, according to Omenn. “The main conclusions are still in progress,” he said. In addition, the HPPP has not yet seen all the data from the special projects, which were funded with a pool of $200,000 that HPPP set aside. “The data [for the special projects] were due by May 15, and some came in on or around May 15, some are just arriving; a few may not have shown up yet … we’ve [still] got to take in the new datasets and make sure they’re OK, and they meet the standards that we’ve got,” Omenn said.
The workshop did generate plenty of ideas about what work remains to be done. “There needs to be much more attention to the detail in how specimens are collected and handled,” Omenn said. In addition, although many labs spent a lot of time at the workshop going through each other’s technology platforms to compare the effectiveness of various components, these efforts were complicated by the fact that different labs used different thresholds when doing analysis using the same instruments. “Only after we went through all that did we realize that now we have to go back and get the actual thresholds in exquisite detail,” Omenn said. He said that this analysis is going on now.
Another question generated by this data review was the issue of what should count as a protein hit — and the related question of how to classify the 80 percent of the reported protein identifications that came from single peptide hits. “Do you discard all those [single peptide hits]? Or do you just put them in a special bin, and what about their properties, what if sometimes you have a single hit with a very high score, and very nice spectral features?” Omenn asked.
The workshop attendees started answering these questions by first simplifying the dataset. Of the 58,000 IDs that were reported to HPPP, only 20,000 were left after removal of duplicate reports, and a further 9,000 were chopped off by restricting database usage to only the International Protein Index database maintained at the European Bioinformatics Institute — leaving a list of 11,000 proteins. This list is now being organized into several different cuts — one that accepts only IDs supported by two or more peptide hits, one that accepts only IDs that have been independently confirmed in another sample or another laboratory, and then three independent analyses using various technologies to group the hits. For the independent analyses, Jimmy Eng and Ruedi Aebersold at the Institute for Systems Biology are using probability-scoring software to calculate the probability that a correct sequence was generated based on spectral features. An Australian group led by Eugene Kapp is collaborating with David Fenyo at GE Healthcare (formerly Amersham Biosciences) to analyze the spectra from the datasets using different search engines. A third group, led by Ilan Beer at the IBM Research Lab in Haifa, Israel, is using a software system called PepMiner to cluster the spectra according to their features without looking at the peptides themselves.
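The successive cuts described above amount to a simple filtering pass over the reported identifications. As a minimal sketch only — the record fields, accession numbers, and thresholds below are illustrative assumptions, not the HPPP's actual pipeline or data — the logic might look like this:

```python
from collections import defaultdict

# Each identification is (protein_accession, reporting_lab, peptide_count).
# These records are hypothetical; real HPPP submissions carried far more
# annotation (scores, spectra, sample metadata).
ids = [
    ("IPI001", "lab_A", 3),
    ("IPI001", "lab_B", 1),
    ("IPI002", "lab_A", 1),
    ("IPI003", "lab_C", 2),
    ("IPI003", "lab_C", 2),  # duplicate report, collapsed below
]

def build_cuts(ids):
    labs = defaultdict(set)  # protein -> set of labs reporting it
    best = defaultdict(int)  # protein -> max peptide count in any one report
    for acc, lab, n_pep in ids:
        labs[acc].add(lab)
        best[acc] = max(best[acc], n_pep)

    nonredundant = set(labs)                           # duplicates removed
    multi_peptide = {a for a in labs if best[a] >= 2}  # two-peptide-or-more cut
    confirmed = {a for a in labs if len(labs[a]) >= 2} # independently confirmed cut
    return nonredundant, multi_peptide, confirmed

nr, multi, conf = build_cuts(ids)
```

On this toy input, the nonredundant list has three proteins, two survive the two-peptide cut, and only one is independently confirmed — mirroring how each successive cut in the HPPP analysis trades coverage for confidence.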
In addition to these projects, other groups are still in the process of completing analyses of data coming from many different datasets, including: quantitative immunoassay data; SELDI data; antibody array data; protein annotation data; glycoproteomics data; and high abundance protein removal data. This last dataset will figure prominently in the paper. “There is no doubt — if there ever was — that removing the six most abundant proteins, or at least the albumin and immunoglobulins, greatly enhances the ease of identifying many other proteins,” Omenn said.
In the end, Omenn hopes to come up with something that is coherent on a grand scale. “Our aim is to get this sufficiently analyzed to be a major advance over anything previously published, and to make publicly available this massive database with sufficiently friendly features,” he said.
Hitting Up Uncle Sam
Much will depend on the results. Following an April meeting with “at least 17 companies and a dozen NIH institutes” regarding funding for the next stage of HPPP, Omenn said that he found plenty of enthusiasm, but that raising sufficient funds looked to be a major undertaking. He hopes that the pilot phase results will help convince government and industry investors to ante up. “Our aim is to get this pilot phase to the state where they will be excited about what was done so far, and then we can define a work plan sufficiently compelling for them to [reinvest],” Omenn said. He added that although “a couple of the major corporate sponsors” have already signed on for a second year of funding, with the NIH “it will be a little more complicated, because they would prefer to do things the way they normally do things, especially with the budget crunch in the administration now.” The NIH would rather the HPPP help the institutes organize responses to their disease-specific RFAs, instead of just putting money into the general initiative, Omenn said. “We’ll have to think how best to do that. We don’t want competition with the investigators, [but] we might be able to help the disease-oriented investigators by bringing together a small consortium of laboratories on the proteomics side when there’s a good match.”
Another option would be to get trans-NIH investment in technology development. Six institutes of the NIH earlier agreed to provide trans-NIH funding for the pilot phase (see PM 1-23-04), but the difficulty with extending broad funding on the technology side is that “there’s a Catch-22 in that the more you show the technology is ready to be employed, the less they’re going to invest in technology development, but if you show it’s not advanced enough, then they may get discouraged,” Omenn said. But of course the nature of proteomics is that technology is “a moving target,” as Omenn described it.
“During the pilot, the [Agilent Multiple Affinity Removal Column] depletion product was tried with some of us and then commercialized. The LTQ and FT-ICR mass spectrometers are coming into broader use. So it’s the nature of an emerging field,” Omenn said.
— KAM