Researchers associated with the Human Proteome Organization's Brain Proteome Project are currently poring over the data from the project's pilot phase to prepare a series of papers for a special issue of the journal Proteomics that will be published in conjunction with the HUPO World Congress this fall.
But some conclusions from the pilot project are already clear, at least from a bioinformatics perspective. Christian Stephan, leader of the bioinformatics working group at the Medical Proteome Center in Bochum, Germany, which hosted the data collection center for the HBPP, told BioInform in a recent interview that the pilot project identified many of the challenges associated with comparing proteomics data from multiple labs, and also highlighted the need for better standards in the field.
In addition, Stephan said, the pilot project validated the role of the HBPP data collection center, which developed an automated pipeline to reprocess all of the data submitted from nine participating labs — 37 data sets in all, 19 from human and 18 from mouse, comprising around 750,000 mass spectra and 150 gigabytes.
Each of the nine labs sent peak lists as well as their analyzed data to the DCC, Stephan said, "and this is a problem if you try to compare the data from different laboratories because they have their own parameters for identifying proteins."
In addition, Stephan said, the labs were all running different technology platforms (various flavors of 2D gels, 1D gels, mass spectrometers, and chromatography) as well as different types of software. The labs also had different goals for protein identification: some groups were trying to identify every protein present in the brain, while others were looking for differential expression between diseased and healthy samples, or across different time scales.
Finally, he said, although some proteomics standards efforts are now getting under way, no standards existed when the project kicked off two years ago for storing peak lists, or for the laboratory use of mass specs, gels, or reagents.
The DCC's goal, therefore, was to collect all the data from the project and reprocess it "with a unified parameter set to compare them later between the different stages, but also to compare them between different laboratories."
The DCC used a 128-CPU Fujitsu cluster running ProteinScape, a proteomics data-management software package co-developed by Bruker Daltonik and Protagen, as the heart of its informatics infrastructure. Stephan said, however, that his team had to develop a number of new algorithms and a new export/import model for the software to handle the data coming from multiple groups.
One of the DCC's first goals was to develop a way to estimate the false-positive rate, which serves as the cut-off for the final protein list. To do this, the DCC created a "decoy" protein database by shuffling all the amino acids of the original proteins in the International Protein Index (IPI). Hits for peptides from the decoy database were considered false positives, and the cut-off was set at a threshold of 5 percent, Stephan said. Within each list of protein IDs, only the top-scoring proteins with up to 5 percent false positives made it into the final list.
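The decoy approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the DCC's actual pipeline: the function names, the `DECOY_` prefix, the `(accession, score, is_decoy)` hit format, and the choice to keep the longest top-scoring stretch of the ranked list whose decoy fraction stays at or below 5 percent are all assumptions made for the example.

```python
import random


def make_decoy(sequence, rng):
    """Shuffle the residues of one protein sequence (composition is preserved)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def build_decoy_database(proteins, seed=0):
    """Build a shuffled decoy entry for every protein.

    `proteins` maps accession -> amino-acid sequence. The "DECOY_" prefix is a
    hypothetical naming convention used only in this sketch.
    """
    rng = random.Random(seed)
    return {"DECOY_" + acc: make_decoy(seq, rng) for acc, seq in proteins.items()}


def apply_false_positive_cutoff(scored_hits, max_fp_rate=0.05):
    """Return the top-scoring target accessions within the false-positive limit.

    `scored_hits` is a list of (accession, score, is_decoy) tuples. Hits are
    ranked by descending score, and the longest prefix whose fraction of decoy
    hits is <= max_fp_rate is kept (an assumed interpretation of the cut-off).
    """
    ranked = sorted(scored_hits, key=lambda hit: hit[1], reverse=True)
    targets = []
    decoys_seen = 0
    keep = 0  # number of target hits inside the best acceptable prefix
    for position, (accession, _score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys_seen += 1
        else:
            targets.append(accession)
        if decoys_seen / position <= max_fp_rate:
            keep = len(targets)
    return targets[:keep]


if __name__ == "__main__":
    toy_proteins = {"P1": "MKTAYIAK", "P2": "GAVLIMC"}  # made-up sequences
    print(build_decoy_database(toy_proteins, seed=1))
```

In practice the target and decoy sequences would be searched together, so that each hit can be labeled by which database it came from before the cut-off is applied.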
To make matters more complicated, the DCC used three different search engines to analyze the data — Sequest, Mascot, and Protein Solver. Each of these search engines was used to generate a list of peptides, and these peptides were then combined into proteins using the ProteinExtractor algorithm in ProteinScape.
Stephan said that the DCC developed another algorithm called Protein Merger to merge these three separate protein lists into a single consensus list based on the scores of the peptides from the three different search engines.
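The article does not describe how Protein Merger combines the scores, so the following is only an illustrative sketch of one generic way to merge lists from engines that score on different scales. It is not the DCC's algorithm. The rank-based normalization, the summing of normalized scores, and all function and engine key names are assumptions made for the example.

```python
def rank_normalize(scores):
    """Map raw scores from one engine to (0, 1], with the best hit at 1.0.

    Using ranks rather than raw values makes engines with different scoring
    scales comparable. This normalization is an assumption of the sketch.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    total = len(ranked)
    return {protein: (total - index) / total for index, protein in enumerate(ranked)}


def merge_protein_lists(engine_results):
    """Merge per-engine protein scores into a single ranked consensus list.

    `engine_results` maps a (hypothetical) engine name to a dict of
    protein accession -> raw score. Proteins found by several engines
    accumulate a higher combined score and therefore rank higher.
    """
    combined = {}
    for scores in engine_results.values():
        for protein, normalized in rank_normalize(scores).items():
            combined[protein] = combined.get(protein, 0.0) + normalized
    # Sort by descending combined score; break ties alphabetically.
    return sorted(combined, key=lambda protein: (-combined[protein], protein))


if __name__ == "__main__":
    toy_results = {  # made-up accessions and scores
        "engine_a": {"A": 3.2, "B": 2.1},
        "engine_b": {"A": 60.0, "C": 45.0},
        "engine_c": {"B": 10.0, "A": 9.0},
    }
    print(merge_protein_lists(toy_results))
```

A rank-based scheme like this sidesteps the problem that raw scores from different search engines are not directly comparable, at the cost of discarding the size of the score differences within each engine.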
Stephan said that the approach, while complex, provided a final protein list with a high degree of confidence and a known false-positive rate. However, he said, it is difficult to assess the level of overlap between the results of the three search engines, because they all identify different numbers of proteins, which makes the cut-off points different.
Likewise, he said, comparing the initial results from the participating labs with the final results is difficult, because the labs all used different methods.
"That is another reason [why] you can't compare one lab with another without central reprocessing," he said. "This was the only way, in my opinion, to compare data from different participating labs."
Although funding for the main phase of the HUPO BPP has not yet been secured, the HUPO organizers are already planning for the project, and intend to focus on biomarker discovery for Alzheimer's disease and Parkinson's disease. Stephan said that the bioinformatics pipeline should be sufficient to analyze data from the main phase of the effort.
"What we have learned from the pilot phase — and this is maybe the main conclusion — is that we have to produce reliable data sets," he said.
- Bernadette Toner