A Wikipedia-style portal for integrating and sharing proteomics data was unveiled this month, comprising what its main developer said is the largest and most diverse collection of experimental information on human proteins available.
The portal, dubbed the Human Proteinpedia and three years in the making, is described in the current edition of Nature Biotechnology.
Akhilesh Pandey, associate professor of biological chemistry, pathology, and oncology at Johns Hopkins University, who spearheaded the creation of Proteinpedia, said that proteomics doesn’t so much need more data from protein studies as access to existing findings.
“I feel that we are using a lot of taxpayers’ money … to generate all of the data, and we somehow magically feel that a resource such as PubMed should basically make all this content available to everyone,” Pandey told ProteoMonitor. “And that is just not true.
“Even if the data is out there, it’s very difficult for the people who could use the data to know that [it] exists. That’s because data has so many facets today, and until now people have focused on genomic data, and that’s relatively straightforward,” he added. “We ventured into something that was even more difficult than genomic data to start with.”
Since Proteinpedia was started, it has amassed data on 15,230 human proteins and 203,293 annotations. In addition, it contains more than 4.5 million MS/MS spectra, 138,487 protein expression entries, and 17,108 post-translational modifications.
To be sure, several other projects aimed at cataloging information about proteins have been around for several years, including UniProt, the PeptideAtlas, and the PRoteomics IDEntifications database. But according to the Nature Biotechnology article, Proteinpedia differs from other protein data repositories in several ways.
For one thing, Proteinpedia can accommodate data from different platforms such as mass spectrometry, yeast two-hybrid assays, protein/peptide arrays, and Western blots. Contributors can also annotate data for six features — post-translational modifications; tissue expression; cell-line expression; subcellular localization; enzyme substrates; and protein-protein interactions. No other repository allows annotation of so many features, Pandey and his co-authors write.
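As a rough illustration of how such a submission might be structured — this is not Proteinpedia's actual schema, and all field and type names here are hypothetical — an annotation record covering those six feature types could be modeled like this:

```python
from dataclasses import dataclass, field
from enum import Enum

class Feature(Enum):
    """The six annotation feature types described in the article."""
    POST_TRANSLATIONAL_MODIFICATION = "ptm"
    TISSUE_EXPRESSION = "tissue_expression"
    CELL_LINE_EXPRESSION = "cell_line_expression"
    SUBCELLULAR_LOCALIZATION = "subcellular_localization"
    ENZYME_SUBSTRATE = "enzyme_substrate"
    PROTEIN_PROTEIN_INTERACTION = "ppi"

@dataclass
class Annotation:
    """One experimentally backed annotation for a single protein."""
    protein_id: str   # e.g. an HPRD accession (hypothetical format)
    feature: Feature  # which of the six feature types this describes
    value: str        # the annotated finding itself
    platform: str     # experimental platform, e.g. "mass spectrometry"
    meta: dict = field(default_factory=dict)  # sample, isolation method, etc.

# Example: recording a phosphorylation detected by MS/MS
ann = Annotation(
    protein_id="HPRD_00001",
    feature=Feature.POST_TRANSLATIONAL_MODIFICATION,
    value="phosphorylation",
    platform="mass spectrometry",
)
print(ann.feature.value)  # -> ptm
```

The point of keying each record to one of a fixed set of feature types is that data from very different platforms (mass spec, Western blots, arrays) can land in one comparable structure, which is the integration the article describes.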
All data submitted to Proteinpedia can be viewed through the Human Protein Reference Database “in the context of other features of the corresponding proteins,” according to the article. “To aid comparison and interpretation, meta-annotations pertaining to samples, method of isolation and experimental platform-specific information are provided.”
And finally, because the data submission process is simplified, a researcher with no technical expertise can log in and submit data. Data can be entered directly at Proteinpedia’s website.
“This is ultimately going to be the encyclopedic reference of these proteins on the web where you can find all about these proteins,” Pandey said. “The real key will be that different users will find different uses [for] the same data.”
The data will have as much use for those working in cancer and other disease areas, he said, as for those working on protein-protein interactions or cell signaling.
To date, more than 71 researchers and their labs have contributed data to Proteinpedia. They include William Hancock at Northeastern University, Richard Simpson at the Ludwig Institute for Cancer Research, Sam Hanash at the Fred Hutchinson Cancer Research Center, Matthias Mann at the Max Planck Institute of Biochemistry, and Natalie Ahn at the University of Colorado at Boulder. Vendors such as Agilent Technologies and Cellzome have also submitted data to Proteinpedia, Pandey said.
According to Hancock, one of the differentiating aspects of Proteinpedia is its global scope.
“We have to tease apart the environmental and racial differences in the development of disease. We can’t just sit in one area of the world and do [a] defining study of a particular disease,” Hancock said. “I think the mass spectrometers are performing quite well now [and] we can characterize a large number of proteins, but the proteome is very complex and I think we need to do it on a global basis, and not have different groups go off and do their individual proteins and duplicate the work, but really merge the data together.”
Keith Waddell, LC-MS applications manager for Agilent, said that with so many different methodologies and platforms being used for proteomics work, getting to the data is a challenge.
“What typically happens in this field is that people focus on mass spec, or Western blots, or microarrays,” he said. “It’s unusual to have [all that data] pulled together.”
The company shared human HeLa data with Pandey to submit into Proteinpedia.
Pandey is “integrating proteomic data from immunoprecipitation mass spec, Western blots, immunohistochemistry, fluorescence, microarrays,” Waddell added. “There’s a whole gamut of information that he’s pulling together.”
Still, Pandey concedes that 71 participants is a “low number,” and eventually he would like to see contributions from the 500 to 1,000 laboratories he estimates are generating proteomics data.
With so much proteomics data being generated, debate has grown within the community over the quality of the work being done and the data that comes out of the research. Proteinpedia, however, will accept any and all data as long as there is experimental proof supporting the findings, Pandey said. He will leave it up to the proteomics community at large to decide which data has value and which should be dismissed.
“We have to be democratic. We cannot have our own arbitrary parameters,” he said. “So we are hoping that … people will police themselves, and it will eventually be only good data.”
Unlike on Wikipedia, only the original contributor of data can edit his or her data.
Raw mass spec data is stored and disseminated through the Tranche file-sharing network that supports ProteomeCommons. All other data is available through the Proteinpedia website.
The mass spec data sets are currently hosted in triplicate across more than 16 servers. Proteinpedia is maintained by a team of about nine people at the Institute of Bioinformatics, a non-profit based in Bangalore, India, that Pandey founded.
As Proteinpedia grows, it will need more storage capacity for the data. Pandey said he recently asked Microsoft and Google for help on the project. He had not received any responses yet, he said.
“I feel that until now, what has been done …is we have been building castles in the air,” he said. “We all come out with proof-of-principles studies of systems biology, but systems biology can only build on a strong foundation where we know a lot about each component.
“So I feel that we are enabling systems biology and … of course we will maybe start to harness this data and harvest it to try to make connections that have not been made before,” he said.