This week, SAP, a German software development firm, and the Technical University Munich's proteomics and bioanalytics department launched ProteomicsDB, a free web-based repository of human proteins and peptides from mass spectrometry experiments that’s based on the SAP Hana platform — an in-memory database system.
ProteomicsDB contains more than 11,000 datasets from human cancer cell lines, tissues, and body fluids that users can explore either by protein or by chromosome. According to its developers, the data in the repository maps to more than 18,000 human genes, or about 92 percent of the human proteome.
Bernhard Kuster, the head of TUM's proteomics and bioanalytics department, told BioInform that the idea for the resource grew out of an internal need at TUM for a place to store proteomics data so that it could be mined and interpreted. The group also wanted a clearer picture, he said, of how much of the human proteome had actually been identified.
The SAP collaboration was the result of a pre-existing business relationship between Kuster and a current SAP employee, he said. The two had previously co-developed a database system for a small biotechnology company.
SAP was interested in working with TUM because, like other large IT companies, they "have an interest in trying to explore how [to] deal with these large quantities of data," particularly in the context of personalized medicine, he said.
The database is backed by 50 terabytes of storage, 2 TB of random access memory, and 160 processing units. According to Kuster, the combination of compute power and in-memory technology makes ProteomicsDB much faster than similar resources such as the Proteomics Identifications (PRIDE) database, a public repository of proteomics data managed by the European Bioinformatics Institute.
Because "most of the data sits … in the computer['s] memory," it can respond to queries much faster than systems that store information on external disks, he explained. This allows "[us] to get the data out in real time … you type in the name of a protein and it will not be more than a second or two before you can actually see the data and … drill down to the level of individual tandem mass spectra … to see underlying evidence."
Other features of the tool include a direct interface to three programming languages (L, C++, and R), which allows more flexible calculations than are possible with standard Structured Query Language (SQL). Its interface is built on a JavaScript framework for HTML5 and is optimized for Google Chrome, but it also supports Internet Explorer and Mozilla Firefox.
The developers believe that the database will be a source of useful targets for drug development to the pharmaceutical and biotechnology industries, and that it will also be of use in other kinds of biomedical research efforts.
Currently, ProteomicsDB has about 2.96 TB of data. A large portion of the information, Kuster said, was generated at TUM, but it also includes data from researchers at other institutions. For example, scientists from the Laboratory of Proteome Research in Japan's National Institute of Biomedical Innovation provided data from a study they conducted as part of the Chromosome-centric Human Proteome Project, an effort to create an annotated proteomic catalog for each chromosome.
According to a paper published in the Journal of Proteome Research, Shiromizu et al. "integrated proteomic and phosphoproteomic analysis results from chromosome-independent biomarker discovery research to create a chromosome-based list of proteins and phosphorylation sites." The data used in the study came from five colorectal cancer tissue and cell line samples. The researchers reported that they were able to identify and categorize "1,278 proteins, including 8,305 phosphoproteins and 28,205 phosphorylation sites … on a chromosome-by-chromosome basis."
Another data source is a study conducted by researchers from Queen Mary University of London. That study, published in Science Signaling by Casado et al., describes a computational approach called kinase-substrate enrichment analysis that was used "to systematically infer the activation of given kinase pathways from mass spectrometry-based phosphoproteomic analysis of acute myeloid leukemia cells."
ProteomicsDB also contains information from several publicly available resources such as PeptideAtlas, a compendium of peptides from tandem mass spectrometry projects that is managed by the Seattle, Wash.-based Institute for Systems Biology; and the ProteomeXchange, an EBI-led consortium that coordinates the submission of MS proteomics data to existing repositories.
Now, "we are trying to fill up the gaps" in the repository and "we are trying to get the community engaged," Kuster said. The group has added an "adopt a protein" module to ProteomicsDB's home page through which proteomics labs can submit new proteins that aren't already included in the database.
The group is also working on extending ProteomicsDB's analysis capabilities to include statistical and biological interpretation tools for processing raw sequence data, Kuster said.
"Down the road, we would like to link it to some of the other public resources so that we could share the information better," he added. For example, "there are efforts in the genomics community that use similar tools [that] we will try to link [to]."
The partnership with TUM increases SAP's footprint in the life sciences space, an arena into which the company recently began making inroads with agreements that have been primarily on the commercial side.
Last July, Qiagen announced that it was partnering with SAP to develop bioinformatics tools based on the SAP Hana platform as part of Qiagen's efforts to develop a complete "sample-to-result" workflow for next-generation sequencing (BI 7/6/2012).
Then in November, MolecularHealth announced that it was integrating its oncology treatment decision support solution with SAP Hana to shorten the time required to analyze clinical, molecular, and drug data (BI 11/16/2012).