Researchers from Johns Hopkins and the Institute of Bioinformatics in Bangalore, India, have developed a plasma proteome database that features colorful protein information pages with diagrams of plasma proteins mapped onto their genes, along with protein sequences, gene sequences and other basic information.
“We created this because as a user, this is what I would like to have,” said Akhilesh Pandey, an assistant professor in the department of biological chemistry at Johns Hopkins who founded the Institute of Bioinformatics and spearheaded the creation of the database. “I compare this to Google, because it makes your life very simple, but the technology that goes into it is very powerful,” he said, referring to the popular Internet search engine. “You need to be advanced to give that simplicity.”
Pandey's new database, called simply the Plasma Proteome Database, includes only published data. That sets it apart from the European Bioinformatics Institute's new Protein Identification database (PRIDE), which was created in part to serve the Human Proteome Organization's Plasma Proteome Project. Rather than serving as a working repository where users deposit their experimental data, the Plasma Proteome Database serves as a user-friendly summary of what is known about proteins found in human plasma.
“PRIDE is a warehouse, so they’re not going to figure out 'Was this protein shown to be glycosylated 20 years ago?'” said Pandey. “Our source is the published literature. We're not just about identifying proteins in one way — the high-throughput way. There are lots of biologists who go into cells and purify proteins using certain low-throughput methods, and those proteins and isoforms are included in our database as well.”
While PRIDE covers all human tissues, including the plasma, liver, and brain tissues being investigated by various HUPO initiatives, Pandey said he decided to create a database only for plasma because plasma is the most commonly analyzed clinical sample, and clinicians need a user-friendly tool for obtaining information about plasma proteins, including disease-detecting biomarkers.
“Sorting out, accumulating, and presenting in a user-friendly and queryable manner the vast amount of data that has already been generated before and during the Plasma Proteome Project will bring some sense and direction to future research,” the Plasma Proteome Database website states. “This is why the Plasma Proteome Database was created.”
Initially funded in part by HUPO, the Plasma Proteome Database is now funded by collaborations between Johns Hopkins, the University of Michigan and Memorial Sloan-Kettering Cancer Center. There are about 40 to 45 curators working on the database, said Pandey.
The database supports Boolean searches on where a protein is expressed, where it is localized within a cell, its domain structure, the diseases it is associated with, its function, and whether it carries modifications such as glycosylation or acetylation.
“You can ask me — 'Give me a protein that is found in the plasma involved in diabetes that is cleaved in a certain way' — this is not doable in most other databases today,” said Pandey. “The information may be there [in other databases], but it is not accessible unless you are a programmer.”
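As a rough illustration of the kind of combined query Pandey describes, the Python sketch below filters a handful of made-up records by tissue, disease, and modification at once. The record layout, the `search` function, and the protein names are hypothetical and are not taken from the Plasma Proteome Database itself.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinEntry:
    """Hypothetical record; the real database schema is not described in the article."""
    name: str
    tissues: set = field(default_factory=set)
    diseases: set = field(default_factory=set)
    modifications: set = field(default_factory=set)


# Made-up entries used only to exercise the query logic.
ENTRIES = [
    ProteinEntry("protein_A", {"plasma"}, {"diabetes"}, {"cleavage"}),
    ProteinEntry("protein_B", {"plasma"}, {"diabetes"}, {"glycosylation"}),
    ProteinEntry("protein_C", {"liver"}, {"diabetes"}, {"cleavage"}),
]


def search(entries, tissue=None, disease=None, modification=None):
    """Return names of entries that satisfy every supplied criterion (Boolean AND)."""
    hits = []
    for entry in entries:
        if tissue is not None and tissue not in entry.tissues:
            continue
        if disease is not None and disease not in entry.diseases:
            continue
        if modification is not None and modification not in entry.modifications:
            continue
        hits.append(entry.name)
    return hits


# "A protein found in the plasma, involved in diabetes, that is cleaved."
result = search(ENTRIES, tissue="plasma", disease="diabetes", modification="cleavage")
print(result)
```

Omitting an argument leaves that criterion unconstrained, so the same function covers both broad and narrow queries.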
After a query has been entered, the database pulls up a colorful “molecule page” that includes alternative names for the protein, the molecular function of the protein, the gene onto which the protein maps, the protein and gene sequences, the molecular weight of the protein, protein domains and isoforms, as well as links to other protein databases such as SwissProt and the National Center for Biotechnology Information database.
“With the molecule page, in one visual look, you feel as if you have a visual appreciation of the molecule,” said Pandey. “You see all the post-translational molecules, categorized and shown visually, and you see the different isoforms. It's very graphic and visual.”
Mapping proteins onto genes ensures that proteins such as immunoglobulins, which have many different forms but belong to the same family, are given only one name and one entry, instead of a long list of entries based on different peptide sequences.
“If at the end of the day what you're talking about is IgM or IgA, you shouldn't be giving 2,000 entries to an immunoglobulin,” said Pandey. “A similar scenario exists for other protein families. We ask, 'Is it coming from the same gene product? What are the proteins and the isoforms?'”
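The gene-centric bookkeeping Pandey describes can be sketched in a few lines of Python: observations are keyed by gene, so repeated or variant reports collapse into one entry per gene with its isoforms listed underneath. The gene labels, isoform labels, and the `group_by_gene` helper are hypothetical placeholders, not the database's actual data or code.

```python
from collections import defaultdict

# Made-up (gene, isoform) observations, including one duplicate report.
observations = [
    ("GENE_X", "isoform 1"),
    ("GENE_X", "isoform 2"),
    ("GENE_Y", "isoform 1"),
    ("GENE_X", "isoform 1"),
]


def group_by_gene(observations):
    """Collapse observations into one entry per gene, each with a sorted list of isoforms."""
    entries = defaultdict(set)
    for gene, isoform in observations:
        entries[gene].add(isoform)  # the set discards duplicate reports
    return {gene: sorted(isoforms) for gene, isoforms in entries.items()}


grouped = group_by_gene(observations)
for gene, isoforms in grouped.items():
    print(gene, isoforms)
```

Four observations become two entries here, which is the same effect, on a small scale, as giving an immunoglobulin one entry rather than thousands.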
Pandey and his team began creating the Plasma Proteome Database in January. So far, the database includes 7,961 proteins and protein isoforms mapped onto 4,090 unique gene loci. That exceeds the 3,000 plasma proteins identified by the HUPO PPP because the database also includes proteins reported in the literature that were found using methods other than mass spectrometry, such as antibody-based techniques, Pandey said.
Entries in the Plasma Proteome Database can be annotated. For example, if researchers were looking for a protein to be cleaved, but instead found that it was phosphorylated, they could annotate the database to include that information.
“It may be useless data for them, but it's valuable for the community,” said Pandey. “We want to turn the community into annotators.”
The Plasma Proteome Database is free of charge and can be accessed at plasmaproteomedatabase.org.