For much of his 35 years, Akhilesh Pandey has been plagued by the question, “Just how many human proteins are there anyway?” The India-born biochemist, who studied under Harvey Lodish at MIT and Matthias Mann at the University of Southern Denmark, laments, for instance, that mass spec experiments generate 25 separate data entries for a single protein.
But now, he says, if anyone is going to settle the protein-count question once and for all, it will be him and his team, armed with what they hope will be the definitive proteomics data resource: the Human Protein Reference Database, “a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks, and disease association for each protein in the human proteome.”
Since Pandey quietly unveiled the database (available at http://www.hprd.org) in March, it has drawn 1.5 million hits through word of mouth alone, and Proteomics Standards Initiative coordinator Henning Hermjakob has invited him to join that effort. Pandey expects wider attention after the database makes its official debut in a paper scheduled for publication in Genome Research next month.
Pandey, an assistant professor at the Johns Hopkins University Institute for Genetic Medicine, exhausted his life savings to establish the nonprofit Institute of Bioinformatics in May 2002. In a Bangalore, India, office park, 35 biologists and software engineers work in tandem with Pandey’s 10-person lab at Hopkins to manually curate the Human Protein Reference Database. In a little over a year, reading an average of 10 to 20 papers per person per day, they’ve extracted and classified protein references from more than 300,000 scientific articles. Current HPRD stats, according to the website: 2,750 proteins; 10,534 protein-protein interactions; 417 domains; 2,000 post-translational modifications; and 25,050 PubMed links.
HPRD appears to be competing with established efforts such as BIND, InterPro, and UniProt (the planned merger of Swiss-Prot, TrEMBL, and the Protein Information Resource), as well as commercial offerings such as the BioKnowledge Library from Incyte’s Proteome division. Accordingly, in a FAQ section on the website, Pandey goes so far as to answer those who would ask, “Why create yet another protein database?”
The site explains: “We believe that biological databases are still in their early stages and no protein database can be considered as an established standard…We want to offer biologists the possibility of choosing instead of imposing one database by default.”
Pandey added that the manual curation process at the heart of the database’s creation offers a key advantage over other resources that rely on text-mining algorithms. “The slow way is the fastest way,” he said.
Site designers have also been careful to create a user-friendly interface with simple graphics, along with a browsable database whose query system lets users retrieve proteins based on a large number of parameters or submit their own data or comments. Clicking on a protein brings up a summary page with tabs offering additional details, such as alternate names, protein and DNA sequence, interacting proteins, expression sites, post-translational modifications and substrates, diseases, and external links to OMIM, Swiss-Prot, UniGene, LocusLink, and other resources.
The database was built using Zope, an open source web application server written in Python. “Most bioinformaticists are familiar with Perl, but this is simpler and more powerful,” Pandey said. “We think this will be the next standard language in bioinformatics.” An object-oriented approach was chosen over the traditional relational database format because “classical technologies are not suited for biological information such as protein data.”
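To make the object-oriented idea concrete, here is a minimal Python sketch of how a protein record might be held as an object that carries its own domains, modifications, and interaction partners, rather than as rows spread across relational tables. It is only an illustration: the class names, fields, the `find_proteins` helper, and the placeholder entries are all hypothetical and are not taken from HPRD’s actual schema or its Zope code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Modification:
    kind: str      # e.g. "phosphorylation"
    residue: int   # 1-based position in the protein sequence


# eq=False keeps identity-based comparison, which avoids recursing
# through interaction partners that point back at each other.
@dataclass(eq=False)
class Protein:
    name: str
    domains: List[str] = field(default_factory=list)
    modifications: List[Modification] = field(default_factory=list)
    interactors: List["Protein"] = field(default_factory=list)
    pubmed_ids: List[int] = field(default_factory=list)

    def add_interaction(self, other: "Protein") -> None:
        """Record a protein-protein interaction on both partners."""
        if other not in self.interactors:
            self.interactors.append(other)
        if self not in other.interactors:
            other.interactors.append(self)


def find_proteins(proteins: List[Protein],
                  domain: Optional[str] = None,
                  modification: Optional[str] = None) -> List[Protein]:
    """Return proteins matching every filter that was supplied."""
    hits = []
    for p in proteins:
        if domain is not None and domain not in p.domains:
            continue
        if modification is not None and all(
                m.kind != modification for m in p.modifications):
            continue
        hits.append(p)
    return hits


# Placeholder entries for illustration only, not curated data.
kinase = Protein("EXAMPLE_KINASE", domains=["SH2", "kinase"])
adaptor = Protein("EXAMPLE_ADAPTOR", domains=["SH2", "SH3"],
                  modifications=[Modification("phosphorylation", 42)])
kinase.add_interaction(adaptor)

sh2_names = [p.name for p in find_proteins([kinase, adaptor], domain="SH2")]
phospho_names = [p.name for p in
                 find_proteins([kinase, adaptor], modification="phosphorylation")]
print(sh2_names)
print(phospho_names)
```

In this sketch a query for a domain or a modification walks the objects directly, and an interaction is a reference from one protein object to another, which is one way to read the preference for objects over relational tables described above.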
Why pour your life savings into such a project? “I have a family that believes in my vision,” Pandey said. “In the US you have to wait for funding, you have to put on a public show. This is reverse engineering. The funding will come later.” Ultimately, Pandey said he envisions the Bangalore institute evolving into a center of systems biology excellence.
The content of HPRD will be freely available to academic researchers upon publication of the Genome Research paper. Commercial entities will have to pay a fee for use of the data, under a licensing agreement similar to that of Swiss-Prot. The underlying software used to create the resource will also be freely available, under the LGPL, when the paper is published.
— Adrienne J. Burke