Unity is possible after all, at least in the world of protein sequence databases. Backed with $15 million in funding from the NIH, three primary sources of protein sequence data will merge into a single global resource.
The new database, called the United Protein Database, or UniProt (eventually available at www.uniprot.org), will combine the current contents of Swiss-Prot, TrEMBL, and the Protein Information Resource (PIR). Rolf Apweiler, who leads the Swiss-Prot group at the European Bioinformatics Institute (EBI), will head up the international project. Amos Bairoch, Swiss-Prot founder and a group leader at the Swiss Institute of Bioinformatics (SIB), and Cathy Wu, a professor of biochemistry and molecular biology at Georgetown University Medical Center who oversees PIR, are co-investigators.
Swiss-Prot, a hand-curated protein sequence database, currently holds entries on 114,000 proteins. TrEMBL, which was created by the EBI and SIB to act as a preliminary “holding” resource for computer-annotated protein data not yet ready for Swiss-Prot, contains 700,000 proteins. PIR, a joint effort of Georgetown University Medical Center and the National Biomedical Research Foundation, was based on Margaret Dayhoff’s Atlas of Protein Sequence and Structure, the first comprehensive collection of protein sequences. It contains 283,000 records.
Dayhoff’s concept of protein families and super-families, defined by sequence similarity, serves as the basis for PIR’s functional and structural annotations, and is the key differentiator between the resources, according to Wu. However, this difference will only enrich the combined resource, she said. “Essentially, what will happen is that we’ll apply our annotation method to the Swiss-Prot and TrEMBL data,” she explained.
UniProt will retain the pipeline now in place at EBI/SIB and will contain two parts: the Swiss-Prot section for fully annotated entries, and the TrEMBL section for computer-annotated records awaiting manual curation. PIR records will be folded into the existing Swiss-Prot pipeline. The PIR researchers will eventually cease maintaining PIR, and will instead focus on annotating the backlog of TrEMBL records.
UniProt should hold records on well over two million proteins by the end of the three-year grant.
Apweiler said that a total of around 100 researchers at the three organizations would be working on UniProt. His team has already started “looking into mapping the way features are annotated in PIR and the way they are annotated in Swiss-Prot,” in order to come up with a clean plan for integrating the two resources. The immediate goal, he said, is to move the data out of PIR as quickly as possible into a single pipeline, a process that should take about two years.
Users should not be affected by the transition, according to Apweiler and Wu. The existing portals for all three sites will remain intact, including the PIR website (pir.georgetown.edu), which hosts other projects not covered by the UniProt grant, and will eventually host a mirror of UniProt as well.
Long-term, Apweiler envisions UniProt as a “hub” for a network of interoperable, cross-referenced protein and proteomics “satellite” databases. The EBI’s InterPro, PIR’s iProClass, and protein-protein interaction databases would all be built on top of standardized reference sequences in UniProt, he said.
Ultimately, the UniProt developers would like to see the database become freely available to researchers from both industry and the non-profit sector. Swiss-Prot is currently supported in part by licensing fees from commercial users, who must purchase a subscription to the database from GeneBio, a Geneva-based firm set up to maintain the database when its funding was uncertain. Apweiler said he’d like to see the commercial era of Swiss-Prot come to an end, but noted that no explicit plans have yet been put in place to ensure this. As for GeneBio, “We have contractual obligations with them that we will stick to,” he said. A spokesperson for the company was unable to comment for this article.
Wu, who herself rescued PIR from the brink of financial extinction, noted that the funding history for protein sequence data “hasn’t been a smooth ride,” but was hopeful that the UniProt grant is a signal that this information is now viewed as being as important as GenBank is for DNA sequence or the PDB is for protein structure.
“These are fundamental resources that are required — the most basic international bioinformatics infrastructure. They have to be open, so they have to have stable funding,” she said.
The National Human Genome Research Institute is contributing $3 million per year to UniProt. Other NIH participants include the National Institute of General Medical Sciences, $1 million; the National Library of Medicine, $460,000; the National Institute of Mental Health, $300,000; the National Center for Research Resources, $100,000; and the National Institute of Dental and Craniofacial Research, $50,000.
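As a rough check on how these contributions relate to the $15 million headline figure, the listed amounts can be added up in a few lines of Python. This is only a sketch: it assumes that every institute's figure is an annual amount, like the NHGRI's, and that each is paid in all three years of the grant, neither of which is stated outright. The dictionary keys are the institutes' standard abbreviations, and the variable names are illustrative.

```python
# Contributions in US dollars, as listed above.
# Assumption: each figure is a per-year amount, like the NHGRI's.
annual_contributions = {
    "NHGRI": 3_000_000,  # National Human Genome Research Institute
    "NIGMS": 1_000_000,  # National Institute of General Medical Sciences
    "NLM": 460_000,      # National Library of Medicine
    "NIMH": 300_000,     # National Institute of Mental Health
    "NCRR": 100_000,     # National Center for Research Resources
    "NIDCR": 50_000,     # National Institute of Dental and Craniofacial Research
}

# Assumption: the same amounts are paid in each year of the three-year grant.
GRANT_YEARS = 3

annual_total = sum(annual_contributions.values())
grant_total = annual_total * GRANT_YEARS

print(f"Per year: ${annual_total:,}")
print(f"Over {GRANT_YEARS} years: ${grant_total:,}")
```

Under those assumptions the six contributions come to $4.91 million per year, or $14.73 million over the three-year grant, which is consistent with the roughly $15 million total.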