The United Protein Database — the product of the merger of Swiss-Prot, TrEMBL, and the Protein Information Resource — launched this week with approximately 1.2 million entries and a “wedding cake” structure that chief integrator Rolf Apweiler hopes will give users more bang for their buck.
In fact, after Dec. 31, 2004, users won’t need a buck, according to Apweiler. Addressing speculation about the future of the licensing fees that commercial users still pay to GeneBio for the use of Swiss-Prot, Apweiler said that under the current agreement the fees will remain in effect for the Swiss-Prot portion of the database through the end of 2004. After that, “it’s completely free. All the license restrictions go away.”
The new database has a three-tiered format. At the base is the UniProt archive, a non-redundant protein sequence database. Its sequences are loaded daily from a variety of public databases, including Swiss-Prot, TrEMBL, and PIR, as well as others such as the Protein Data Bank and FlyBase. Apweiler said there are currently 2.2 million entries in the archive.
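To illustrate what "non-redundant" means for an archive loaded from several sources, here is a minimal Python sketch. It is not UniProt's actual loading code: the function name `build_archive`, the record layout, and the accessions `ACC1` to `ACC3` are hypothetical, and the sketch assumes that two records are redundant only when their sequences are exactly identical after normalizing case and whitespace.

```python
from collections import defaultdict


def build_archive(records):
    """Collapse (source, accession, sequence) records into one entry per unique sequence.

    Hypothetical helper for illustration only; not part of any UniProt tool.
    """
    archive = defaultdict(list)
    for source, accession, sequence in records:
        # Normalize case and strip whitespace so trivially different
        # spellings of the same sequence collapse into one entry.
        normalized = "".join(sequence.split()).upper()
        archive[normalized].append((source, accession))
    return dict(archive)


# Made-up records: two sources report the same sequence, one reports another.
records = [
    ("Swiss-Prot", "ACC1", "MKTAYIAKQR"),
    ("PIR", "ACC2", "mktayiakqr"),
    ("TrEMBL", "ACC3", "MSLLTEVETY"),
]
archive = build_archive(records)

for sequence, sources in archive.items():
    print(sequence, sources)
```

Each unique sequence is stored once, and the list attached to it records every source database that reported it.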
The UniProt Knowledgebase — the next layer of the cake — is the non-redundant unification of Swiss-Prot, TrEMBL and PIR data, including information on protein function and classification. The Knowledgebase currently has about 1.2 million entries.
UniRef, the last layer of the cake, lets researchers look at protein families by grouping entries according to pre-computed potential redundancy. “The curators always need to evaluate it — is it really the same biological object in these reports? … So we can’t do it in a reliable way automatically,” Apweiler said. Still, the thinking was that since the initial groupings were useful to curators, they might be useful to researchers as well, particularly to structural biologists looking for families, he said. UniRef currently classifies sequences within a particular species as 100 percent, 90 percent, or 50 percent identical, and Apweiler added that another database of 100 percent identical sequence families across several species will become available in January.
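The idea of grouping sequences at 100 percent, 90 percent, or 50 percent identity can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the method used to build UniRef: the `identity` function compares sequences position by position without any alignment, the greedy `cluster` function is a hypothetical stand-in for a real clustering tool, and the sequences are made up.

```python
def identity(a, b):
    """Fraction of matching positions over the longer length (no alignment)."""
    if not a and not b:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def cluster(sequences, threshold):
    """Greedy grouping: a sequence joins the first cluster whose
    representative (its first member) it matches at >= threshold,
    otherwise it starts a new cluster."""
    clusters = []
    # Longest sequences first, so representatives are the longest members.
    for seq in sorted(sequences, key=len, reverse=True):
        for members in clusters:
            if identity(seq, members[0]) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Made-up sequences: two identical, one differing at a single position,
# and one unrelated.
seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MSLLTEVETY"]

for threshold in (1.0, 0.9, 0.5):
    print(threshold, cluster(seqs, threshold))
```

Lowering the threshold merges more entries into the same group, which is why such groupings are useful for browsing families but, as Apweiler notes, still need a curator to decide whether the members are really the same biological object.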
The NIH awarded a three-year, $15 million grant for the creation of a centralized database to the providers of Swiss-Prot, TrEMBL, and PIR in October 2002. The providers were already talking about combining their resources when the RFA was announced, said Cathy Wu, director of PIR. “Amos [Bairoch, of the Swiss Institute of Bioinformatics] and Rolf and I already knew each other, and decided it would be a good idea to join forces; meanwhile, the NIH was very happy to see that they would be able to fund one single worldwide database,” Wu said.
Peter Good, the NIH program director in charge of UniProt, said that the decision to choose the three groups for the project was based on the strength of their plan for eliminating redundancy and for managing their transition, as well as the database managers’ status as leaders in the field. While the final product “looks beautiful,” Good said, “I’d be lying if I didn’t say I was disappointed that it took them this long to get it out — to have this come out a year and some months [into it]. But it was a tremendous management problem of getting these three cultures together and to get them to come together to agree on a common way of storing the data, to agree just on a common way of presenting the data.” Good noted that the launch did happen within the timeline proposed in the groups’ application, but said, “When you give that sum of money, and there isn’t anything immediately evident, there’s a certain amount of wondering what you got for your money.” Still, Good was optimistic that the database would be a success now that it had finally launched, although he added that “if they don’t do well, we’ll go and consider re-opening an RFA.”
Apweiler also acknowledged that getting the three groups together, and integrating PIR in particular, was a challenge, at least from a standardization point of view. “We moved everything from PIR into UniProt in the form of Swiss-Prot and TrEMBL. So that was quite a bit of work because there were different definitions for certain things,” he said. Apweiler has experience with creating controlled vocabularies like those necessary for the database integration: he also leads the HUPO Proteomics Standards Initiative, which is trying to create standards for the submission and storage of mass spec data, protein interaction data, and general proteomics experimental information (see PM 10-24-03). The PSI data is “data we’re not dealing with directly in UniProt — we will store this in specialized databases,” he said. “But then we [will] make some sort of summary and put some of the data back into UniProt.” Apweiler said there would also be UniProt links to the specialized databases.
Good is hoping that features like these links, along with even more extensive integration, will convince him to extend UniProt’s NIH mandate. “We’d like to see added getting them to try to integrate as much of the proteomics data … that has been validated,” Good said. “We would like to see more integration with other databases, other model organism databases that we pay for. And I think that will come.”
Apweiler said that in addition to seeking additional NIH funding at the end of the current grant, UniProt’s managers will also seek funding from European sources and other US funding bodies.
A list of the completed and planned changes and upgrades involved in the creation of UniProt from its three sources is available at: http://us.expasy.org/sprot/relnotes/sp_news.html.
UniProt can be accessed at: http://www.uniprot.org. Swiss-Prot, TrEMBL, and PIR still maintain their own web pages as part of UniProt at http://www.ebi.uniprot.org, http://www.expasy.uniprot.org, and http://www.pir.uniprot.org.