By Tony Fong
The National Human Genome Research Institute and the National Institute of General Medical Sciences hope to create a central protein data repository in an effort to better organize not only existing information about protein sequences and functions, but also the data anticipated to be generated in coming years.
Last week, the two institutes put out a request for proposals to create the repository, which they said would serve as a "comprehensive protein sequence and function information resource that will enable a broad range of scientists to use the large amount of information available on proteins and their functions in a wide variety of experimental and computational applications."
The project is budgeted for a maximum of $6 million per year for three years, with a maximum of $5.5 million for direct costs.
The deadline for filing a letter of intent is April 14. The application is due May 12. The anticipated start date is Sept. 30.
According to the project's funding opportunity announcement, or FOA, the need for a central repository for protein sequence and function data stems from the amount of data recently generated in proteomics and protein research, as well as the anticipated flood of protein data to come.
Next-generation sequencing technologies will result in a significant increase in nucleic acid sequence data, NHGRI and NIGMS said, which will in turn increase the number of identified proteins and associated variants.
High-throughput proteomic approaches using mass spectrometry, protein microarrays, co-immunoprecipitation, and yeast two-hybrid screens have in recent years provided researchers with new ways to investigate protein functions and "the composition of the molecular machines that perform the functions in a cell."
But while such methods can identify protein functions for a large number of proteins in a single experiment, "the majority of information about protein function is still derived from hypothesis-driven experimental work published in the scientific literature," NHGRI and NIGMS said.
"The curation of data from … the literature and directly from high-throughput experiments is most valuable if it is centrally located for access by the scientific community," according to the FOA.
To be sure, "numerous repositories" have been created to store protein-related data, including the Protein Data Bank, an early repository for protein macromolecular structures, and the Protein Information Resource, an early protein sequence database, the FOA said.
GenBank and the EMBL Nucleotide Sequence Database were created in response to the production of DNA sequences in order to house gene and translated protein sequences, and newer, high-throughput technologies in the proteomics field have led to the development of Peptidome and PRIDE.
Other repositories have also been developed that focus on specific aspects of protein function and form. In 2008, developers of UniProt released a "complete" set of annotated human proteins [See PM 09/05/08], and last year the founders of PRIDE, Tranche, and PeptideAtlas launched ProteomeExchange "to provide a single point of submission to proteomics repositories." The resource is also meant to "encourage the data exchange and sharing of identifiers between the repositories so that the community may easily find datasets in the participating repositories," according to its website.
However, as new data is generated and incorporated into these public repositories, there is a risk of overlap and duplication because the resources commonly cross-reference annotations from one another.
"There are a lot of databases out there, and what we want to do is to look at trying to unify some of these datasets," Vivien Bonazzi, program director for computational biology and bioinformatics at NHGRI, told ProteoMonitor this week.
In July 2008, as a first step toward that goal, the NHGRI held a workshop attended by a wide swath of the proteomics and protein-research community to discuss current and future data-storage needs.
Among the key outcomes of the meeting was the conclusion that any new repository needs to contain both protein sequence and functional information. It should also contain high-quality manual annotations and "leverage but not duplicate" information from smaller databases. And it should be able to handle high-throughput data and be easy to use.
Based on discussions at that workshop and elsewhere within the scientific community, NHGRI and NIGMS developed their FOA.
The FOA outlines 14 features that NHGRI and NIGMS seek. For instance, the repository should be curated, accurate, stable, and comprehensive, and applicants need to explain and justify how the data will be curated. Also expected are descriptions of quality-control procedures and metrics, and "plans for maintaining stability of the resource."
The repository should also include information on certain data types, such as protein sequences, nomenclature, alternatively spliced proteins, homology and paralogy relations, and family classifications.
"Additional relevant information on gene function should also be included, for example standardized vocabularies of Gene Ontology terms, potential protein interactions, expression patterns, and pathways," the FOA said.
And the repository should also "leverage and integrate, but not duplicate, appropriate data from other existing genomics and proteomics resources," it stated.
For example, rather than "subsume" a protein-protein interaction database into the proposed repository, an applicant should consider linking to the appropriate resources instead of re-annotating the interactions, Bonazzi said.
"Simply trying to aggregate a whole lot of datasets or databases isn't the way to solve this," she said. "If you're trying to provide sequence and function information about a protein, you want to think about it from the context of, 'What's the information that you need [and] where do you best get it?'
"What we'd like the applicants to think about is what's available now and what data is [forthcoming] and how that is relevant to a resource like this, as well as how it's relevant to the community and its ability to use something like this," she added. "It's that balance between doing the kitchen sink and being the boutique and doing a single thing at a time."
While the emphasis is not on raw data, in some instances, she said, it may be appropriate for the repository to contain such information. Large amounts of sequence data are already deposited with the National Center for Biotechnology Information and the European Bioinformatics Institute, so "I wouldn't see much point in duplicating that. I think that would be a potential waste of resources."
But metaproteomic and metagenomic data are new enough that the raw data may have value in the kind of repository that NHGRI and NIGMS are seeking.
"There the issue is that a lot of that data is not generated by any kind of databases yet, but I can imagine that you need to think about what you're going to do with that," Bonazzi said. "So where do you store it? Is it sequence data? Is it derived protein data? Do you store the sequence at NCBI or EBI or some other repository, and then you take information from it?"
"Of course if you store the raw data, then you become a repository or an archive, and I'm not sure that's what we really want to do here," she said. "Here it's trying to leverage effectively metadata of the raw sequences, but I can imagine in some circumstances you may want to stage an area of raw data for the purposes of [testing] what a new area is."
She cautioned that anyone looking to include that in their application would have to consider the costs involved to support such data, "and if that is something they want to do … they'd have to be clear about what their intent was."
She added that although the project is focused on human proteins, "other species are a very valuable resource for understanding human health and disease. Therefore, any data that facilitates that would be appropriate."
While efforts such as UniProt are already trying to carry out parts of what NHGRI and NIGMS plan to do, one researcher said that there is still plenty of need in the field for such a project.
To Phil Andrews, a professor of biological chemistry at the University of Michigan whose lab developed and maintains the Tranche database, a repository such as the one that NHGRI and NIGMS are proposing would be "a valuable resource and crucial for the continued success of proteomics and related areas." While existing resources need to be supported, "we also need to allow development of new resources and provide for integration across resources," he said in an e-mail.
Echoing that view is Akhilesh Pandey, an associate professor of biological chemistry, pathology, and oncology at Johns Hopkins University and the impetus behind the Human Proteinpedia, a portal for integrating and sharing proteomics data. He said that such a repository is needed, but cautioned that the project may be too ambitious.
Pandey said that he is "a little skeptical" that the project can be done at this point "because it basically mentions everything, and we don't even have resources that can do even one-fifth" of what NHGRI and NIGMS are asking. He does not plan to apply for the grant because he does not have the infrastructure needed to carry out such a project, he said.
The project should be narrowed to include only human protein information, he said, adding that instead of the broad scope of data currently proposed, it should seek data "for a few things that are really needed. … Simple things like emphasizing some pathways would be nice."
He added that other things, such as next-generation sequencing-based data, may not yet be useful because "there is very little merger of the genome and the proteome … and right now people don't know how to handle that kind of data."