Mass spectrometry-based proteomics research typically relies on matching experimentally observed peptide fragmentation spectra to spectra predicted from large protein sequence databases like the UniProt Knowledgebase or the Ensembl database.
These protein databases derive the bulk of their sequence information from translation of DNA sequences submitted to nucleic acid databases. However, these nucleic acid databases are not entirely stable – with gene models changing over time as the field incorporates new information and understandings.
This instability in the underlying genomic databases leads to instability in the protein sequence databases derived from them. And this, in turn, can lead to instability in proteomic datasets – cases where proteins identified in an earlier experiment disappear or take on new identities when the spectra are reanalyzed using later or different versions of a protein reference database.
"If you take an old [proteomics] dataset and you run it again today against the current up-to-date sequence database, there is a fairly good chance you will find new peptides," Ghent University researcher Lennart Martens told ProteoMonitor. "And if you take the peptides [originally identified] and try to map them against the current database, you might find that a lot of those peptides are gone."
Martens cited an example from his own work. Several years ago he was reanalyzing a dataset and discovered that a protein he had originally identified as human had since been reclassified as a rat protein.
"It was found in [human] blood, so you never know what these people ate," he said. "That's the only logical conclusion I can come up with. How else can you have rat proteins in your blood?"
Most instances of peptide disappearance or reclassification aren't as curious as that, but, said Martens, the phenomenon does present an issue for proteomics research, and one that he suggested the field has shown little interest in tackling.
In the early days of proteomics, Martens said, reference databases were changing so quickly that oftentimes researchers would notice that they had gained and lost peptide identifications in between the time they did their experiments and wrote their manuscripts.
Now, with reference databases considerably more stable – particularly for human proteins – this is less of a problem. However, he said, researchers can still see significant changes in identifications when shifting from one reference database to another.
For instance, in a study published in 2012 in Analytical and Bioanalytical Chemistry, Martens and colleagues attempted to re-map peptides from some 1,400 proteomics experiments contained in the European Bioinformatics Institute's PRIDE database. Of the 218,679 peptides originally identified in these experiments, 48,703 could not be matched to UniProtKB/Swiss-Prot proteins, "most likely," the researchers wrote, "due to sequences that were originally obtained from other databases, for which no equivalent is currently present in UniProtKB/Swiss-Prot."
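To make the idea of re-mapping concrete, the following is a minimal Python sketch of checking a list of previously identified peptides against two releases of a sequence database. It is an illustration only, not the method used in the study: it assumes exact substring matching against FASTA sequences, and the release contents, protein IDs, and the function names `parse_fasta` and `remap_peptides` are all hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {protein_id: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the protein ID.
            header = line[1:].split()[0]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def remap_peptides(peptides, proteins):
    """Map each peptide to the proteins that contain it as an exact substring.

    Returns (mapped, unmapped), where mapped is {peptide: [protein_ids]}
    and unmapped is a list of peptides with no matching protein.
    """
    mapped, unmapped = {}, []
    for pep in peptides:
        hits = [pid for pid, seq in proteins.items() if pep in seq]
        if hits:
            mapped[pep] = hits
        else:
            unmapped.append(pep)
    return mapped, unmapped


# Toy data (hypothetical): two releases of the same database.
# In the newer release, entry P2 has been replaced by a revised entry P3.
OLD_RELEASE = """>P1
MKTAYIAKQR
>P2
GGSLVEDNKR
"""
NEW_RELEASE = """>P1
MKTAYIAKQR
>P3
GGSLVQDNKR
"""

peptides = ["TAYIAK", "SLVEDNK"]
old_mapped, old_unmapped = remap_peptides(peptides, parse_fasta(OLD_RELEASE))
new_mapped, new_unmapped = remap_peptides(peptides, parse_fasta(NEW_RELEASE))

print("unmapped in old release:", old_unmapped)
print("unmapped in new release:", new_unmapped)
```

In this toy case both peptides map to the older release, but one of them no longer matches any entry in the newer one. A real re-mapping effort would also have to handle details this sketch ignores, such as isoleucine/leucine ambiguity and differences in identifiers between databases.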
"One in five peptides, the majority of which have been recorded in the last two years or so, were simply not identifiable against the Ensembl genome for human, which is the most extensively annotated species we have by far," Martens said. "So just imagine how bad the situation is for mouse, let along something like pig or dog."
By and large, expert proteomics researchers are well aware of this problem, said EBI researcher Juan Antonio Vizcaino, author on a 2011 Molecular & Cellular Proteomics paper discussing the matter.
Less experienced researchers, however, are "not very aware about this, in general," he told ProteoMonitor.
The issue can affect not only protein identification, but quantification, as well, Vizcaino added, noting that it's possible that "a peptide used for quantification of a protein disappears in the next version of the database."
Like Martens, Vizcaino noted that the situation, particularly for human data, has improved significantly as the underlying genomic databases have grown more stable.
But, he said, "it is still changing. There is going to be a new assembly [of the Ensembl] database released this summer, for instance, and some of the gene models will change, and that will mean that some of the proteins will change, as well."
One thing that could help mitigate the problem, Martens suggested, would be a better effort by the proteomics community to use its data to improve these reference databases.
"People who make these protein sequence databases [have to determine] which of these bits of the genome actually make proteins and what these proteins look like," he said, noting that proteomics could provide experimental data on observed proteins that would be useful in this process.
But, Martens said, "proteomics people are not very active in putting their proteomics data back into the protein sequence database and gene prediction pipeline."
"The big labs are not really contributing their complete published proteomics data in a straightforward way to people like Ensembl or UniProt so they can use it to make better protein sequence databases," he said.
In general, the field doesn't seem particularly interested in the problem of shifting protein IDs, Martens said. "Nobody seems to care very much. ... When it is pointed out people take an interest, but then everyone forgets and moves on and we all just assume that these databases are moving targets."
One reason for this seeming apathy, he suggested, might be that the affected proteins are typically little studied, poorly annotated molecules. "People think, these are just predicted proteins, so we can ignore them because they aren't that interesting," he said.
"But that is a very awkward reasoning in a way," he added. "Because you would expect that people who make a living analyzing proteomes would be particularly happy to find something new, a protein that no one has characterized before."
While the ongoing stabilization of the underlying databases will continue to improve the situation, the nascent trend toward custom reference databases could create a new set of related challenges, Martens said.
Taking advantage of the rise of next-generation sequencing, some proteomics researchers have begun creating custom reference databases specific to their samples of interest (PM 5/10/2013). By including proteins that are unique to a given sample and absent from more generic databases, and by limiting the search space to proteins actually present in the sample of interest, such databases enable researchers to go deeper into the proteome, potentially improving coverage.
However, Martens noted, the proliferation of such custom databases could exacerbate the problem of shifting peptide IDs in the absence of good systems for maintaining these databases.
"Nearly everyone who does proteomics research these days searches against one of the standard databases, and these databases are all very nicely structured and run and organized," he said.
For instance, if a researcher is interested in seeing what peptides are identified in a dataset when searching against a specific version of UniProt, they can access that specific version "by clicking a few buttons on the Internet," Martens said.
In the case of custom databases, however, "it's not just one institutionalized database that runs nicely," he said. "It's individual labs that have to put these databases out in the public domain."
"Online resources, especially downloadable files, tend to have a very short half-life," he added. "A server gets formatted and goes offline; a post-doc or graduate student leaves, and everything goes away."
Compounding this, Martens noted, is the fact that the information in a custom database is, by design, harder to replace than that in the standard databases.
"Say one version of [the] SwissProt [database] goes offline – you can take the previous version or the next version and it will not be that different," he said. "But if I sequence an individual cell population, if I lose that database, I can never get it again."
"People are very excited about [creating custom reference databases] and they see lots of potential," he added. "But it is about time that we start thinking about" the issues involved in their storage and sharing.