UW-Madison Researchers Demonstrate Potential of Custom Search Databases in Mass Spec Proteomics


Researchers at the University of Wisconsin-Madison have completed a mass spec-based analysis of novel splice forms in Jurkat cells using a custom search database constructed via RNA sequencing.

Their analysis, detailed in a paper published last month in Molecular & Cellular Proteomics, identified 57 splice junction peptides not present in the UniProt-TrEMBL protein database, offering an example of the potential of custom search databases to enhance proteomics discovery efforts.
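To make the idea concrete, the following is a minimal sketch of how peptides unique to a custom database might be flagged. It is not the UW-Madison group's pipeline. It assumes the RNA-seq transcripts have already been translated into protein sequences, uses a simplified trypsin rule (cleave after K or R unless the next residue is P, no missed cleavages), and the function names and toy sequences are hypothetical.

```python
def tryptic_peptides(protein, min_len=6):
    """Digest a protein in silico with a simplified trypsin rule.

    Assumption: cleave after K or R unless the next residue is P;
    missed cleavages are ignored to keep the sketch short.
    """
    peptides, start = set(), 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            if i + 1 - start >= min_len:
                peptides.add(protein[start:i + 1])
            start = i + 1
    # Keep the C-terminal fragment if it is long enough.
    if len(protein) - start >= min_len:
        peptides.add(protein[start:])
    return peptides


def novel_peptides(custom_proteins, reference_proteins, min_len=6):
    """Return peptides found in the custom database but not in the reference."""
    reference = set()
    for sequence in reference_proteins:
        reference |= tryptic_peptides(sequence, min_len)
    novel = set()
    for sequence in custom_proteins:
        novel |= tryptic_peptides(sequence, min_len) - reference
    return novel


# Toy sequences (hypothetical): the custom entry mimics a splice variant
# whose exon junction creates a peptide absent from the reference entry.
reference_db = ["AAAAAAKCCCCCCKDDDDDDK"]
custom_db = ["AAAAAAKCCCDDDK"]
print(sorted(novel_peptides(custom_db, reference_db)))  # ['CCCDDDK']
```

In practice, candidates flagged this way would still need to be matched against experimental spectra by a search engine and filtered at a chosen false discovery rate.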

While proteins are, of course, the ultimate concern, mass spec-based proteomics relies on DNA reference databases for matching experimental spectra and peptide sequences to their corresponding proteins.

These databases are frequently updated, but they are nonetheless incomplete given the vast number of different protein forms in the human proteome and the fact that not all of these forms are necessarily expressed in every cell or tissue type.

These limitations, combined with the rise of next-generation sequencing, have led some researchers to create custom search databases specific to the samples they are investigating. While the practice remains relatively uncommon, UW-Madison researcher Lloyd Smith, leader of the MCP splice form study, told ProteoMonitor that he thinks it could in the future become standard in mass spec-based proteomics.

"We actually think it's the way to go, because sequencing itself is getting so much cheaper and more routine," he said.

In addition to allowing researchers to identify protein forms not present in conventional generic search databases, the technique also has the potential to improve the depth of coverage and peptide matching by restricting the search space to protein forms actually present in the specific sample being investigated, Michael Shortreed, a senior scientist in Smith's research group and co-author on the MCP paper, told ProteoMonitor.

The group has not quantified the difference, but, Shortreed said, they have observed that when using the custom RNA-seq database they "get a good bit deeper coverage."

"We see more different proteins, and we see more peptide spectral matches" compared to analyses done using traditional databases, he said.

"When you think about it, it actually seems a little wacky to use a large disparate database [compiled] from genetically disparate sources when you could use a tailored focused database from a genetically identical source," Smith said.

"I think that RNA-seq data provides the best means of detecting variant proteins, especially those that differ between individual samples," said Vanderbilt University researcher Daniel Liebler, whose lab last year published a paper in the Journal of Proteome Research on using such custom databases.

"Not only does the RNAseq data provide for custom databases for peptide identification, but it also provides confirmation of the putative ID at the transcript level, Liebler, who was not involved in the UW-Madison research, told ProteoMonitor.

Smith said that he saw the method as key to approaching proteomics research from the standpoint not just of proteins but of proteoforms – the many protein variants that compose the proteome.

"I think the idea [is] that the community is going to gradually build out a large database that has millions of entries of the different proteoforms that have been seen ... including genetic [protein] variants as well as post-translational modifications and alternative splicing variants," he said.

Such a database, however, would be too large to be useful as a search database, Smith said, noting that its size would result in unacceptably high false discovery rates.

If, however, researchers used that database as "a huge mothership database" and then created custom databases for "individual samples that brought the complexity down to maybe 20,000 or 30,000 different elements, that could be incredibly useful," he said.
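The "mothership" idea Smith describes can be sketched as a simple filter: keep only those entries of a large proteoform collection that have transcript-level evidence in the sample at hand. The snippet below is an illustration only. The identifiers, the read-count threshold, and the function name are hypothetical, and a real workflow would also have to handle sample-specific variants that are missing from the large collection.

```python
def build_sample_database(mothership, transcript_counts, min_reads=1):
    """Subset a large proteoform collection to entries seen in one sample.

    mothership: dict mapping a proteoform ID to its amino acid sequence.
    transcript_counts: dict mapping the same IDs to RNA-seq read counts
        for the sample (assumed to be computed upstream).
    min_reads: hypothetical evidence threshold; entries with fewer
        supporting reads are left out of the search database.
    """
    return {
        proteoform_id: sequence
        for proteoform_id, sequence in mothership.items()
        if transcript_counts.get(proteoform_id, 0) >= min_reads
    }


# Toy data (hypothetical IDs and sequences).
mothership_db = {
    "GENE1-canonical": "MAAAAKCCCCR",
    "GENE1-splice2": "MAAAAKDDDDR",
    "GENE2-canonical": "MEEEEKFFFFR",
}
sample_counts = {"GENE1-splice2": 42, "GENE2-canonical": 0}

sample_db = build_sample_database(mothership_db, sample_counts)
print(sorted(sample_db))  # ['GENE1-splice2']
```

The smaller the resulting database, the fewer candidate sequences each spectrum is compared against, which is the reduction in search space the researchers describe.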

Beyond improving proteomics research in commonly studied organisms like humans, custom database creation also enables analyses of organisms or sample types for which there may not be existing quality search databases.

For instance, in March 2012, Cornell University researcher Jocelyn Rose used RNA sequencing to build a search database for tomato pollen. Using the database, he and his colleagues identified more than 1,200 pollen proteins, finding that their custom-built database produced results comparable to those from an established DNA reference database (PM 5/11/2012).

"The gold standard plant reference genome has been Arabidopsis thaliana," an organism with little agricultural value, Rose told ProteoMonitor at the time. Significant work has also been done in rice, but, he noted, by and large proteomics' agricultural reach has been limited.

"What we're arguing is that RNA-seq is a massive enabling platform, and that all of a sudden proteomics is going to be possible in whatever [organism] you want," he said.

Biotech firm Cell Signaling Technology has similarly built custom RNA-seq search databases for its proteomics work, using them to profile B-cell antibody repertoires as part of a proteomics-based antibody discovery method (PM 1/11/2013).

Although the relative ease of current next-generation sequencing techniques has made creation of custom proteomic search databases feasible, the approach is still "a little bit challenging," Shortreed said, adding that he and his colleagues were currently working on a proposal to develop software that would simplify the process.

"In this [MCP] paper, it was a lot of work to handle the data and do all the data analysis and put it all together," Smith said. "So something that needs to be done is reducing that burden of data analysis, which is pretty substantial right now."

One party with a potential interest in tackling this challenge – assuming continued growth in research interest in using custom search databases – could be Thermo Fisher Scientific, which following its purchase of Life Technologies will have under one roof both the next-generation sequencing technology for creating such databases and the mass spec technology for using them (PM 4/19/2013).

Thermo Fisher declined to comment on whether it had any ambitions in this regard, but, Shortreed said, the vendor could prove a great help in advancing the technology were it to pursue such a project.

"I think there's a big advantage in having people who are working together and understanding the challenges of both [NGS and mass spec]," he said. "Because there is a barrier to entry that is pretty high depending on which side of the fence you are on that would prevent you from taking these technologies and doing the exciting things you can do with them. So having people who are at that interface would help a lot."

Of course, as Smith noted, that's more easily said than done. "It sounds great to say that it will all be under one roof [at Thermo Fisher]," he said. "But, in fact, there are going to be two different worlds – the nucleic acid world and the mass spec world – and the software people who work in those two worlds aren't going to be the same. So for someone to get impassioned about the idea and allocate the resources and figure out how to bring everyone together, it would be doable, but complicated."