NGS-Based Custom Reference Databases Expand the Reach of Proteomics Beyond Model Systems


This story originally ran on May 10.

Though in proteomics the proteome is, of course, the ultimate concern, mass spec-based approaches rely on DNA reference databases for matching experimental spectra and peptide sequences to their corresponding proteins.
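To make that dependence concrete, here is a minimal sketch of the lookup step, assuming the reference database has already been translated into protein sequences. The function names, the six-residue minimum length, and the simplified trypsin rule (cleave after K or R unless the next residue is P) are illustrative assumptions, not details from the study; real search engines score spectra against candidate peptides rather than matching strings exactly.

```python
def tryptic_digest(protein, min_len=6):
    """Split a protein into peptides using a simplified trypsin rule.

    Assumption: cleave after K or R, except when the next residue is P.
    Peptides shorter than min_len are dropped (hypothetical cutoff).
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return [p for p in peptides if len(p) >= min_len]


def build_peptide_index(database):
    """Map each peptide to the set of protein IDs that could produce it.

    `database` is a dict of {protein_id: amino-acid sequence}.
    """
    index = {}
    for protein_id, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            index.setdefault(peptide, set()).add(protein_id)
    return index


def match_peptide(peptide, index):
    """Return the proteins a peptide maps to, or an empty set if none."""
    return index.get(peptide, set())
```

A peptide from a protein that is absent from the database returns an empty set, which is the situation researchers face with unsequenced organisms.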

This isn't a problem so long as researchers confine their studies to organisms with fully sequenced, well-annotated genomes. It can become an issue, however, for scientists working on less commonly investigated, non-model systems.

According to Jocelyn Rose, an associate professor of plant biology at Cornell University, this has presented challenges for plant and agricultural proteomics, in particular.

"In terms of practical applications, if you [study] anything outside of … [traditional] model systems, you just can't do proteomics – the [peptide] matching is so poor," he told ProteoMonitor this week.

With the rise of next-generation sequencing, however, DNA and RNA sequencing has become significantly cheaper and faster, raising the possibility, Rose said, of creating custom reference databases for proteomics research.

In a study published in the March issue of Proteomics, he and colleagues from Cornell, Colorado State University, and the US Department of Agriculture employed such an approach to build a transcriptome database for tomato pollen. Using RNA-seq on a Roche 454 GS FLX machine to generate the database and an Applied Biosystems 4000 QTRAP for their mass spec analysis, the researchers identified more than 1,200 pollen proteins, finding that their custom-built database yielded results comparable to those from an established DNA reference database.

The results, Rose said, demonstrate the potential of NGS-built databases to expand the scope of proteomics research.

"I think there is stunning potential [for agricultural research]," he said, adding that as growing populations and climate change place increasing pressure on the world's food supply, researchers and funding agencies will likely move more and more toward applied agricultural research.

"That means by definition that the breadth of plant species or microbial species or animal species [being researched] is going to increase," Rose said.

To date, he noted, "the gold standard plant reference genome has been Arabidopsis thaliana," an organism with little agricultural value. Significant work has also been done in rice, but, Rose said, by and large proteomics' agricultural reach has been limited.

"What we're arguing is that RNA-seq is a massive enabling platform, and that all of a sudden proteomics is going to be possible in whatever [organism] you want," he said.

Christof Rampitsch, a cereal proteomics researcher at Canada's Department of Agriculture and Agri-Food, agreed that a lack of reference databases has hindered agricultural proteomics, noting that common workarounds like using homology-based searches can sometimes lead to questionable protein IDs.

"I see this quite a lot in the literature," he told ProteoMonitor. "Someone will be working on something like sugar cane or banana – something that is a relatively major crop that hasn't been sequenced – and they'll do a homology-based search and just report the best hit they get whether that hit has a score that is barely above the threshold or it's a reasonably good score. They'll report, say, spot number 23 from this 2D gel aligns with ribosomal protein six, or whatever, and then start to draw conclusions from that."
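The concern Rampitsch describes can be illustrated with a short sketch that refuses to report a best hit unless it clears the significance threshold by some margin. The threshold, the margin, and the scores below are hypothetical values chosen for illustration; they do not come from any particular search engine or from the researchers' work.

```python
def confident_hit(hits, threshold, min_margin=10.0):
    """Return the best-scoring hit only if it clears the threshold comfortably.

    `hits` is a list of (protein_name, score) tuples from a homology search.
    `threshold` and `min_margin` are hypothetical, user-chosen values.
    Returns None when there are no hits or the best hit is borderline.
    """
    if not hits:
        return None
    best_name, best_score = max(hits, key=lambda hit: hit[1])
    if best_score < threshold + min_margin:
        return None  # barely above (or below) threshold: do not report an ID
    return best_name
```

With a threshold of 40, a best hit scoring 41 would be left unassigned, while one scoring 75 would be reported.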

Rampitsch said that his lab is currently using an essentially identical approach to build RNA-seq databases for proteomic research into different varieties of wheat rust. A reference database currently exists for one form of rust, but, he said, "the different varieties of rust encode different virulence genes, so there are very subtle differences" that can't be picked up by searching against this single database.

"So now we've sequenced five different varieties, and we'll be querying those databases individually – variety against variety," he said.

Beyond simply opening up proteomics to new organisms, the level of specificity offered by a custom-database approach could enhance work in existing model organisms, as well, Rose suggested.

"It's a new model where you could use a single tissue as your model for RNA and proteins simultaneously," he said. "That really helps you focus on the exact transcripts, the exact splice variants, the exact peptides that are generated [in a given tissue], and that really helps spectral match very, very precisely."

"We've been doing this with laser capture systems where you go to single cell or tissue types," he added. "If you can do things in parallel, it can be a very powerful approach to understanding. It helps focus on the exact question you're asking, on the exact biological material at that time and place."

RNA-seq data, in particular, has the advantage of offering direct information on transcripts, including splice variants, Rose noted, adding that this is helpful for mass spec-based proteomics work in organisms for which well-annotated genome sequences don't exist.

The sequencing work required to build custom reference databases is fairly straightforward, Rose said. However, he allowed, it requires significant bioinformatics expertise.

"You have to build a pipeline where you can take all the transcript information, do good contig assembly, good gene prediction, assess splice variants," he said. "So you've got to have someone in place who knows what they're doing computationally."
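As a rough illustration of the last stage of such a pipeline, the sketch below turns already-assembled contigs into a searchable protein FASTA file. It substitutes naive six-frame translation for the gene prediction and splice-variant assessment Rose describes, so it is a simplification rather than the researchers' method; the function names, the 50-residue minimum, and the FASTA header format are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(dna.upper()))


def translate(dna):
    """Translate DNA in frame 0; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_orfs(contig, min_len=50):
    """Collect stop-free stretches from all six frames of a contig.

    min_len is a hypothetical cutoff (in residues) to discard short fragments.
    """
    orfs = []
    for strand in (contig, reverse_complement(contig)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append(segment)
    return orfs


def write_protein_fasta(contigs, path, min_len=50):
    """Write candidate proteins from {contig_id: sequence} to a FASTA file."""
    count = 0
    with open(path, "w") as handle:
        for contig_id, sequence in contigs.items():
            for n, orf in enumerate(six_frame_orfs(sequence, min_len), start=1):
                handle.write(f">{contig_id}_orf{n}\n{orf}\n")
                count += 1
    return count
```

Six-frame translation inflates the database with frames that are never expressed, which is one reason proper gene prediction, as Rose notes, calls for computational expertise.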

Rampitsch agreed, calling bioinformatics "the real bottleneck in terms of assembling your sequence so that you have a database on your servers that you can search."

Both researchers said they planned to make the databases they'd developed publicly available.

While Rose's and Rampitsch's efforts have focused on agricultural proteomics, custom RNA-seq reference databases are being used in other areas of proteomics research, as well.

In March, biotech firm Cell Signaling Technology published a paper in Nature Biotechnology detailing a technique using proteomic analysis of animal B-cells to improve monoclonal antibody production (PM 3/30/2012). Because B-cell antibody repertoires are constantly changing in response to foreign antigens, no proteomic reference databases existed for these proteins.

This meant that, like Rose and Rampitsch, the CST researchers had to build their own reference databases, which they did using NGS on a Roche 454 Life Sciences platform to sequence RNA from splenic B-cells taken from the animals used in the study.