Emory Team Develops New Software for Building Sample-Specific Mass Spec Reference Databases


NEW YORK (GenomeWeb) – Emory University researchers have developed a new software package for generating sample-specific mass spec databases and used it to assess allele-specific differences in the expression of certain protein variants.

Detailed in a paper published last month in the Journal of Proteome Research, the software package, called GenPro, enables creation of mass spec search databases from genome-level sequencing data, which could improve coverage of protein variants and allow for identification of variants not originally expressed in the tissue being analyzed, said Nicholas Seyfried, an assistant professor of biochemistry and neurology at Emory and senior author on the paper.

Traditionally, mass spec-based proteomic experiments have relied on protein sequence reference databases for matching experimental spectra to peptide sequences and their corresponding proteins.

These databases are frequently updated, but they are nonetheless incomplete given the vast number of different protein forms in the human proteome and the fact that not all of these forms are necessarily expressed in every cell or tissue type.

With the rise of next-generation sequencing and growing interest in proteogenomics, more researchers have begun generating sample-specific databases, in which they use NGS data from their sample of interest to build a reference database for the subsequent mass spec search.

In addition to allowing researchers to identify protein forms not present in conventional generic search databases, the technique also has the potential to improve the depth of coverage and peptide matching by restricting the search space to protein forms actually present in the specific sample being investigated.

Most of this sample-specific database work has been done using RNA-seq data. However, Seyfried said, he and his colleagues believed they could get better coverage, particularly for their work in neurodegenerative diseases, using genome-level sequence data. They therefore developed the GenPro software to enable easy construction of sample-specific databases from whole-exome sequence (WES) data.
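The general idea of turning called coding variants into searchable variant peptides can be sketched in a few lines of Python. This is an illustrative sketch only, not GenPro's actual implementation: the function names, the toy protein sequence, and the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages) are all assumptions made for the example.

```python
import re

def tryptic_peptides(protein, min_len=6):
    """In silico digest: cleave after K or R unless the next residue is P.

    Simplified rule with no missed cleavages (an assumption for this sketch).
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if len(p) >= min_len]

def apply_substitution(protein, position, ref, alt):
    """Apply a single amino acid substitution at a 1-based position."""
    if protein[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}")
    return protein[: position - 1] + alt + protein[position:]

def variant_peptides(protein, substitutions):
    """Return peptides that appear only after applying each substitution."""
    reference = set(tryptic_peptides(protein))
    found = set()
    for position, ref, alt in substitutions:
        mutated = apply_substitution(protein, position, ref, alt)
        found |= set(tryptic_peptides(mutated)) - reference
    return sorted(found)

# Hypothetical reference protein and a hypothetical V10M variant call.
REFERENCE_PROTEIN = "MAGTLLKSEVDAPRNNQWERTYIPASK"
print(variant_peptides(REFERENCE_PROTEIN, [(10, "V", "M")]))  # ['SEMDAPR']
```

In a real workflow, the variant peptides would be appended to the reference protein entries in the search database so that both the reference and variant forms can be matched against the spectra.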

Their shift to WES data for constructing their databases was prompted by Thomas Wingo, assistant professor of neurology and human genetics at Emory and first author on the JPR paper, who helped lead development of the software, Seyfried said.

Wingo "has done a lot of whole-exome sequencing looking for rare variants that may be associated with or cause neurodegenerative diseases," Seyfried said. "And he believes that exome [sequencing] has higher coverage and gives fewer false positives when it comes to variant calling."

"We had done the whole-exome sequencing on [post-mortem brain tissue from] two individuals," he said. "So, we had that data in hand, and then we thought could we develop [database-building] software that could be applicable to whole-exome sequencing."

Seyfried noted that in addition to its potential to enable better detection of variants, WES has other potential advantages over RNA-seq for building databases, particularly for the neurodegenerative disease work his research focuses on.

For instance, he and his colleagues, as well as other researchers in the field, have found that in diseases like Alzheimer's, a number of proteins are deposited in the brain from peripheral tissues, many of which are expressed at only very low levels or not at all in the brain itself. Databases built from brain RNA-seq data would miss these proteins, Seyfried noted.

Another important consideration, given that much of his work is done in postmortem brain tissue, is the fact that RNA can degrade substantially.

"Certainly, in postmortem human samples you have to kick out a lot of samples because you don't have high-quality RNA," he said. "Whereas the genome is stable, and you can make a personal protein database off DNA from any cell type."

In the JPR paper, the Emory team used the GenPro software to generate reference databases for two postmortem brain samples, which they then analyzed using a Thermo Fisher Orbitrap Fusion mass spectrometer. In all, they identified around 117,000 unique peptides corresponding to roughly 9,300 proteins in each sample. Across the two samples, they also identified 977 peptide variants.

Using synthetic peptides, they validated six of the single amino acid peptide variants that they identified in the original mass spec experiments, a step Seyfried suggested was key to confirming such variants.

"With one-hit peptide IDs, there's always some caution there," he said. "The user has to go in and say, 'Alright, this is a really good score [indicating a likely true hit]. Let's go synthesize the peptide,' or 'This is a variant that we really care about.' Maybe it's in a gene that causes neurodegeneration. Maybe it's in a gene that's associated with cancer. So, you work from a discovery phase into a validation phase."

The researchers also used their variant data to look at allele-specific protein abundance, a phenomenon that Seyfried said could have implications in various disease processes.

"We're diploid organisms, so you inherit one allele from mom and one allele from dad, and the hypothesis would be that most alleles would be equally abundant," he said. "Our idea was to look at the expression of the variant [peptide] versus the reference and see if they are generally one to one in terms of their expression."

Looking at 429 of the peptide variants they detected, they found that for 95 of them (22 percent), one allele was expressed at a level at least four-fold higher than the other. In 40 percent of those cases, the variant peptide was more highly expressed, while the reference peptide was more highly expressed in the remaining 60 percent.
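The four-fold comparison described here amounts to a simple ratio test on each variant-reference peptide pair. The sketch below shows one way such a test could be written. It is not the authors' analysis code: the function name, labels, and example intensities are hypothetical, and only the four-fold threshold and the 95-of-429 count come from the reported results.

```python
def classify_pair(variant_intensity, reference_intensity, fold_threshold=4.0):
    """Label a variant/reference peptide pair by which allele dominates.

    Intensities are assumed to be positive, linear-scale abundance values.
    """
    if variant_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("intensities must be positive")
    ratio = variant_intensity / reference_intensity
    if ratio >= fold_threshold:
        return "variant-high"
    if ratio <= 1.0 / fold_threshold:
        return "reference-high"
    return "balanced"

# Hypothetical intensities, for illustration only.
print(classify_pair(8.0, 1.0))  # variant-high
print(classify_pair(1.0, 4.0))  # reference-high
print(classify_pair(3.0, 2.0))  # balanced

# Share of pairs reported as imbalanced: 95 of 429.
print(round(100 * 95 / 429))  # 22
```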

They followed this up with targeted quantitation of three variant and reference peptide pairs, using parallel reaction monitoring mass spec and heavy isotope internal standards to do absolute quantitation, and found that this targeted quantitation data largely agreed with their findings in the larger set of variant-reference pairs.

Seyfried noted that the small number of samples analyzed in the study precludes the researchers from drawing any specific conclusions about the effects of the observed allele-specific expression, but said that it suggests an approach to future research into this question.

"Are there proteins that are less susceptible, or genes that are less susceptible, to imbalance?" he asked. "For coding variants that cause or are associated with disease, how does that affect their allelic balance in the brain? I think we can now start to get at those questions."