New proteomics software is in the works at the University of North Carolina at Chapel Hill that may soon enable researchers to search their data against a variety of raw genome sequences that have not been annotated, and to determine possible post-translational modifications from top-down proteomics experiments.
This week, Morgan Giddings, an assistant professor at UNC, won a $1 million, three-year grant from the US National Institutes of Health’s National Center for Research Resources to further develop and make widely available her Genome Fingerprint Scanning database search software.
What distinguishes this peptide mass fingerprinting search engine from others is that it can search genomes that have not been completely annotated yet. “Often the sequences are published well in advance of complete annotations,” Giddings said. “It’s pretty neat in the sense that you go straight from proteomic data directly to the raw genome sequence.”
By comparison, Matrix Science’s Mascot allows researchers to search EST data, but not entire raw genomes, Giddings said. And while other researchers have used customized software to annotate genomes using proteomic data, they have not made their tools widely available, she said.
The GFS software, which is currently available to academic researchers, first translates an organism’s entire genome sequence in silico and performs a virtual proteolytic digest of the translated proteins. It then matches experimental peptide masses against this library and looks for clusters where a large number of peptides match within a narrow region (for an in-depth description, see Giddings et al., PNAS, Jan. 7, 2003).
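The steps described above (in silico translation, virtual digest, mass matching, and a search for clusters of hits) can be illustrated with a minimal Python sketch. It is not the GFS implementation. It assumes an idealized tryptic digest (cleavage after every K or R, no missed cleavages), monoisotopic masses, forward-strand reading frames only, and fixed-width bins in place of whatever clustering GFS actually uses. All function names, the 0.02 Da tolerance, and the 300-nucleotide window are hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def translate(dna, frame):
    """Translate one forward reading frame; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )


def digest(protein):
    """Yield (residue_offset, peptide), cutting after K/R and at stop codons."""
    start = 0
    for i, aa in enumerate(protein):
        if aa in "KR*":
            end = i if aa == "*" else i + 1  # a stop is not part of the peptide
            if end > start:
                yield start, protein[start:end]
            start = i + 1
    if start < len(protein):
        yield start, protein[start:]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def scan(dna, observed_masses, tol=0.02, window=300, min_len=4):
    """Rank (frame, window_start, n_matched) by distinct observed masses matched."""
    hits = defaultdict(set)
    for frame in range(3):
        for offset, pep in digest(translate(dna, frame)):
            if len(pep) < min_len or "X" in pep:
                continue
            mass = peptide_mass(pep)
            locus = frame + 3 * offset  # nucleotide position of the peptide start
            for obs in observed_masses:
                if abs(obs - mass) <= tol:
                    hits[(frame, locus // window)].add(obs)
    ranked = [(f, b * window, len(m)) for (f, b), m in hits.items()]
    return sorted(ranked, key=lambda r: r[2], reverse=True)
```

Calling `scan(genome, masses)` on a sequence string and a list of measured peptide masses returns the windows ordered by how many distinct masses they explain, so a region with many hits in one frame rises to the top. A search over a real genome would also need the reverse strand, missed cleavages, and a precomputed mass index rather than the nested loop used here.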
Proteomics researchers with an interest in organisms that have been sequenced but not completely annotated might find GFS especially appealing. Chris Upton, for example, a virologist at the University of Victoria in Canada, has been using GFS in his lab for about a year now to search virus genomes, especially large ones. “It is not affected by incomplete annotations, or errors in annotations, so it can find matches in genes that have not yet been identified in genomes,” he told ProteoMonitor in an e-mail message. Since he has the program locally installed, he can also search genomes that have not been released to the public domain yet.
But even scientists working with human proteins might find GFS useful in the future since some human genes have not been fully identified yet. “And even [with] those that have been identified, there is the whole issue of alternative splice variants,” Giddings said. She and her colleagues are working on improving the software so it can identify specific splice variants of proteins. This feature would also enable GFS to annotate complex genomes using proteomic data, she said.
At the moment, fewer than 20 different genomes — mostly bacteria and yeast — are available for search with GFS on a website. But over the next two years or so, the researchers are planning to expand this list to include larger genomes, starting with small- to medium-sized ones like Tetrahymena, and working their way up to more complex ones like mouse or human.
Large genomes are challenging both because of their size requirements and because their genes are split into multiple exons, Giddings said. This necessitates high computational speed and memory: “If we calculate all of the possible peptides that might be produced by the human genome, it’s on the order of several billion peptides,” she said.
At present, GFS is available to academic users from a UNC website (http://gfs.unc.edu). Within the next six months, Giddings hopes to make the software freely accessible to both academic and commercial researchers, and to provide an open source license for download and further development. “That’s a critical thing in general in the proteomics community — having more openly available software that’s not bound by expensive licensing terms,” she said.
Besides the GFS search software, Giddings’ group is also working on further development of a tool called Protein Cleavage and Modification Engine, or PROCLAME, that suggests post-translational modifications based on accurate, intact masses of proteins (for a description, see Holmes and Giddings, Analytical Chemistry, Jan. 15, 2004). “It’s an exploration tool that helps limit the scope of possibilities,” said Giddings. “We used it with several collaborators and had some pretty nice successes, very quickly pinpointing a set of posttranslational modifications, even fairly complex ones.” The program is unique, she said, because it does not use a database to look for known modifications.
Right now, researchers can submit intact mass data and a protein sequence at a website (http://proclame.unc.edu). The program then calculates all the possible standard modifications that are compatible with the data. But the researchers are hoping to expand PROCLAME’s capabilities, in collaboration with scientists at UNC and elsewhere, enabling the software to use fragmentation data from top-down mass spec experiments. “That’s where I see that going next,…You just do this quick top-down measurement, plot it right into the software and get an instantaneous identification and also characterization of what modification is present,” Giddings said.
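The idea of listing every set of standard modifications compatible with an intact mass can be sketched in a few lines of Python. This is not PROCLAME's algorithm, which the article does not describe. It assumes the theoretical unmodified mass of the protein has already been computed from its sequence, considers only four common modifications with their monoisotopic mass shifts, ignores cleavage events, and uses a hypothetical tolerance and cap on the number of modifications. The names `MOD_SHIFTS` and `candidate_mod_sets` are invented for illustration.

```python
from itertools import combinations_with_replacement

# Monoisotopic mass shifts in daltons for a few common modifications.
# A real tool would consider many more, plus terminal cleavages.
MOD_SHIFTS = {
    "phospho": 79.96633,
    "acetyl": 42.01057,
    "methyl": 14.01565,
    "oxidation": 15.99491,
}


def candidate_mod_sets(unmodified_mass, observed_mass, max_mods=3, tol=0.05):
    """Return every multiset of modifications whose total shift explains the data."""
    delta = observed_mass - unmodified_mass
    results = []
    for n in range(max_mods + 1):
        # Order does not matter, and the same modification may occur repeatedly.
        for combo in combinations_with_replacement(sorted(MOD_SHIFTS), n):
            shift = sum(MOD_SHIFTS[name] for name in combo)
            if abs(shift - delta) <= tol:
                results.append(combo)
    return results
```

With these settings, a mass difference of 42.01057 Da yields two candidates, a single acetylation or three methylations, because the two differ by only about 0.036 Da. Narrowing the tolerance, or adding fragmentation data as the researchers propose, is what would separate such cases.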
The ultimate goal, within the next three years or so, is to put both GFS and PROCLAME together and make them widely accessible. Giddings said she hopes her annotation-independent software will not only be a useful tool for proteomics researchers, but will also help them make new discoveries. “Coming from a background of sequencing and the genomics world, I have learned that the annotations, as much as people put very good hard work into them, are just not completely accurate yet,” she said. “The more we can avoid limiting ourselves with those annotations, the more novel discoveries we can make with the software.”