ORLANDO, Fla. — Novel and upgraded software for proteomic analysis, new peptide-scoring methods, and new ways to optimize existing search engines made a splash at this year’s PITTCON conference, held here this week.
“I would like to be at the point where you can plug in a genome sequence and have a nice cellular model,” Morgan Giddings, an assistant professor at the University of North Carolina at Chapel Hill, said during a session on informatics methods used to mine mass-spectrometry data (see Proteomics Pioneer).
Giddings acknowledged, however, that automatically translating genomic sequence into function is many years away. In the meantime, she has developed software and an approach for analyzing mass-spectrometry data that maps mass-spec data directly to the raw genome sequence rather than to proteins (see ProteoMonitor 12/3/2004).
By bypassing protein and gene databases, the approach, called Genome Fingerprint Scanning, allows for the identification of proteins that arise from alternatively spliced genes, proteins with mutations, and proteins with post-translational modifications.
“If you build a protein database just predicting off the genome, it’s going to be moderately representative, but it’s certainly not going to represent every possible product at the current time,” Giddings explained. “The idea behind GFS is to totally bypass that — to take the mass spec data and directly say, ‘Here are the places on the genome sequence that that falls.’”
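The core idea can be sketched in a few dozen lines: translate all six reading frames of the raw genome, digest the resulting protein sequences in silico, and look up observed peptide masses against those fragments. The sketch below is a minimal illustration of that concept, not Giddings' actual GFS implementation; the tryptic-digestion rule, mass table, and tolerance are standard, but every function name here is hypothetical.

```python
# Minimal sketch of the GFS idea: map observed peptide masses directly to
# a raw genome via six-frame translation and in-silico tryptic digestion.
# Illustrative only -- not the actual GFS implementation.

# Standard genetic code, built from the canonical TCAG codon ordering.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {a + b + c: AAS[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}

# Monoisotopic residue masses (Da); one water is added per peptide.
MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER = 18.01056

def translate(dna):
    return "".join(CODON.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def revcomp(dna):
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))

def tryptic_peptides(protein):
    """Yield (start, peptide): cut after K/R unless followed by P."""
    start = 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if aa == "*":                      # stop codon: restart after it
            start = i + 1
        elif (aa in "KR" and (last or protein[i + 1] != "P")) or last:
            yield start, protein[start:i + 1]
            start = i + 1

def gfs_hits(genome, observed_masses, tol=0.01):
    """Return (observed_mass, strand, frame, peptide) for each match."""
    hits = []
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        for frame in range(3):
            for start, pep in tryptic_peptides(translate(seq[frame:])):
                if any(a not in MASS for a in pep):
                    continue
                mass = sum(MASS[a] for a in pep) + WATER
                for obs in observed_masses:
                    if abs(mass - obs) <= tol:
                        hits.append((obs, strand, frame, pep))
    return hits
```

Given a short genome fragment that happens to encode the tryptic peptide SAMPLER, `gfs_hits(genome, [802.4007])` locates the peptide without consulting any protein or gene database. The real GFS adds scoring, genomic-coordinate reporting, and the machinery to scale this to whole genomes.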
Yet she’s also begun to address the challenge of translating genomic sequence directly into function. To that end, Giddings has been developing a program that integrates microarray data with proteomic and genomic data. The program is in its very early stages and has a long way to go, she said.
“There’s still sort of a lack of resolution with proteomics and microarray data,” she noted.
Thermo to Launch ProSight PC at ASMS
Neil Kelleher, an assistant professor of chemistry at the University of Illinois at Urbana-Champaign, described an upgraded version of his ProSight PTM software for identifying and characterizing intact proteins and their post-translational modifications.
The upgraded program, called ProSight PC, includes five different search modes for matching top-down mass spectra with a “warehouse” of sequences that includes proteins with post-translational modifications.
Thermo Electron will launch the new software at the American Society for Mass Spectrometry conference in June, ProteoMonitor has learned.
The PTM-rich warehouse was created by an approach called “shotgun annotation.” Using this method, all protein forms that can be generated by different modifications, in their various combinations, are enumerated according to known biological rules and included in the database.
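A toy version of that combinatorial expansion might look like the following; the two modification rules and the cap on enumerated forms are illustrative stand-ins, not the actual shotgun-annotation rule set.

```python
# Toy sketch of "shotgun annotation": expand a base sequence into all
# modified forms allowed by site-specific PTM rules. The rules below are
# illustrative, not the actual ProSight rule set.
from itertools import product

# Residue -> possible mass shifts (Da); 0.0 means unmodified.
PTM_RULES = {"S": [0.0, 79.96633],   # phosphorylation
             "K": [0.0, 42.01057]}   # acetylation

def expand_forms(sequence, rules=PTM_RULES, max_forms=1000):
    """Enumerate (total_mass_shift, {site: shift}) for every combination."""
    sites = [(i, rules[aa]) for i, aa in enumerate(sequence) if aa in rules]
    forms = []
    for combo in product(*(shifts for _, shifts in sites)):
        forms.append((round(sum(combo), 5),
                      {i: d for (i, _), d in zip(sites, combo) if d}))
        if len(forms) >= max_forms:
            break
    return forms
```

For a sequence such as `"MKSSK"`, with two lysines and two serines, the expansion yields 2^4 = 16 candidate forms, from fully unmodified to fully modified; a warehouse stores each form so a top-down spectrum can match any of them directly.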
With the existing ProSight PTM, the “shotgun annotated” database can be searched in a sequence tag mode, an absolute mass mode, or a hybrid of the two, where the program first obtains possible matches based on sequence tags, and then matches those candidates by absolute mass.
In addition, the program can search in a single-protein mode designed to be used once a candidate protein has been identified.
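The hybrid mode described above amounts to a two-stage lookup, which can be sketched in a few lines; the `(name, sequence, intact_mass)` warehouse schema and the mass tolerance here are hypothetical, chosen only to show the filter-then-rank structure.

```python
# Toy two-stage lookup in the spirit of the hybrid mode: filter warehouse
# entries by a short sequence tag, then rank survivors by intact mass.
# The (name, sequence, intact_mass) schema and tolerance are hypothetical.
def hybrid_search(warehouse, tag, observed_mass, tol=2.0):
    by_tag = [e for e in warehouse if tag in e[1]]          # stage 1: tag
    in_mass = [e for e in by_tag                            # stage 2: mass
               if abs(e[2] - observed_mass) <= tol]
    return sorted(in_mass, key=lambda e: abs(e[2] - observed_mass))
```

The tag stage throws out most of the warehouse cheaply; the absolute-mass stage then distinguishes modified forms of the same sequence, which is where the PTM-rich warehouse pays off.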
With the upgraded ProSight PC, an additional biomarker search mode is included. In this mode, the software specifically searches for protein fragments found in blood and tries to match them up with top-down mass-spectrometry data.
ProSight PC is available only commercially, Kelleher said. According to Amy Zumwalt, Thermo Electron’s proteomics marketing specialist, Thermo will be the exclusive distributor of the product. Details about pricing will be released at a later date, she said.
ProSight PTM, on the other hand, is available free-of-charge at the website http://prosightptm.scs.uiuc.edu. So far, about 100 people have used the software for their top-down proteomic analyses, Kelleher said.
Optimizing Existing Proteomic Search Engines
To optimize existing proteomic search engines such as Mascot and Sequest, John Yates, a professor of cell biology at the Scripps Research Institute, described a database-scoring program and method called PEP_PROBE.
“One of the limitations for database searching is that it assumes minimal errors in the database and minimal sequence variations,” Yates noted at PITTCON.
By using a hypergeometric distribution to make calculations, the PEP_PROBE program reduces the rate of false positive hits to 5 percent, according to a paper published in Analytical Chemistry in 2003.
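The flavor of such a hypergeometric score can be sketched as follows. Treat the spectrum as `observed` peaks falling into `bins` discretized mass bins, of which `predicted` contain theoretical fragment masses, and ask how likely it is to match at least `matched` of them by chance. This parameterization is illustrative; the exact formulation in the 2003 Analytical Chemistry paper may differ.

```python
# Sketch of hypergeometric match scoring: probability of matching at
# least `matched` of `observed` peaks by chance, when `predicted` of
# `bins` discretized mass bins hold theoretical fragment masses.
# Parameterization is illustrative, not necessarily PEP_PROBE's.
from math import comb

def hypergeom_pvalue(matched, predicted, observed, bins):
    total = comb(bins, observed)
    tail = sum(comb(predicted, k) * comb(bins - predicted, observed - k)
               for k in range(matched, min(predicted, observed) + 1))
    return tail / total
```

A smaller p-value means the match count is harder to explain by chance, so candidate peptides can be ranked by it and a cutoff chosen to control the false-positive rate.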
Similarly, Steven Gygi, an assistant professor of cell biology at Harvard Medical School, described a way to “get a handle on” false positive rates by using a decoy database that consists of forward and reverse sequences.
If a search algorithm generates a hit in the reverse-sequence part of the database, it is a known false hit, Gygi explained. He uses various methods, including adjusting the search window size, to keep false-positive rates at 1 percent or less.
“Having a handle on how much is real [in terms of matches] and how much is not real is really important,” said Gygi. “Using a reverse database is a big part of how we validate.”
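The reverse-database idea can be sketched in a few lines: count accepted decoy (reversed-sequence) hits as an estimate of false positives, then tighten the score threshold until the estimated rate falls below a target. The score representation and threshold search below are illustrative, not Gygi's exact procedure.

```python
# Sketch of target-decoy validation: hits to reversed sequences estimate
# the false-positive rate; raise the score cutoff until it is acceptable.
# Illustrative only -- not Gygi's exact procedure.

def estimate_fdr(hits, threshold):
    """hits: list of (score, is_decoy). FDR ~ decoy hits / target hits."""
    accepted = [h for h in hits if h[0] >= threshold]
    decoys = sum(1 for _, is_decoy in accepted if is_decoy)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0

def threshold_for_fdr(hits, target_fdr=0.01):
    """Smallest score cutoff whose decoy-estimated FDR meets the target."""
    for t in sorted({score for score, _ in hits}):
        if estimate_fdr(hits, t) <= target_fdr:
            return t
    return float("inf")
```

Because reversed sequences should never produce genuine matches, every accepted decoy hit flags roughly one false target hit, which is what makes the rate estimate possible without knowing the true answers.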
Can Databases Account for Variation in Individuals?
Both Yates and Gygi noted that biological variation between individuals within a species is something that databases fail to take into account. In response to that problem, Andrew Emili, an assistant professor of proteomics and bioinformatics at the University of Toronto, has developed software to correct for biological and technical variations.
“Basically, we repeat analyses again and again — a minimum of 20 times — and we work with computer scientists to account for random variation,” Emili explained. “Then we correct for that variance.”
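In spirit, such a correction might resemble the replicate-based filter below: estimate each protein's spread across repeated runs and discard measurements that fall far outside it. This is only a guess at the shape of the approach, not Emili's actual statistical framework.

```python
# Toy replicate-based correction: estimate a protein's spread across
# repeated runs and drop measurements beyond 2 standard deviations.
# Purely illustrative -- not Emili's actual statistical framework.
from statistics import mean, stdev

def corrected_abundance(replicates):
    """replicates: one protein's measured abundance across repeated runs."""
    m, s = mean(replicates), stdev(replicates)
    kept = [x for x in replicates if abs(x - m) <= 2 * s]
    return mean(kept)
```

With 20 replicates, as in the minimum Emili described, a single wild measurement is easily recognized against the spread of the other 19; with only two or three runs, no such correction is possible, which is one argument for the repetition.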
Emili noted that biological variation between individuals is as significant as variation due to technical issues.
“There’s lots of variance between individuals,” he said. “That’s a big reason why we developed a statistical framework to account for variance.”