In the hunt for biomarkers, protein informatics vendors like Genedata, Rosetta Biosoftware, and Agilent are ramping up their software to handle larger datasets from increasingly complex experiments and are working with academic labs to deepen statistical analysis on large sample sizes.
Last week, Genedata announced a “wide-ranging” protein biomarker collaboration with the University of Alabama at Birmingham’s Comprehensive Cancer Center Proteomics Facility.
Also last week, Rosetta Biosoftware launched version 3.2 of its Elucidator proteomics platform, which includes new workflows for biomarker discovery and translational research. Meanwhile, Agilent Technologies officials told BioInform that the company is upgrading the proteomics capabilities of its GeneSpring suite and is in the midst of creating a standalone proteomics software product called Mass Profiler Pro.
These vendors are trying to keep up with a trend that has emerged in proteomics over the last five years, in which experiments have moved from studies that compare one mass spec run to another to studies involving hundreds of patients along with several replicates to gain statistical power.
“Right now the whole industry is moving toward more complex experiments,” Andrey Bondarenko, head of proteomics R&D at Rosetta Biosoftware, told BioInform. Proteomics researchers are looking to differentiate between “good” and “bad” features in datasets of millions of features to decide which require follow-up investigation.
“You are trying to narrow down your problem as much as you can, and we build the statistical tools which permit scoring of these features,” Bondarenko said, for example, to see if peaks are “real” and reproducible across the replicates.
Genedata’s US managing director Jens Hoefkens also noted that labs are moving away from low numbers of samples, since they “can’t get any reliable statistics out of it.” The company’s Expressionist software “really shines” in experiments that involve hundreds of samples, he said, since it “has been designed from the ground up to be able to scale to these large datasets and provide sound statistical results.”
Several researchers confirmed that the large numbers of samples required for biomarker discovery and clinical proteomics are driving demand for improved statistical methods.
“I want to bring the clinical and mass spec [worlds] together with a high level of statistical analysis on the back end,” said James Mobley, who directs the Comprehensive Cancer Center Proteomics Facility at the University of Alabama. Finding a biomarker that can lead to a more directed cancer therapy is “what I live, eat, sleep, and dream about.”
Arthur Moseley, director of proteomics at Duke University’s Institute for Genome Sciences and Policy, said that he is looking for technology that “builds bridges between projects and between results from different laboratories,” which will “transform proteomics from one-off type experiments comparing test and control samples within a specific project into a systems biology experiment, where we can compare data obtained in laboratory A this month with data obtained in laboratory Z three months from now.”
Presently scientists are “making more critical assessments of their data qualitatively,” but the next “frontier,” he said, is addressing quantitative measurements in terms of accuracy and reproducibility in a statistically significant manner, thus enabling comparisons between datasets and between laboratories.
This, he said, will transform proteomics into a systems biology approach. “You cannot do that without robust software and hardware.”
Seeking a Real Signal
Moseley’s lab is using Rosetta’s Elucidator, which he said allows his team to “not only prosecute projects more efficiently, but to mine the data in much greater detail than with any software we have worked with before.”
Elucidator has “more than 20 customers,” split evenly between academia and pharmaceutical and biotech companies, Yelena Shevelenko, Rosetta Biosoftware’s general manager, told BioInform. “Big names” are among them, but nondisclosure agreements do not allow her to name them, she said.
Rosetta has recently added protein biomarker discovery service providers to its customer base, she said, but declined to disclose their names. “These are the kinds of companies that serve pharmaceutical companies as they outsource more and more of their research activity,” added Kristen Stoops, director of strategic marketing and alliances at Rosetta Biosoftware.
With the launch of Elucidator 3.2, Rosetta has added support for new instrumentation and data types, such as Waters’ IdentityE High Definition Proteomics System and Thermo Fisher Scientific’s LTQ ETD instrument.
The software currently doesn’t support some instruments, such as triple quadrupole mass spectrometers or MALDI TOF/TOF systems, but Bondarenko said the company is planning to enable Elucidator to integrate those types of results.
The backbone of Elucidator is a proprietary algorithm called PeakTeller, which allows researchers “to align hundreds of images, detect level of noise, remove background, and annotate this information,” Bondarenko said.
PeakTeller is “evolving all the time,” he said, particularly as the number of samples grows. Users want to “dial into the sample, to measure low abundance proteins,” he said. “With growing complexity of experiments, users need statistical power and robustness of the [software] system.”
In version 3.2, PeakTeller has gained the ability to leverage information from all the replicates in an experiment to score the quality of peaks, he said. It gives users a peak confidence score for feature filtering, so they can set thresholds and concentrate on the features they care about.
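The idea of scoring a peak by its behavior across replicates can be sketched in a few lines. PeakTeller's actual algorithm is proprietary, so the function below is only a hypothetical illustration: it rewards features that are detected in most replicate runs with consistent intensity, and the names, weighting, and 0.5 threshold are all assumptions.

```python
# Hypothetical sketch of replicate-based peak scoring; the scoring formula
# and threshold are illustrative, not PeakTeller's proprietary method.
from statistics import mean, stdev

def peak_confidence(intensities, n_replicates):
    """Score a detected feature by its reproducibility across replicates.

    intensities: peak intensities in the replicates where it was observed
                 (zero means the peak was not found in that run)
    n_replicates: total number of replicate runs in the experiment
    """
    detected = [x for x in intensities if x > 0]
    if len(detected) < 2:
        return 0.0
    detection_rate = len(detected) / n_replicates   # found in how many runs?
    cv = stdev(detected) / mean(detected)           # intensity variability
    # High score = seen in most replicates with consistent intensity.
    return detection_rate * max(0.0, 1.0 - cv)

# Filter features by a confidence threshold before follow-up investigation.
features = {"peak_a": [1200, 1150, 1300, 1250], "peak_b": [900, 0, 0, 40]}
scores = {k: peak_confidence(v, 4) for k, v in features.items()}
kept = [k for k, s in scores.items() if s > 0.5]
```

Here "peak_a", reproducible across all four replicates, survives the filter, while the erratic "peak_b" is scored to zero and dropped.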
Duke’s Moseley said that the new version of PeakTeller “does offer advantages” over the previous version. “I can see the improvement in the PeakTeller alignment in our data sets,” he said.
Elucidator 3.2 also includes new workflows to support stable isotope labeling with amino acids in cell culture, or SILAC, and other labeling technologies that are “getting very popular in phosphoproteomics,” said Bondarenko.
While there are some open source software tools that enable this analysis, Elucidator’s ability to find the peaks from the labels with PeakTeller without annotation “differentiates us from commercial competitors,” he said. “Most of the algorithms out there use annotation,” he said, and require users to know the peptide sequence.
Elucidator is built on a client-server architecture. Computationally intensive tasks are accomplished on the server side, which “provides scalability on many different levels: the size of the individual studies; the number of studies; the amount of data managed; and the number of concurrent users performing simultaneous data analysis,” Bondarenko said.
Each individual run of a sample on a high-resolution mass spectrometer can generate up to 10 gigabytes of data, he said.
Some customers have reported that large jobs take overnight to process, but Bondarenko said that this is not a bottleneck relative to the month or two months it often takes to acquire data from the instrument in a large-scale proteomics experiment.
The PeakTeller framework is built to “be able to scale,” allowing users to use parallel processing. If a study has 200 samples then the system needs to process 2 TB of image data “to detect real signals and separate them from noise,” Bondarenko said.
As these tasks are CPU- and memory-intensive, they would require supercomputing power “in the absence of a high-performance image processing framework such as that developed as part of Elucidator system,” which allows users to run the software on “reasonably priced hardware,” he explained.
Moseley’s lab at Duke runs Elucidator on a 16-processor server with 64 gigabytes of RAM, and has 28 terabytes of storage with another 16 terabytes on the way. “To do these types of reproducible quantitative and qualitative proteomic experiments requires a significant investment in terms of hardware and software,” he said.
“Rosetta does scale very well for us,” he said of Elucidator. “We have come across no limit in the number of samples that we can [process],” he said, adding that he has done studies with several hundred samples. “I have seen limitations with other software packages whose names I will not mention.”
However, increasing the number of samples to several thousands would lead to computational and IT storage bottlenecks, he said. Processing 60 samples means about 2 terabytes of data. “That’s not a Rosetta issue, just a background IT issue that needs to be considered,” Moseley said.
Stoops said that the company is looking to integrate Elucidator with Rosetta’s Resolver software for gene expression and its Syllego software for genotyping analysis.
There is already a “bridge” between Elucidator and gene expression data from Resolver, so users can co-analyze results, she said. “We will be expanding on that ability to cross-analyze different types of data with the integration of the Syllego system, which currently supports genotyping data and copy number variation, into a common platform.”
Elucidator currently has a “limited capability” to handle clinical data, Bondarenko said, but the company does have plans to integrate clinical content, such as patient treatment histories, “a little further out than 2009,” Stoops said.
UAB’s Mobley said that his team has evaluated various proteomics software packages and collaborates with hardware vendors such as GE Healthcare, Bruker, and Thermo Fisher.
To test software and instruments, he runs samples with bovine serum albumin digested into 50 peptides of varying intensities and adds it in increasing concentrations to liver lysate, a complex tissue mixture. “We found conclusively that depending on how you preprocess, you can find all kinds of different ions that aren’t there,” Mobley said.
Using that standard set, “we give it a fair shake,” he said, adding he is willing to share the generated data with other researchers.
In biomarker research many variables can skew studies. “The one thing that is not being done correctly across the board is spectral preprocessing, [which] includes peak picking and binning,” he said. Peaks may shift slightly in mass or retention time, but those shifted signals may not actually be separate peaks. “It takes quite a lot of algorithms to figure out which ones really fit together,” he said.
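The matching problem Mobley describes — deciding whether slightly shifted peaks from different runs "really fit together" — can be illustrated with a simple tolerance-based grouping. The greedy strategy and the m/z and retention-time tolerances below are assumptions for illustration, not any vendor's method; production tools use far more sophisticated alignment.

```python
# Illustrative sketch of tolerance-based peak matching across runs; the
# tolerances and greedy grouping are assumptions, not a vendor algorithm.
def group_peaks(peaks, mz_tol=0.01, rt_tol=0.5):
    """Group (m/z, retention-time) peaks from different runs that likely
    represent the same ion, despite small mass/time shifts."""
    groups = []
    for mz, rt in sorted(peaks):
        for g in groups:
            ref_mz, ref_rt = g[0]   # compare against the group's first member
            if abs(mz - ref_mz) <= mz_tol and abs(rt - ref_rt) <= rt_tol:
                g.append((mz, rt))
                break
        else:
            groups.append([(mz, rt)])
    return groups

# Two runs report the "same" ion with slight shifts; a third peak is distinct.
peaks = [(500.301, 12.40), (500.305, 12.55), (612.210, 20.10)]
groups = group_peaks(peaks)
```

The two shifted observations of the first ion merge into one group, while the chemically distinct peak stays separate — the kind of decision that, multiplied across millions of features, demands careful algorithms.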
Another variable, in non-tagged proteomics, is how and when researchers normalize their data, which can lead to variability that “has nothing to do with the proteome … it’s an anomaly introduced by preprocessing.”
Mobley said that one benefit of Genedata’s Expressionist is that it offers a variety of data normalization schemes and can “peak-pick and bin appropriately.” The “beautiful aspect” of the software, he said, is that “you can do anything in any order you want.”
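Median normalization is one common scheme of the kind Mobley alludes to; the sketch below is a minimal illustration of the idea, assuming per-run intensity lists, and says nothing about which schemes Expressionist actually implements. Choosing the scheme, and where it sits relative to peak picking, is exactly the ordering decision he describes.

```python
# Minimal sketch of median normalization: scale each mass-spec run so its
# median feature intensity matches the global median across all runs.
# Purely illustrative; not a description of Expressionist's implementation.
from statistics import median

def median_normalize(runs):
    """runs: list of per-run feature-intensity lists. Returns scaled copies."""
    grand = median(x for run in runs for x in run)   # global median intensity
    normalized = []
    for run in runs:
        factor = grand / median(run)                 # per-run scaling factor
        normalized.append([x * factor for x in run])
    return normalized

# Run 2 was acquired at roughly half the signal level of run 1; after
# normalization the systematic offset is removed.
runs = [[100.0, 200.0, 300.0], [50.0, 100.0, 150.0]]
norm = median_normalize(runs)
```

Done badly or at the wrong step, such rescaling can itself manufacture differences between samples — the preprocessing "anomaly" Mobley warns about.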
He also cited the statistical power of the software as an advantage. “I do my own stats, and I know when someone is pulling my leg,” he said. “These guys know what they are talking about.”
Mobley noted that Rosetta also offers good statistical capability, but it doesn’t have some of the features that Expressionist does.
“Rosetta does a great job if you have high-resolution mass spectrometry, LC/MS on an LTQ [linear trap quadrupole] Orbitrap-type system,” Mobley said, but “it doesn’t have the back-end for clinical [data].”
He chose Genedata for his lab because he can import genomics data as well as clinical databases into the Expressionist platform. “I can analyze any patient information, any chemistries, epidemiologies, in addition to the genomics and proteomics,” he said.
“More importantly, from the proteomics end, I can use any kind of instrument. I can use MALDI for tissue analysis, LC/MS, maybe lower resolution, which most of us have access to, and then we can put in, of course, high resolution [data], which Rosetta handles very well,” Mobley said.
Another facet of Genedata’s Expressionist, he said, is that the firm eschews a “black box” approach to its software. “With Genedata, at every step of the way in the workflow you can see what is going on, you can change what’s going on, you can manipulate to fit the type of spectroscopy you are doing. At the end of the day you can evaluate what’s going on,” Mobley said.
The key advantage of effective software, Mobley said, is the amount of time it can save in the proteomics workflow. As an example, he cited a recent request by a surgeon who gave Mobley 50 urine samples — 25 from patients with chronic pancreatitis and 25 with prostate cancer — with a request to see if proteomic analysis could reveal markers to differentiate the two.
Mobley said he was able to extract low molecular weight proteins in about an hour, analyzed the samples by MALDI-profiling in another hour, ran the samples on the LTQs overnight, “then analyzed [the data] on Genedata, everything: MALDI and LTQ data, which you can’t do with Rosetta, and do all that by the next day, and we did.”
The analysis, which lasted a week, yielded no statistical differences between samples, he said. Although that may not have been the desired research result, “we didn’t spend a year doing it, we spent a week. Big deal.”
In terms of software development, he feels the firms need to continue to make the software easier to use, and that the “back-end” still is lacking, “to get all that epidemiology, to add in the clinical information, like the grade of disease, age, gender, ethnic background, to be able to bring the epidemiology in along with any genomic analysis, proteomic analyses. It is a huge undertaking,” he said.
Genedata’s Hoefkens said that one reason the company is collaborating with Mobley is because of the diversity of technologies he applies: various types of mass spectrometry technologies, including LC-MS and MALDI.
In labs, scientists often face a situation in which “you end up with as many pieces of software as you have instruments,” Hoefkens said, whereas Expressionist is intended as an “aggregator, integration platform.”
The strength of Expressionist is its “unparalleled scalability,” he said. “The preprocessing of these hundreds of gigabytes of data from mass spec data is really where we bring something very unique to the table.”
Almost all the other tools in the marketplace, which he declined to name, “work on the idea of taking data, doing peak-picking and then working on peaks, doing matching, background subtraction, alignment.
“What we do is we work on the raw data level, we don’t do any reduction in data until the very last step when we want to do statistical analysis,” he said.
The possibility of working on the raw data in this fashion is an option Duke’s Moseley called “interesting,” adding he would like to see “qualitative metrics” showing that “data processed through their system is quantifiably different from the data processed through another system.”
Genedata’s Hoefkens acknowledged that some researchers criticize the approach of working with raw data as “too slow.” Users raising that criticism about speed “just don’t realize how much that software is doing,” Mobley said. Hoefkens said that his firm has gone to “great lengths” to assure the software performs well.
“You get better data out of this approach. … We’ve done the comparative studies with customers and you get them faster,” Hoefkens said. “Even processing all of the raw data we are still faster than some of the packages out there, but they move to reduced datasets much, much quicker.”
Mobley said he needs to store about four terabytes of data “every couple of months” and has an IBM BlueGene in his facility. “I’ve got 2,000 [nodes] on the BlueGene I can use at any one time, get at an answer within seconds,” he said.
However, he noted, “we only recently started running this on a higher node computer.” His group previously used a high-speed workstation that cost around $4,000.
“People say [proteomics software] is just too expensive,” said Mobley. He counters by saying that although he still needs to consult with statisticians and IT specialists, if he wrote his own tools, he would need the expertise of a full-time staff. “That’s going to cost good money every year,” he said.
At a recent scientific meeting, Hoefkens heard a scientist explain that it had taken his team about a month to acquire data from an LC/MS experiment series. “That’s reasonable,” he said. The scientist then said it took his group several months to analyze the data. “And I thought that is just ridiculous,” Hoefkens said.
“My approach is that it should always take you longer to get the data off the instrument, than it should take you to analyze them,” he said. “So what we have — large datasets, hundreds of gigabytes — we process in a matter of hours, and that is not on a BlueGene, that is on regular hardware, something that costs a couple of thousand dollars, [like] a small Linux PC.”
Going All Omics
For 2009, Agilent is planning a number of changes for its GeneSpring software suite, explained Thon DeBoer, Agilent’s bioinformatics product manager. Since Agilent acquired the product from Silicon Genetics in 2004, it has continued to develop it, he said. The suite includes GeneSpring GX 10.0 for desktop expression analysis, GeneSpring GT 2.0 for genotyping data, GeneSpring MS 1.2 for biomarker discovery using mass-spec data, and GeneSpring Workgroup 10.0.
Through Agilent’s development partnership with Strand Life Sciences, “we wanted to focus on what we call multi-omics analysis,” DeBoer said, but “we are not there today.”
The company is working on a tighter integration between these different modules that will enable researchers to look at proteins, metabolites, and transcripts in a “systems biology approach” by pulling together results from the entire GeneSpring suite.
The new platform includes “dramatic changes to the user interface” to prepare it for the multi-omics data analysis world, he said. The transformation began with GX 10.0, and currently Agilent is adapting GeneSpring MS for the new platform, which will be ready sometime next year, he said.
The new version of GeneSpring MS, which will be renamed Mass Profiler Pro, will also contain some new algorithms to do “recursive peak identification,” which is a two-pass identification step that improves data quality, he said.
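One way to picture a two-pass identification step is sketched below. This is a hedged illustration only — Agilent has not published the algorithm here, and the function names, thresholds, and dictionary-based feature representation are all assumptions: a first untargeted pass keeps only strong features per sample, then a second, targeted pass re-searches every sample for the union of first-pass features with a relaxed threshold, recovering weak observations a single pass would miss.

```python
# Hedged sketch of two-pass ("recursive") feature finding across samples.
# All names and thresholds are illustrative, not Agilent's implementation.
def untargeted_pass(sample, threshold=100.0):
    """Pass 1: keep only strong features (m/z -> intensity) in one sample."""
    return {mz: inten for mz, inten in sample.items() if inten >= threshold}

def targeted_pass(samples, threshold=100.0, relaxed=20.0):
    """Pass 2: re-search every sample for the union of pass-1 features,
    accepting weaker evidence once a feature is known to exist somewhere."""
    first = [untargeted_pass(s, threshold) for s in samples]
    consensus = set().union(*(f.keys() for f in first))
    matrices = []
    for sample in samples:
        matrices.append({mz: sample[mz] for mz in consensus
                         if sample.get(mz, 0.0) >= relaxed})
    return matrices

# Sample B observes m/z 500.3 only weakly; pass 1 alone would drop it,
# but the targeted second pass recovers it because sample A confirms it.
samples = [{500.3: 150.0, 612.2: 30.0}, {500.3: 45.0, 612.2: 110.0}]
matrices = targeted_pass(samples)
```

The payoff of the second pass is a more complete data matrix — fewer missing values across samples — which is one plausible reading of the "improves data quality" claim.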
Mass Profiler Pro will be part of GeneSpring but can be bought as a standalone product or bundled with Agilent hardware because it will be “tuned” to that hardware, DeBoer said. “The reason for that is that we are also going to market the product separately apart from the systems biology approach, for people who just want to do mass spec analysis,” he said.
Mass Profiler Pro will be able to analyze data from “a variety of vendors” but it will be “very much optimized for the Agilent hardware,” DeBoer said.
Expression analysis has played an important role for GeneSpring, said DeBoer. “It has been well regarded for its special visualization and statistics in the gene expression world,” he said. “[Users] have a lot of trust in our statistics in the analysis of gene expression data and that translates well into our move into the biomarker profiling workflows,” he said.
GeneSpring MS differs from the Genedata and Rosetta offerings in that “we do the compound identification first,” he said.
“Rather than focus on peaks, we find the compounds first and do the alignment later,” he said. “We do indeed [have] a data reduction step before we do the statistics,” which “gives us higher quality data, and allows us to analyze many more peptides or metabolites.”
“You can analyze many more compounds than you could by using the raw spectral data,” he said. “Analyzing 160 runs is almost impossible to do with the raw data for a desktop application.”
The desktop application sets GeneSpring apart from other applications, he said. “64-bit machines are pretty much de rigueur and people have lots of memory,” he said, adding that his firm has been doing studies with several undisclosed academic labs with hundreds of thousands of peptides in 160 runs.
Agilent’s software platform is also priced lower than competing software, so it is not “the capital investment” of products by firms such as Rosetta or Genedata, he said, though he declined to disclose the cost of the software.
Like its rivals, Agilent is also looking to integrate other data types, such as patient data, with the platform. “Special calls to a database with patient data can be attached to the information within GeneSpring,” he said. “The algorithms to make use of that [function] are already there,” and further integration is planned.