Proteinformatics: The New Trend in Tools


Eleven stories up, in the heart of New York City's fashion district, on a floor shared with a clothing designer, there's not an inch to squeeze another cubicle into ProteoMetrics' 17-person headquarters.

"We thought we would grow in a wise way within our means," says company chairman Brian Chait. But that is no longer an option for the software vendor, which got its start four years ago with an NIH small-business grant. Where there's a trend, there is often a crowd.

ProteoMetrics' products, based on algorithms first developed in Chait's Rockefeller University lab and rendered commercially viable by David Fenyo, are tools for identifying proteins. And the company is just one of several positioned to capitalize on new biology's latest craze, high-throughput proteomics, by providing bioinformatics for analyzing proteins — or proteinformatics.

With others quickly moving into the territory and proteomics exploding, "the ante is up," says Chait. Aside from ProteoMetrics, at least two other startups offer options. Matrix Science, based in London, launched its first product in 1999. And just a few months ago, Swedish company BioBridge Computing entered the ring, turning a mass spec software consultancy project for a pharma client into a mission to solve a problem it says others do not adequately address.

The sexy mass spec

At the heart of this trend stands an instrument once relegated to a few obscure analytical chemistry and physics labs — the mass spectrometer. "Until a few years ago," says MDS Proteomics founder Matthias Mann, who was practicing proteomics well before it was in vogue, "the mass spectrometer was decidedly unsexy."

In the late 1970s when Leigh Anderson, who now runs Large Scale Proteomics, and his father Norman were campaigning for a human proteome project, mass spec was considered irrelevant for characterizing proteins.

It was the electrospray and MALDI breakthroughs of the 1980s, and continued improvements in speed and accuracy, that allowed a single experiment to generate data from hundreds of proteins. Today, proteomics facilities, replete with banks of mass spectrometers, are all the rage. Virtually every big pharmaceutical company has built one, and pure-play proteomics companies are popping up like mushrooms after a rainstorm. Mass spec sales are booming, and many facilities tout throughputs of around 50,000 proteins a week. These facilities, says Mann, can generate in a matter of a week "enough data to confuse you for the next six months."

If next-generation instruments, including Applied Biosystems' TOF/TOF, are all they're cracked up to be, the number of spectra to process will dramatically increase. Celera Genomics, for instance, claims it will be able to identify one million proteins a day. "Once you get to the point where you have much too much data for a person to sit down and run that searching interactively on a computer, that's when you get into a different level of mass spec software," says Anderson.

And that's why there's a niche market taking shape for the kind of software that companies such as ProteoMetrics sell. For processing the terabytes of data mass spectrometers spit out, these specialty vendors offer alternatives to the software that traditionally comes packaged with the instruments, much of which veteran users consider sub par. "It's very good generally to have more than one option out there," says Chait. "[Competition] pushes things along."

Mass-spec moneymakers

ProteoMetrics, which has a branch office in Winnipeg and will soon move its Manhattan staff to bigger digs, has been profitable since it opened shop nearly four years ago. Fenyo, the company's president and CSO, is gearing up for increased demand. He's aggressively seeking equity investments, plans to add 100 employees within 18 months, and predicts annual revenues of $20 million within two years.

But a recent US Bancorp Piper Jaffray report estimates the current size of the entire protein informatics market, including databases and downstream protein informatics tools, at only $45 million.

Mann, whose company is one of ProteoMetrics' biggest customers, believes that the days for boutique vendors focusing just on mass spec informatics may not be long: "Even if you earn some money now, what's going to be the situation in three years? The mass spec vendors will either write software themselves or get somebody to write [it] for them."

Guessing that most users would opt to get their software packaged with the mass spectrometer, Mann says, "I don't think there's a big space for a lot of companies thriving on developing this software."

Instrument makers such as Micromass already appreciate the value of the analysis tools packaged with their machines. "Anybody in a shed with half a million dollars can start a business making MALDI mass spectrometers. It's a commodity item," says Mark McDowall, marketing manager for the market-leading mass spec vendor. Micromass' packaged software accounts for "at least half the reason" customers choose his product, and McDowall doesn't think private vendors have much to offer. "We can do a much better job if we do it all ourselves," he says.

But users don't necessarily agree. Walter Blackstock, worldwide head of proteomics for GlaxoSmithKline, says, "The pace of software development is such that you usually get faster development from somebody devoted just to that. It's not really a negative thing about manufacturers, it's just that their core business is making machines. Subsequent stages are more effectively done by third-party suppliers."

Spectrum of software

When Fenyo finished his PhD in physics at Sweden's Uppsala University in 1991, proteomics wasn't much of an industry. Bored with mass spec experiments that just reconfirmed decades-old theories, he joined Chait's Rockefeller lab as a postdoc to pursue his interest in proteins, just as the original wave of software for identifying proteins from mass spectra was beginning to emerge.

In the early '90s a handful of labs independently developed similar software. "What we saw during this time was that the software that was available was not really what we wanted," says Fenyo. "So we started developing our own, mainly to use in our everyday lab work."

This was also when the Internet was becoming a mainstream phenomenon and a growing medium for researchers to share their work. After posting their software, Fenyo and Chait were surprised by the hundreds of downloads their Web page got each day. "We saw a real need for the kind of tools for mass spectrometry we were making," says Chait. In 1997, Chait, Fenyo, and Ron Beavis, another former lab member, decided to create ProteoMetrics.

They weren't the first to take such software out of the lab. Thermo Finnigan had exclusively licensed Sequest, a popular database search algorithm for uninterpreted tandem MS spectra out of John Yates' lab, in 1994. "It has been discussed that maybe we should distribute it to whomever wants it," says Iain Mylchreest, director of product development for Thermo Finnigan. But for now it's only available bundled with the operating software for the company's instrument.

Meanwhile, Darryl Pappin of the Imperial Cancer Research Fund was writing the code for a similar program called Mowse. In 1998 after Thermo BioAnalysis shut down its MALDI-TOF operation, managers John Cottrell and David Creasy licensed Pappin's work, brought it up to commercial grade, and rechristened it "Mascot" — their new company's first product. Thus, Matrix Science became the second company to sell instrument-independent protein identification software.

Customer in control

Fenyo says the reason ProteoMetrics has never sought venture funding was to keep the company's direction at the founders' discretion. "We would like to have control over what we're doing," he says. And he believes his customers deserve the same.

"Here is a typical protein digest," says the 36-year-old, mild-mannered Fenyo, demo-ing his software. "If you zoom in here, each of these peaks is one peptide." He drags and drops a folder with mass spectra of several hundred proteins onto an icon on his desktop. Within a few seconds the software picks the peaks, sends them to a search engine, generates a report, and repeats the process for each spectrum.

Many of the instrument makers offer software with similar function, but, Fenyo says, "most of the mass spec manufacturers make it work with their own data and not with any of their competitors'."

ProteoMetrics' XML-based software accepts data from all of them. That's important because virtually no proteomics facility sticks to one brand of mass spec. "All the manufacturers are continually advancing and they tend to leapfrog one another. Therefore it makes sense to work on the best machines in each class at a given time," says Anderson.

ProteoMetrics doesn't limit its customers to its own search engines in its Oracle-based mass spec data management platform, Radars. "If people want to buy other search engines, we'll just integrate into our system," says Fenyo. "It comes with our ProFound and Sonar, but you can also put in Sequest and Mascot or other search engines."

Of course, he insists that his software is better than Matrix Science's Mascot or Sequest: "Our algorithm is more efficient," Fenyo claims. "It just does it faster. Yes, people like Sequest. But I think it's that they don't know better."

Nevertheless, he knows customers want options. GSK's Blackstock explains: "If you're identifying proteins with two packages it gives you more confidence in your results." A Matrix Science user, Blackstock says he is considering adding ProteoMetrics' software to his arsenal.

New kid on the block

A little more than a year ago AstraZeneca got fed up with the lack of good software to extract the peaks that represent the relevant protein fragments from the continuous spectra its machines were churning out. These peaks are converted into a list of masses before the search engines explore the databases to identify proteins. Back in the days when running one sample at a time was the norm, this was relatively easy to do.
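The peak-extraction step described above can be illustrated with a deliberately simple sketch. This is not Pepex or any vendor's algorithm; it assumes a spectrum arrives as two parallel lists (m/z values and intensities) and treats any local maximum above a threshold as a peak. The function name `pick_peaks` and the `min_intensity` parameter are hypothetical, and production software would also have to handle noise, baseline drift, and isotope patterns.

```python
def pick_peaks(mz, intensity, min_intensity=0.0):
    """Return the m/z values of local intensity maxima.

    A point counts as a peak when it is higher than its left neighbor,
    at least as high as its right neighbor, and at or above min_intensity.
    Illustrative only: real peak pickers are considerably more careful.
    """
    peaks = []
    for i in range(1, len(intensity) - 1):
        rises = intensity[i] > intensity[i - 1]
        falls = intensity[i] >= intensity[i + 1]
        if rises and falls and intensity[i] >= min_intensity:
            peaks.append(mz[i])
    return peaks


# Toy continuous spectrum with two bumps (made-up numbers).
toy_mz = [500.0, 500.1, 500.2, 500.3, 500.4, 500.5, 500.6, 500.7, 500.8]
toy_intensity = [0, 1, 5, 1, 0, 2, 9, 3, 0]

mass_list = pick_peaks(toy_mz, toy_intensity, min_intensity=4)
print(mass_list)
```

The resulting list of masses is what a search engine would receive. Raising `min_intensity` trades sensitivity for fewer spurious peaks, which is the kind of tuning that becomes painful at thousands of samples a week.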

Fredrik Nilsson, assistant director of cell biology and biochemistry at AstraZeneca, says he's been bugging the instrument makers to improve their algorithms to no avail. "We've been on them about this for three years," he says. "And in the beginning they had no idea even what we were talking about."

Matrix Science wasn't much help either. "We use Mascot but it can't pick the peaks accurately," says Nilsson. (Matrix Science acknowledges this weakness in Mascot and is now collaborating with German company BioVisioN to address it.)

So AstraZeneca enlisted a consulting company, Chiral Data, to draft a custom algorithm. Out of that project BioBridge was born, with the resulting Pepex its first product. Now BioBridge is readying to release Piums, its own protein identification software. The peptide extraction program itself costs $6,000. Piums packaged with Pepex will be priced at around $12,000. CEO Martin Waleij says the tool is in beta at Aventis and three biotech companies.

Even so, some major would-be customers say they wouldn't give BioBridge a shot at their business. "I've not dealt with them and I don't intend to deal with them," says GSK's Blackstock. "We would steer clear at this point and let it develop in an academic environment."

Nilsson, who says AstraZeneca is installing the BioBridge software worldwide, counters that the academic roots of the other programs are precisely their downfall. "At the university you will not try to run a thousand samples a week, so they haven't seen the problems of turning the software into production systems."

Proteinformatics resellers

Instrument makers too are beginning to recognize the value of choice to their customers. Bruker Daltonics offers Mascot as third-party search engine software and, according to software development manager Herbert Thiele, intends eventually to offer others such as ProFound.

Shimadzu and, more recently, Applied Biosystems/MDS Sciex have also signed on to resell Mascot. Amersham Pharmacia Biotech is as yet the only manufacturer selling ProteoMetrics' software, but ProteoMetrics and BioBridge have ambitions to see their search engines distributed by instrument makers.

But instrument vendors point out that search engines are just a piece of a larger picture. "It's like having an engine in your car," says Paul Danis, senior manager at Applied Biosystems' MALDI facility in Framingham, Mass. "You're not going to go anywhere unless you have a transmission and a clutch. People use just the standalone search engine if they have a low number of samples to look at."

For industrial-scale proteomics, that doesn't cut it. ABI sells the Mascot search engine as part of a platform called Proteomics Solution 1, which also includes a sample prep robot, a MALDI-TOF mass spec, and a sample-tracking database. It costs between $300,000 and $400,000, depending on the instrument model.

Virtually all instrument makers offer some form of a server-client platform with search engines and sample tracking built in. Micromass, for example, with the launch of version 2.0 of its ProteinLynx Global Server in October, will offer such a product. And Proteome Systems, in collaboration with Shimadzu Kratos and Sigma-Aldrich, is set to release a complete ready-to-wear proteomics platform from sample prep through 2D gels, mass spec instruments, bioinformatics, and databases.

But Fenyo says that while ProteoMetrics' products might be more limited, it can integrate its data-management system into any LIMS the customer chooses, with the added benefit of instrument independence. "Our software is the only one on the market that can work with all the different kinds of manufacturers," he says. "It's clear that for a while most companies are going to buy mass spectrometers from several different companies."

Fenyo says that's his big advantage: "Instead of people having to sit and analyze the data of different kinds of software, they can instead use one software to analyze all their data."

Company: Micromass (Division of Waters)

Software: ProteinLynx Global SERVER

Version 1.1: scalable, client-server search engine for automated high-throughput searching of protein and genomic databases with MS and MS/MS data.

Version 2.0: proteomics platform that performs database searching, post-translational modification analysis and de novo sequencing. Includes a sample tracking system.

Distribution: With Micromass instrument and through the ProteomeWorks System in alliance with Bio-Rad.

Throughput: For queries of the NCBI NRDB protein databank on a single processor: 0.5 to 4 seconds per ESI or MALDI MS-MS spectrum.

Searching an entire MALDI MS spectrum (containing more than 3,000 peaks) ranges from 5 to 40 seconds.

Price (in US): $15,000 to $100,000+, depending on hardware configuration.

Version 1.1 launched June 2001
Version 2.0 available October 2001

Company: BioBridge Computing
Software: Pepex
Peak extraction software
Launched Spring 2001

Piums (includes Pepex)
Protein identification software
Launch expected Fall 2001

Company: Compugen

Software: ProtoCall

Current release identifies proteins using MS data.

Next release, expected by year's end, will also search databases using MS/MS and peptide sequence tag data and will also include a peak extraction program.

Distribution: Through In the near future, it will be available for installation at the customer's site as well as part of Compugen's Gencarta database product or as a standalone application.

Beta testers: Max Planck Institute, Proteomics Center of the University of South Denmark, Australian Proteomic Analysis Facility, Weizmann Institute of Science, Dr. Lottspeich's laboratory, and Vision Technologies.

Throughput: Provides protein identification from raw spectra or a peak list through the Web in less than 30 seconds, on average.

Price: Protein identification using both public and proprietary databases on website is free.

Launched on June 19th.

Company: Amersham Pharmacia Biotech

Software: Ettan MALDI-TOF software

Includes ProteoMetrics search engine and Oracle database in a Windows NT format. Supports both peptide mass fingerprinting and seamless post-source decay searches integrated into the Scierra Laboratory Workflow Software for tracking of samples from sample preparation to protein identification.

Distribution: It is only sold with the Ettan MALDI-TOF via Amersham Pharmacia Biotech.

Throughput: Approximately 60 completely automated protein identifications/hour (including data acquisition, seamless database searching and result presentation).

First introduced in October 2000; launched with new Windows NT software May 2001.

Company: Matrix Science

Software: Mascot

Peptide mass fingerprint, MS/MS, and peptide sequence tag searches.

Supports data formats from all leading manufacturers. Available for Windows 2000 / NT, Linux, Tru64 Unix, Solaris, and Irix.

Distribution: Direct sales and through Bruker Daltonics, Applied Biosystems/MDS Sciex, Shimadzu, and Infocom (Japan only).

Customers: Nine out of the top 10 pharmaceutical companies. GeneProt runs Mascot on clusters of more than 1000 Compaq Alpha processors.

Price: Web access is free. Entry level for intranet license is approximately $10,000.

Launched in 1999.

Company: Applied Biosystems/MDS Sciex

Software: BioAnalyst

Core applications include peptide mass fingerprinting, automated Bayesian reconstruct tools, theoretical proteolytic digest generation, protein sequence browser, automated de novo sequencing, automated MS tag finding and database search, and integrated database searching using PepSea. Compatible with Mascot.

Pro ID

Can search 10,000 spectra in less than 45 minutes at a speed of more than three spectra per second against a non-redundant database from batches of LC/MS/MS files.

Pro ICAT

Automated quantitative protein expression analysis and protein identification when using the ICAT reagents kit.

Price (in US):
BioAnalyst: $10,000
Pro ICAT: $7,000
Pro ID: $5,000

BioAnalyst introduced March 2001.
Pro ICAT and Pro ID will be launched this summer.

Company: Thermo Finnigan

Software: TurboSEQUEST
Cross-correlates uninterpreted MS/MS mass spectra of peptides from protein or nucleotide databases.

Distribution: As a layered application within the Xcalibur data management and operating software of Thermo Finnigan mass spectrometers.

Licensed in 1994 from the University of Washington.

Company: Bruker Daltonics

Software: SNAP (Sophisticated Numerical Annotation Procedure)

Mass spec peak extraction algorithm.

Peptide mass fingerprint, MS/MS, and peptide sequence tag searches using Mascot.

Web-based client-server platform for storing, organizing and analyzing proteomic data using relational databases. It is integrated with biotools.

Company: ProteoMetrics

Software: Knexus

Includes ProFound for peptide mass fingerprint searching, Sonar for MS/MS searching, and peak extraction program. Starts at $15,000.

Radars

Oracle-based client-server data management system. Includes Knexus applications.

Starts at $95,000.

Access to search engines free on Web.

Customers: 40 pharma, biotech and academic and government labs. Biggest customers are Large Scale Proteomics and MDS Proteomics.

Three methods commonly used to identify proteins by searching databases

Peptide Mass Fingerprints

This method is the simplest of the three. Bill Henzel of Genentech first proposed it in 1991 and several academic labs developed software using the approach within the next few years. An enzyme cuts a protein at specific amino acids. The combination of masses of the resulting peptides represents a unique fingerprint. The algorithms then compute the theoretical peptide masses for every protein in a database as if it had been digested by the same enzyme, and compare them with the measured masses to find the best match. This identification method does not work well with protein mixtures.
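A minimal sketch of the fingerprinting idea, under stated assumptions: the enzyme is trypsin (cutting after K or R unless the next residue is P), peptide masses are neutral monoisotopic masses, and candidates are ranked by a plain count of matching masses. The function names, the 0.2 Da tolerance, and the two toy sequences are all hypothetical; commercial engines such as ProFound or Mascot use probabilistic scoring rather than a simple count.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except before P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def rank_proteins(observed, database, tol=0.2):
    """Rank database entries by how many observed masses they explain."""
    scores = []
    for name, seq in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(seq)]
        hits = sum(
            any(abs(obs - theo) <= tol for theo in theoretical)
            for obs in observed
        )
        scores.append((name, hits))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy database with made-up sequences, not real proteins.
toy_db = {"protA": "GASPKVTCLRNDQ", "protB": "MHFKYWER"}
measured = [peptide_mass(p) for p in tryptic_digest(toy_db["protA"])]
print(rank_proteins(measured, toy_db))
```

Because the score is just a count of coincidences, a mixture of two proteins would spread its masses across two fingerprints and weaken both matches, which is one way to see why the method struggles with mixtures.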

Peptide Sequence Tags

This method, developed by Matthias Mann and Matthias Wilm at the European Molecular Biology Laboratory in 1993, uses MS/MS data. The peptides collide with gas molecules and break apart in such a way that each fragment is one amino acid longer than the next. Subtracting the masses of neighboring fragments reveals a short stretch of the peptide's sequence. Using the partial peptide sequence together with mass data allows for more specific probing of databases.
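The subtraction step can be sketched as follows, assuming an idealized ladder of singly charged fragment masses from one ion series with no missing or extra peaks. The function name `read_tag`, the 0.02 Da tolerance, and the example ladder are hypothetical. Leucine and isoleucine have identical masses, so the sketch reports both as "L", and the tag reads in the direction of whichever ion series the ladder comes from.

```python
# Standard monoisotopic residue masses in daltons ("L" stands for Leu or Ile).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_tag(ladder, tol=0.02):
    """Turn a ladder of fragment masses into a partial sequence.

    Each gap between neighboring fragments is matched to the closest
    residue mass; gaps that match nothing within tol become "?".
    """
    ordered = sorted(ladder)
    tag = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        gap = heavier - lighter
        best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - gap))
        tag.append(best if abs(RESIDUE_MASS[best] - gap) <= tol else "?")
    return "".join(tag)


# Made-up ladder: an arbitrary starting fragment plus G, then A, then V.
toy_ladder = [200.0, 257.02146, 328.05857, 427.12698]
print(read_tag(toy_ladder))
```

A three- or four-residue tag like this, combined with the masses on either side of it, is what narrows the database search so sharply.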

MS/MS Ion Search

John Yates and Jimmy Eng, in 1993 at the University of Washington, figured out how to search the databases with MS/MS data without first interpreting the data to obtain sequence. Nobody thought this was possible. In fact, Yates says his paper describing the algorithm was rejected by two journals before the Journal of the American Society for Mass Spectrometry accepted it.

Buyer Beware: Proteinformatics Isn't Foolproof

No matter how a protein identification tool is packaged, chances are it's neither foolproof nor fully automated. "I would be hesitant about taking those identifications at face value," says Matthias Mann.

The reason? In nucleic acid searches, which rely on a linear string of bases, if several bases are off you can still get many useful hits in a database. A protein's identity, on the other hand, is encoded in its fragments' mass — an exacting and unforgiving measurement.

"If a mass is 912, it is 912. There is no discussion about it, which is good when it gives you a positive identification, because you know with almost 100 percent certainty that you are correct," says GlaxoSmithKline's proteomics guru, Walter Blackstock.

But a single amino acid change, a contaminant, or a post-translational modification can each change the mass of a peptide. An expert must then scour the data to verify a protein identification. "It still requires quite a degree of skilled human intervention," Blackstock says. It is especially difficult when using peptide mass fingerprinting to identify a mixture of proteins.

Another source of ambiguity is that the algorithms don't search against actual mass measurements, but rather compute the masses of peptides of proteins in the database on the fly.

David Fenyo concedes that real automation is still some time away. "Everything gets analyzed automatically first and all the data get stored in the database," he says. "But then one can go in and manually look at the ones that are in the gray zone."

Despite the shortcomings of the programs, Blackstock argues that the focus on improved technology is overplayed. "We have tools these days that are so powerful that we could grow for another decade. What really needs attention is the upstream and downstream biology."

— AS
