A true bioinformatics veteran, David States joined the National Center for Biotechnology Information when it was founded in 1988, and moved to Washington University, St. Louis, in 1992, where he played a key role in the university’s work on the Human Genome Project. In 2001, States left Wash U. to join the University of Michigan as director of bioinformatics, where he is now immersed in his next big bioinformatics challenge: coordinating data collection for the Human Proteome Organization’s Human Plasma Proteome Project. This effort, currently in its pilot phase, plans to characterize as many proteins as possible in human plasma and serum, using a variety of technology platforms. BioInform spoke to States recently, following the project’s initial data submission deadline, to discuss his first impressions of the complex sea of proteomics data he’s studying.
Compared to your experience working on the Human Genome Project, what do you see as the primary data management challenges for the HUPO plasma project?
It’s been an interesting process, particularly coming from a genomics background. I think by comparison with genomics, proteomics is in the early days. There’s just a lot of variation across the community in terms of what do you actually mean by identifying a protein? Protein identification can mean anything from a signal that a member of this protein family is there to a completely detailed post-translational map, and defining the N-terminal and C-terminal and so forth. And you could put anything in between as an identification.
There’s also so much difference in the instrument technology and experimental strategies. When you think about it, by the time the genome project really got running, Maxam-Gilbert sequencing had sort of dropped by the wayside and 99 percent of genome sequence was acquired by Sanger dideoxy sequencing. For proteomics, we have MALDI vs. electrospray, time-of-flight and different Fourier transform or ion trap instruments; a number of different strategies for removing things like albumin; a number of different fractionation strategies to simplify the mixture going into the instrument; and then strategies based on whole-protein identification vs. shotgun strategies where people are doing peptide digests up front, and then sequencing individual peptides without knowing which protein they came from. So [there’s] a really wide spectrum of experimental strategies, and not surprisingly, different groups are finding different things. One of the things that will be interesting, and we’re still early in trying to get our hands around all the data, is what the basis is for the differences that we see.
Are any of the emerging proteomics standards helping to resolve this issue?
I think the standards are also evolving. The EBI has been talking about what was originally PEDRO [Proteomic Experiment Data Repository Schema] but is now being called MIAPE [Minimum Information About a Proteomics Experiment], in parallel with MIAME, so it’s minimum annotation for protein experiments. But the reality is that out of 40-plus contributing labs, only two used it. Some of this is just that the software tools aren’t really there yet (it isn’t built into the software that the companies are shipping with instruments), but also I think it’s fair to say that [standards efforts] have been more driven by the computer scientists than the experimental biologists, and we need to get connected a little bit better.
There’s another standard actually, mzXML, that Jimmy Eng at the Institute for Systems Biology has been using. It’s really more of an internal lab data interchange tool that he’s developed, so it has its limitations as well, but for Jimmy it’s been very useful because he can take data from all of the different instruments that they have at ISB and get it into this mzXML format and then analyze it with a common software platform. He is reanalyzing some of the submissions that we’ve gotten to look at some of the software dependency issues.
There’s also a question of what are the qualities that you want in a database? Do you want it to be very highly annotated and high-quality data, so that if you find an entry there, you’re sure this is a real, bona fide protein; or would you like something that has anything that might conceivably possibly be a protein, including lots of things that probably aren’t? So getting a database to standardize on has been a little bit of a challenge. We actually made a decision as a group to use the July 2003 release of the IPI [International Protein Index] database. In fact, between July and September, something like 20 percent of the entries in that database were revised, and that created a little bit of consternation.
So one of the things we’re going to be assessing is the role of databases in identification. Unfortunately, it’s not like Blast where you do your sequence and then search the database, and you have nicely worked out statistics and you can say, ‘This is an alignment that I wouldn’t have expected to find at random.’ Those kinds of statistics and implementations are still being worked out in proteomics, so for many of the packages, it will tell you the best matching peptide, but it’s possible that that wasn’t in fact the true peptide sequence — only the best matching one. That’s part of the reason why you might want a more expansive, but noisier, database.
Is HUPO envisioning a standalone database for the plasma project, or will it be a subset of another existing protein database?
We’re actually also talking with the [HUPO] liver proteome group, and I think the EBI IPI database would like to be the reference database against which we search these kinds of projects. We would like to develop an experimental resource with more of the experimental data, and I think that’s the niche that I see Michigan filling. And then there will be other project-specific databases. It’s not clear how many of them will be public and how many will simply be, ‘This is the whatever center’s project database for proteomics, and when we get the project finished, we’ll report the data.’
We are maintaining more of the experimental data, and links to individual labs — which lab did this come from, what were the techniques used, how confident are we in this identification, and was it found in serum but not in plasma, or found in the liver but not in serum, and so forth. IPI is really the collection of [all the] proteins that a human cell has. How much they get into tissue-specific localization is an area that will need to be sorted out. I would hope that they would include post-translational modification information, but at present that’s still in its infancy.
Looking at the data that you have on hand now, do you see any trend toward a specific platform or analytical method?
In terms of numbers, the 2D-gel/LC-MS/MS kind of approach is probably the largest single category of submissions, but not the highest throughput. A number of the biggest labs actually have multiple different instruments and technologies. There are a good number of MALDI-TOF submissions. The shotgun approaches have been from the bigger labs; they’re really gearing up to do high throughput, and that’s what you need to do for that kind of an approach. We have not had a lot of MALDI-TOF-TOF, but I think those instruments have only recently been shipped.
In terms of overall data, we have over 40 labs contributing various aspects of the project, with substantial contributions from 18 labs around the world. Right now, we’re still digesting the data, so I don’t really want to make hard statements about numbers of identifications or things like that because it could change a lot. But in that first pass, one of the big labs had a large submission where there were a substantial number of entries that we just couldn’t find in the database, and that’s where we discovered that they’d actually run it against the September IPI release rather than the July release. Once you realize that what you really want to do is merge every entry that was ever in IPI, then all of a sudden you can find the identifications. So 90 percent of the battle is realizing that this is what you have to do — in order to compare things, we need a common reference database to compare them against.
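The release mismatch described above can be illustrated with a minimal sketch. It assumes each database release is available as a simple mapping from accession to entry. The accession strings, the data, and the helper names `merge_releases` and `unresolved` are hypothetical placeholders for illustration, not the project's actual pipeline or real IPI identifiers.

```python
def merge_releases(releases):
    """Union several database releases into one accession lookup.

    `releases` is a list of (release_name, {accession: entry}) pairs ordered
    oldest to newest. Later releases overwrite earlier entries, and the first
    release that contained each accession is remembered.
    """
    merged = {}
    first_seen = {}
    for name, entries in releases:
        for accession, entry in entries.items():
            merged[accession] = entry
            first_seen.setdefault(accession, name)
    return merged, first_seen


def unresolved(submitted, lookup):
    """Return the submitted accessions that are absent from the lookup."""
    return sorted(a for a in submitted if a not in lookup)


# Hypothetical accessions and entries, for illustration only.
july = {"ACC001": "protein A", "ACC002": "protein B"}
september = {"ACC001": "protein A (revised)", "ACC003": "protein C"}
submission = ["ACC001", "ACC003"]  # a lab that searched the later release

missing_vs_july = unresolved(submission, july)
merged, first_seen = merge_releases([("2003-07", july), ("2003-09", september)])
missing_vs_merged = unresolved(submission, merged)
```

Against the July mapping alone, one submitted accession cannot be found; against the merged lookup, every submitted accession resolves, and `first_seen` records which release introduced it.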
What kinds of discrepancies are you seeing among groups using similar experimental methods?
We’ve been finding, particularly for the MALDI-based labs, that within the labs, they’ve been fairly reproducible. We’re going to have a jamboree, probably over the summer, where we get everybody together to do a final data review and make sure that when we say things didn’t match, that there’s not something we’re missing.
One of the challenges has been if you identify a protein based on a limited number of peptides, one of the search engines may report isoform one and another reports isoform two, and if you look at it you might say, ‘They didn’t match,’ but if you actually look at the peptides they reported, you can say, ‘Well they didn’t report the same protein but they’re both consistent with both identifications.’ That’s what I mean when I say there is disagreement even about what constitutes an identification.
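The isoform situation described above can be sketched as a peptide-level consistency check. This is a simplified illustration under stated assumptions: the sequences, peptides, and report structure are invented, the helper names are hypothetical, and exact substring matching ignores real complications such as post-translational modifications or ambiguous residues.

```python
def consistent_proteins(peptides, sequences):
    """Return the names of proteins whose sequence contains every peptide."""
    return {
        name
        for name, seq in sequences.items()
        if all(pep in seq for pep in peptides)
    }


def reports_compatible(report_a, report_b, sequences):
    """Two reports are compatible if the combined peptide evidence is
    consistent with both reported proteins, even when the names differ."""
    peptides = set(report_a["peptides"]) | set(report_b["peptides"])
    candidates = consistent_proteins(peptides, sequences)
    return report_a["protein"] in candidates and report_b["protein"] in candidates


# Invented toy sequences: two isoforms sharing an N-terminal region.
sequences = {
    "isoform_1": "MKTAYIAKQRQISFVKSHFSRQ",
    "isoform_2": "MKTAYIAKQRQISFVKAAGGLE",
}
engine_a = {"protein": "isoform_1", "peptides": ["TAYIAK"]}
engine_b = {"protein": "isoform_2", "peptides": ["QISFVK"]}
engine_c = {"protein": "isoform_1", "peptides": ["SHFSR"]}

shared_evidence = reports_compatible(engine_a, engine_b, sequences)
distinguishing_evidence = reports_compatible(engine_c, engine_b, sequences)
```

Here `engine_a` and `engine_b` name different isoforms, but their peptides all fall in the shared region, so the reports are compatible. The peptide in `engine_c` occurs only in the first isoform, so that pair is a genuine disagreement.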
I’ve been telling the grad students here that proteomics is where genomics was in 1980. I think for proteomics, the range of experimental strategies is at least as wide as what was going on in genomics, and because proteins are chemically so much more diverse, it probably is going to require a much wider range of strategies. I think what will be interesting about this project is it is the first time we’ve taken a physical sample, divided it up, and sent it to a bunch of different labs and said, ‘What do you find?’ We didn’t expect that everyone would find precisely the same thing, and we’re trying to work out how much difference there is, and how much we can understand the basis for that.
Have you come across any particular tricks or techniques so far to help with collecting and manipulating this data?
It has been very useful to just be running a relational database and get the data into that so we can do the systematic quality assurance checks. I think that even if the submissions are in XML, at some level we’re going to reduce it to relational databases, because the arbitrary query capabilities are really useful in this kind of quality assurance testing.
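As a rough illustration of the kind of ad hoc quality assurance query a relational database makes easy, the sketch below uses Python's built-in `sqlite3` module. The table layout, column names, lab names, and accessions are all assumptions made for the example, not the schema the project actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reference (accession TEXT PRIMARY KEY);
    CREATE TABLE identification (
        lab TEXT, accession TEXT, n_peptides INTEGER
    );
    """
)
# Hypothetical reference entries and lab submissions.
conn.executemany("INSERT INTO reference VALUES (?)", [("ACC001",), ("ACC002",)])
conn.executemany(
    "INSERT INTO identification VALUES (?, ?, ?)",
    [
        ("lab_a", "ACC001", 5),
        ("lab_b", "ACC001", 1),
        ("lab_b", "ACC003", 2),
    ],
)

# Check 1: identifications whose accession is not in the reference release.
orphans = conn.execute(
    """
    SELECT i.lab, i.accession
    FROM identification AS i
    LEFT JOIN reference AS r ON r.accession = i.accession
    WHERE r.accession IS NULL
    ORDER BY i.lab, i.accession
    """
).fetchall()

# Check 2: how many distinct labs reported each accession.
lab_counts = conn.execute(
    """
    SELECT accession, COUNT(DISTINCT lab)
    FROM identification
    GROUP BY accession
    ORDER BY accession
    """
).fetchall()
conn.close()
```

Each check is a single declarative query, so a new question about the submissions can be asked without writing a new parser for the submitted files.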
It is quite interesting to see how much variation at the protein level seems to be coming from the genome. As we get into cellular signaling and areas like that, I think we’re going to be discovering that there’s at least as much going on in post-translational modifications and translocations and transport phenomena as in basic gene structure. Because if you think about regulatory systems that have been carefully studied, the [idea] of, ‘This gene makes a transcription factor that activates that gene,’ is pretty rare. Instead, you have, ‘Well, this transcription factor inhibits that transcription factor through a protein-protein interaction rather than repressing its transcription.’ So there will be an enormous amount of interesting work that comes out of this.