Senior database architect
At A Glance
Name: Eric Deutsch
Position: Senior database architect, Institute for Systems Biology, since 2001.
Background: Senior systems engineer, Microsoft, 2000-2001.
Postdoc in astrophysics, University of Washington, 1998-2000.
PhD in astrophysics, University of Washington, 1998.
At last week's Proteomics Standards Initiative workshop in San Francisco, Eric Deutsch spoke about the importance of standards in developing the Trans-Proteomic Pipeline, a set of mass spectrometry analysis tools developed at the Institute for Systems Biology. ProteoMonitor spoke to Deutsch to find out more about his background, and his work on the pipeline and other projects at ISB.
How did you get into working on proteomics databases?
I got my PhD at the University of Washington in astrophysics, of all things. I worked as a postdoc in that field for a while, and then I was looking for something different to do. At that time the Institute for Systems Biology was starting up in Seattle, and I knew one of the faculty members who was part of the startup. He encouraged me to consider coming there. It seemed like a great place, so I made the switch from astrophysics to bioinformatics at that point.
I had a fair amount of relational database experience, and they were looking for people to help them with getting databases to store and organize the large amounts of data they were starting to generate. So my role title is senior database designer. I get to work with a number of different groups at the ISB, microarray, proteomics, and all the other core facilities, designing database systems. And then I also bring those systems together so the data can be integrated and analyzed together. I work in Ruedi Aebersold's lab, but only part of the time.
So I started at ISB in 2001, but I didn't start working with the proteomics database until 2002. Microarray [analysis] was the first need that they had, and then I've worked on other modules for SNP genotyping, for flow cytometry, and other analyses as well.
Did proteomics draw a lot from the microarray databases?
Certainly mass spectrometry is a very different platform. There are connections with the microarray side of things on the sample side, and then at the genes-to-proteins side. Ultimately, what researchers want to be able to do is to take the proteins they observe and link that back to the mRNA results from their microarray experiments. So the proteomics module and the microarray database are more or less separate databases, but you link them by the proteins, genes, and samples.
What was the first database that you worked on at ISB?
I initially started working on the Systems Biology Experiment Analysis Management System, or SBEAMS. There's a microarray module and a proteomics module and a flow cytometry module and whatnot, and I started out working on the proteomics module. The module is designed to load in liquid chromatography-tandem mass spectrometry experiments. The sequence search is designed around our Trans-Proteomic Pipeline analysis chain, so you can load Peptide Prophet results and Protein Prophet results, and use the database to query across experiments.
Can you describe what the Trans-Proteomic Pipeline is?
The first step is to take specific vendor formats that come out of the mass spectrometers and convert them into one format that we can feed into the pipeline, and that's called mzXML. So we take initial MS run output from Thermo, Bruker, Waters, whatever vendor, convert it into mzXML, and then we can put it through any number of different search engines: Sequest, X!Tandem, Mascot, and others. And then the results from the search engines [are] collated into a format called pepXML.
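The conversion-and-search flow described here can be sketched in a few lines; the function names and data structures below are purely hypothetical stand-ins for illustration, not the actual converters or search engines:

```python
# Sketch of the TPP front-end data flow. Function names and the
# dictionary layout are hypothetical; real converters parse vendor
# binary files, and real search engines match spectra to sequences.

def convert_to_mzxml(vendor_file):
    """Stand-in for a vendor-format-to-mzXML converter."""
    # In reality this parses a proprietary RAW file; here we just wrap
    # the scans in a common structure resembling mzXML.
    return {"format": "mzXML", "scans": vendor_file["scans"]}

def run_search_engine(mzxml, engine):
    """Stand-in for Sequest/X!Tandem/Mascot; emits pepXML-like hits."""
    hits = [{"scan": s["id"], "engine": engine, "peptide": s["peptide"]}
            for s in mzxml["scans"]]
    return {"format": "pepXML", "hits": hits}

raw = {"vendor": "Thermo", "scans": [{"id": 1, "peptide": "ELVISK"}]}
pepxml = run_search_engine(convert_to_mzxml(raw), "X!Tandem")
print(pepxml["format"])   # pepXML
```

The point of the common formats is exactly this shape: any vendor file funnels into one mzXML structure, and any search engine's output funnels into one pepXML structure for the downstream tools.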
The next piece of software in the pipeline is called Peptide Prophet. Its purpose is to take all the putative spectrum identifications that come out of the search engines, and to develop a model to be able to discriminate the correct from incorrect identifications. It assigns a probability of being correct to each of the spectrum identifications, and also builds a model that allows you to calculate a global error rate and sensitivity rate based on whatever cutoff you choose. So you can say I'm able to tolerate a three percent error rate, or a one percent error rate, and that allows you to select cutoffs so you know what your error rate is and also what your sensitivity is.
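The relationship between the chosen probability cutoff, the global error rate, and sensitivity can be computed directly from the per-spectrum probabilities. A minimal sketch (not the actual Peptide Prophet code): among accepted identifications, the expected number of incorrect ones is the sum of (1 - p).

```python
# Estimate global error rate and sensitivity at a probability cutoff,
# given PeptideProphet-style per-spectrum probabilities of correctness.

def error_and_sensitivity(probs, cutoff):
    accepted = [p for p in probs if p >= cutoff]
    if not accepted:
        return 0.0, 0.0
    # Expected number of false identifications among those accepted.
    expected_false = sum(1.0 - p for p in accepted)
    error_rate = expected_false / len(accepted)
    # Sensitivity: expected correct IDs accepted / expected correct IDs overall.
    sensitivity = sum(accepted) / sum(probs)
    return error_rate, sensitivity

probs = [0.99, 0.95, 0.90, 0.50, 0.10, 0.05]
err, sens = error_and_sensitivity(probs, 0.9)
print(round(err, 3), round(sens, 3))   # 0.053 0.814
```

Lowering the cutoff raises sensitivity at the cost of a higher error rate, which is exactly the trade-off being described.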
The next step is sometimes quantification. So if you're doing a quantitative experiment using ICAT or iTRAQ or any of the other labeling methods, there are some pieces of software like Express and ASAP Ratio and Libra [that] allow you to quantify relative protein expression from your samples. So that's an intermediate step if you have that kind of experiment.
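The quantification step can be illustrated with a toy calculation. This is a simplification: tools like XPRESS, ASAPRatio, and Libra integrate chromatographic peaks and model measurement error, whereas the sketch below just averages per-peptide light/heavy ratios.

```python
# Toy relative quantification for an ICAT/iTRAQ-style experiment:
# combine per-peptide light/heavy peak areas into a protein-level ratio.

def protein_ratio(peptide_areas):
    """peptide_areas: list of (light_area, heavy_area) pairs."""
    ratios = [light / heavy for light, heavy in peptide_areas if heavy > 0]
    return sum(ratios) / len(ratios)   # simple mean of peptide ratios

areas = [(2000.0, 1000.0), (1500.0, 1000.0), (2500.0, 1000.0)]
print(round(protein_ratio(areas), 2))   # 2.0
```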
And then the next step is Protein Prophet, which takes all of the individual peptide identifications and does the protein inference. It builds a list of proteins that you identified in your sample based on the peptide data. In human and mouse and other eukaryotes, it's sometimes ambiguous: you have peptides that tend to map to multiple proteins, so it's kind of a difficult problem. At the Peptide Prophet level, each spectrum is scored individually, but what you find is that if you have multiple peptides that map to one protein, you have higher confidence that you've identified those peptides correctly, and that you've identified that protein, and so you can adjust the probabilities of identification.
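The protein inference problem described here can be sketched as follows. This is a toy version only: Protein Prophet adjusts probabilities statistically, while the sketch just groups peptides by protein and flags shared (degenerate) peptides, keeping proteins with at least one unique peptide.

```python
# Toy protein inference: peptides shared by several proteins are
# ambiguous; proteins with at least one uniquely mapping peptide are
# kept as confidently identified.

def infer_proteins(peptide_to_proteins):
    proteins = {}
    for peptide, prots in peptide_to_proteins.items():
        shared = len(prots) > 1
        for prot in prots:
            proteins.setdefault(prot, {"peptides": [], "unique": 0})
            proteins[prot]["peptides"].append(peptide)
            if not shared:
                proteins[prot]["unique"] += 1
    # Keep only proteins supported by at least one unique peptide.
    return {p: info for p, info in proteins.items() if info["unique"] > 0}

mapping = {"PEPTIDEA": ["P1"],
           "PEPTIDEB": ["P1", "P2"],
           "PEPTIDEC": ["P2", "P3"]}
print(sorted(infer_proteins(mapping)))   # ['P1']
```

Here P2 and P3 are supported only by shared peptides, so the evidence cannot distinguish them; that is the ambiguity the interview describes.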
That gives you the final list of proteins, and a list of the constituent peptides. And where you go from there depends on the individual experiment. We have several ways to analyze things. One way is you can load it into this SBEAMS proteomics database, and use some of the querying and experiment functionality that's available there. We also take the results of these many experiments and load them into the Peptide Atlas.
The Peptide Atlas project is a way to take the results of many different experiments done for many different reasons, and compile a master list of all the peptides, and therefore proteins that have been observed in all of these experiments, and then we map them to the genome. The idea is that if you're analyzing a new experiment, and you've identified some peptides or proteins, you can take a look in the Peptide Atlas and see if that peptide or that protein has been observed before in other mass spectrometry experiments, and in what kind of samples they had been observed.
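The compilation step described here, merging many experiments into one master table of observed peptides, can be sketched with a simple aggregation (illustrative data structure only, not the actual Peptide Atlas schema):

```python
# Merge peptide lists from many experiments into one master table
# recording how often, and in which sample types, each peptide was seen.

from collections import defaultdict

def build_atlas(experiments):
    """experiments: list of (sample_type, [peptide sequences]) pairs."""
    atlas = defaultdict(lambda: {"observations": 0, "samples": set()})
    for sample_type, peptides in experiments:
        for pep in peptides:
            atlas[pep]["observations"] += 1
            atlas[pep]["samples"].add(sample_type)
    return atlas

atlas = build_atlas([("serum", ["AAK", "LLR"]),
                     ("liver", ["AAK"])])
print(atlas["AAK"]["observations"])     # 2
print(sorted(atlas["AAK"]["samples"]))  # ['liver', 'serum']
```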
And we're planning to soon add functionalities so you can look at the spectra. So you can actually compare your spectrum to spectra that have been taken before on possibly different instruments of that particular peptide. That might give you more confidence that you've identified something correctly, or not.
And then an additional goal: the Peptide Atlas now contains a list of peptides that you know you can observe in a mass spectrometer, and you know to how many loci in a genome they match, or to how many proteins they map, so you can use the atlas to select individual peptides for proteins of interest, and target those specifically. So if you're only interested in kinases, or something like that, for a particular experiment, instead of doing a broad, shotgun experiment in which you sequence everything, you might choose to only go and look explicitly for peptides that you know you can observe. Those peptides have been observed before many times, you know what their spectrum looks like, and you know that they map to only one protein, so there are no ambiguities, and that way you spend less time sequencing things that you don't care about.
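The target-selection logic described above amounts to a filter over an atlas-like table: keep peptides that map to exactly one protein and have been observed often enough to be reliable. A minimal sketch, with a made-up table layout and threshold:

```python
# Select target peptides for a protein of interest: unambiguous mapping
# (exactly one protein) and frequently observed in past experiments.
# The atlas layout and min_observations threshold are illustrative.

def select_targets(atlas, protein_of_interest, min_observations=5):
    return [pep for pep, info in atlas.items()
            if info["proteins"] == [protein_of_interest]
            and info["observations"] >= min_observations]

atlas = {
    "KINASEPEPK": {"proteins": ["KIN1"],         "observations": 12},
    "SHAREDPEPK": {"proteins": ["KIN1", "KIN2"], "observations": 30},
    "RAREPEPK":   {"proteins": ["KIN1"],         "observations": 2},
}
print(select_targets(atlas, "KIN1"))   # ['KINASEPEPK']
```

The shared peptide is excluded despite many observations, and the rarely seen peptide is excluded despite mapping uniquely; only well-characterized, unambiguous peptides make useful targets.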
How was the Trans-Proteomic Pipeline established?
It started as a set of individual tools. So Peptide Prophet, Protein Prophet, ASAP Ratio just started off as independent tools, and initially they didn't even share common data formats, so it was a little difficult to get data from one into the other. And then it became clear that there's really a chain of analysis that people want to go through. So this is mostly the work of Andy Keller, though other people in the Aebersold lab have been involved. They took those tools and made sure that the output of one could go into the input of the next one, strung them together, and the Trans-Proteomic Pipeline was a name that Andy Keller came up with.
How are the Proteomic Standards Initiative's official standards relevant to the Trans-Proteomic Pipeline?
I think PSI's main goal is to provide both descriptions of what should be a minimum amount of information to publish and make data sets available, as well as to provide a standard format that can encode all the information that you should provide when you post a dataset on a repository.
The formats that I just talked about, mzXML, pepXML, and protXML, were really developed not so much as community-wide standards that everyone could publish their data in, but rather as a communication mechanism to make our pipeline work. And that starts with a vendor-independent format called mzXML, which is similar to what PSI has developed, called mzData. That's really where the greatest overlap at this point is, and it's certainly a goal, I think of everyone, to unify those formats.
The goals and the requirements we have for mzData and mzXML are somewhat different. Ideally, we can merge those into a single standard so the vendors don't have to worry about supporting two different methods. But it is true that there are some different requirements that we have that the PSI does not think are critical for [its] goals.
What are you working on for the future at ISB?
There are a lot of different software projects going on in the Aebersold lab at the ISB. There is new quantitation software being developed to be integrated into the pipeline. One of the postdocs there is working on a spectrum library search tool.
Historically, search engines like Sequest and Mascot and X!Tandem search the sequence database to try to identify the spectra from the mass spectrometer. In doing that, you compile a very large list of peptides that you've seen before, and you can then compile an MS/MS spectrum library, a library of all the individual spectra that you've seen in past experiments, and then you can build a tool that will compare the spectrum that comes out of a mass spectrometer to a library of spectra that you've seen before. And that turns out to be much faster than trying to scan through a protein list, to generate synthetic spectra, and then to compare your synthetic spectrum to the actual spectrum from the mass spectrometer. Instead you just compare it to spectra that you've seen before.
That method is much faster, and in many ways more reliable because the synthetic spectra that are generated by tools like Sequest typically don't have much intensity information, and it's true that the intensity shape of a theoretical spectrum often looks little like the intensity shape of a natural spectrum. And so most of the search engines just rely on peak positions, rather than peak intensities, while if you're doing spectrum matching, you can rely on both positions and intensities, and get much more accurate matches.
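Matching on both peak positions and intensities, as described here, is commonly done with a normalized dot product over binned m/z values. A minimal sketch (one common similarity measure, not necessarily the one used by the ISB tool):

```python
# Compare an observed spectrum to a library spectrum using both peak
# position and intensity: bin peaks by m/z, then take the normalized
# dot product of the two intensity vectors (1.0 = identical shape).

import math

def dot_score(spec_a, spec_b, bin_width=1.0):
    """spec_a, spec_b: lists of (mz, intensity) peaks."""
    def binned(spec):
        bins = {}
        for mz, intensity in spec:
            key = int(mz / bin_width)
            bins[key] = bins.get(key, 0.0) + intensity
        return bins
    a, b = binned(spec_a), binned(spec_b)
    num = sum(a[k] * b.get(k, 0.0) for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return num / norm if norm else 0.0

lib = [(175.1, 100.0), (304.2, 60.0), (401.3, 30.0)]
print(round(dot_score(lib, lib), 3))   # 1.0 (identical spectra)
```

Because the score weights each matched peak by its intensity, a library match rewards the characteristic intensity shape of the real spectrum, which a theoretical spectrum from a sequence search cannot provide.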
That's another project that's related to the Peptide Atlas. There's a tool called QualScore that was recently published as well. It's kind of a known fact that you typically identify in the range of between 10 and 20 percent of the tandem mass spectra that you take. The other 80 percent are often not identified, and many of those are just low-quality spectra that you can't identify because the signal to noise is too low, or whatnot. But there is a population of high-quality spectra there that you're just not able to identify because the sequence that you're looking for is not in your reference database.
So what this QualScore program does is it takes the population of all the spectra you have identified with high confidence, and a set of spectra that you were not able to identify at all, and it treats them as good and bad, and then tries to learn from those two training sets what all the good spectra in your dataset are, based on various metrics like estimated signal to noise, the number of peaks you have, the ratio of strong peaks to faint peaks, and so on and so forth. It tries to estimate, or calculate, which are the high-quality ones and which are the low-quality ones, and then you can extract a list of high-quality, unassigned spectra, and then go after those in more detail to look for interesting post-translational modifications, or sequences that may not be in your protein list, but may exist in the genome, for example.
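The two-training-set idea can be sketched with a deliberately crude classifier. The published QualScore tool uses a properly trained model; the sketch below just learns a per-feature threshold from the midpoint of the class means, with made-up feature names:

```python
# Toy version of the QualScore approach: confidently identified spectra
# are "good" examples, unidentifiable ones are "bad". Learn a midpoint
# threshold per feature, then flag high-quality but unassigned spectra.
# Feature names ('snr', 'npeaks') and the midpoint rule are illustrative.

def train_midpoints(good, bad):
    """good, bad: lists of feature dicts with identical keys."""
    keys = good[0].keys()
    mean = lambda rows, k: sum(r[k] for r in rows) / len(rows)
    return {k: (mean(good, k) + mean(bad, k)) / 2 for k in keys}

def is_high_quality(spectrum, midpoints):
    # A spectrum is "good" if it beats the midpoint on most features.
    votes = sum(1 for k, m in midpoints.items() if spectrum[k] > m)
    return votes > len(midpoints) / 2

good = [{"snr": 12.0, "npeaks": 80}, {"snr": 10.0, "npeaks": 70}]
bad = [{"snr": 2.0, "npeaks": 20}, {"snr": 3.0, "npeaks": 25}]
mid = train_midpoints(good, bad)
print(is_high_quality({"snr": 9.0, "npeaks": 60}, mid))   # True
```

Spectra that score "good" but received no identification are the interesting residue: candidates for modified peptides or sequences missing from the reference database.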
The program is available for download right now. If you just go to www.proteomecenter.org, there's a software link right there, and there's a link to all of the different software tools that we've developed.
Do you have any other comments on the importance of reporting standards in relation to your work?
I think publication standards that a large group of people can agree on are very important. I've been involved in the past in other standards for the microarray community. I've recently been working on a standard for gene expression localization experiments; it's called MISFISHIE [Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments]. It's hosted on the Microarray Gene Expression Data society web pages. I think minimum information standards for experiments, like MIAME and MIAPE, are very important. I think one of the important things to remember is not to make them too exhaustive, and not to include too many requirements in them. The main goal of these things is to encourage authors to provide enough information for people to be able to understand the experiment, and to reproduce the experiment. It's important not to go overboard and make too many requirements.