Name: Manor Askenazi
Position: Founder, the Ionomix Initiative
Background: Project leader in Mass/Bioinformatics, Thermo Electron; lead bioinformatics engineer, Blais Proteomics Center, Dana-Farber Cancer Institute
Bioinformatician Manor Askenazi has been working with proteomics data for more than a decade. He began at biotech firm Microbia, then moved to Thermo Fisher Scientific's [then Thermo Electron's] Biomarker Research Initiatives in Mass Spectrometry center, where he helped design and build the company's Sieve software for label-free, mass spec-based protein biomarker discovery.
After that he moved to the Blais Proteomics Center at the Dana-Farber Cancer Institute, where he worked as lead bioinformatics engineer under center director Jarrod Marto.
Recently, he struck out on his own, forming the consulting firm Ionomix and developing two new bioinformatics tools – Slice and OncoSlice – aimed at improving access to proteomics data repositories.
At this week's American Society for Mass Spectrometry annual meeting in Vancouver, Askenazi participated with several other proteomics bioinformaticians in a panel discussion on the state of mass spec data repositories, currently a key issue for the field (PM 5/4/2012).
He spoke afterwards to ProteoMonitor about the questions and challenges facing such repositories and how his Slice and OncoSlice tools might fit into proteomics research.
Below is an edited version of the interview.
Could you describe the Slice and OncoSlice tools? What's their current status?
Slice and OncoSlice are currently online. They are tech demos, but they are trying to achieve something that is a challenge. Making [raw mass spec] data available in its totality is straightforward – you just download everything. But making a subset of the data available, dynamically calculated based on a request, is difficult. So this is a tech demo of an indexing scheme [that makes such data availability possible].
So, if you have an idea about a mutation or something, and you translate that into a mass, you can ask, [for instance], where was the signal at that mass in a large number of files, and get just that aspect of the data. And there are a lot more queries you could make. For instance: Here's a spectrum that I have; I want to look at spectra like it. We have ways to look at a spectrum and ask if we've seen it before, but we need ways to say, "Did I find something that is remarkably similar but not identical?"
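The indexing scheme described above – asking where a given mass shows up across a large number of files without downloading them – could be sketched in Python roughly as follows. The binning approach, the bin width, and all names here are illustrative assumptions, not Slice's actual implementation:

```python
from collections import defaultdict

BIN_WIDTH = 0.01  # Da; an assumed bin width for this toy index

def mass_bin(mz: float) -> int:
    # Map an m/z value to an integer bin.
    return int(mz / BIN_WIDTH)

def build_index(files: dict) -> dict:
    # files maps a file name to the m/z values observed in it.
    # The result maps each mass bin to the set of files with signal there.
    index = defaultdict(set)
    for name, mzs in files.items():
        for mz in mzs:
            index[mass_bin(mz)].add(name)
    return index

def files_with_mass(index: dict, mz: float, tol: float = 0.01) -> set:
    # Answer "which files have signal near this mass?" from the index
    # alone, without touching any raw file.
    hits = set()
    for b in range(mass_bin(mz - tol), mass_bin(mz + tol) + 1):
        hits |= index.get(b, set())
    return hits
```

The point of the sketch is the division of labor: the expensive pass over raw files happens once, at indexing time, and every later mass query touches only the index.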
So the platform is intended to let people more easily access and query raw mass spec data sets?
It's a platform for data accessibility. You should be able to design all sorts of applications on top of it. One of the things it does is it makes the data available as a URL. Whether it's a reconstructed ion chromatogram or an individual spectrum or a spectrum overlaid with an explanation – all of those are [presented as] just URLs. So all you have to do to access the data is issue HTTP queries, and that means that the person writing the software [to perform the various queries] doesn't need to know any C++ or any of the mass spec data systems. Anybody who has written web pages that do Ajax or anything like that can write applications against this kind of data repository.
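As a rough illustration of the "everything is a URL" idea, a client might construct queries like the following. The host, paths, and parameter names are made up for illustration; they are not Slice's real endpoints:

```python
from urllib.parse import urlencode

BASE = "https://slice.example.org"  # hypothetical server, for illustration only

def spectrum_url(dataset: str, run: str, scan: int) -> str:
    # URL for an individual spectrum in one run.
    return f"{BASE}/{dataset}/{run}/scan/{scan}"

def xic_url(dataset: str, run: str, mz: float, tol: float) -> str:
    # URL for a reconstructed (extracted) ion chromatogram at a mass.
    query = urlencode({"mz": f"{mz:.4f}", "tol": f"{tol:.4f}"})
    return f"{BASE}/{dataset}/{run}/xic?{query}"
```

Any client that can issue HTTP GET requests – a browser, an Ajax page, a script – could then fetch such URLs, which is the sense in which no C++ or vendor data-system knowledge is needed on the application side.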
What are the origins of the system?
It builds on developments at Jarrod Marto's lab at the Blais Proteomics Center at Dana-Farber. Step one [there] was wrapping the DLLs from the vendors that allow opening of the raw files with something that is easier to program, and in our case that was the Python language. So we made the various vendor APIs uniform in the Python language. The next step was the mzServer, so that when you make a URL query, a server receives it and translates it into Python, which then translates it into the vendor format and runs the query. So now you're exposing this [raw data] through a browser.
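A minimal sketch of what "making the vendor APIs uniform in Python" might look like: one abstract reader interface, with one adapter per vendor format behind it. The class and method names are illustrative assumptions; a real adapter would wrap a vendor DLL rather than an in-memory list:

```python
from abc import ABC, abstractmethod

class RawFileReader(ABC):
    # The uniform interface an mzServer-style layer could program against.

    @abstractmethod
    def scan_count(self) -> int: ...

    @abstractmethod
    def spectrum(self, scan: int) -> list:
        """Return (m/z, intensity) pairs for one scan."""

class InMemoryReader(RawFileReader):
    # Stand-in adapter; a real one would call a vendor DLL (e.g. via ctypes).

    def __init__(self, scans):
        self._scans = scans

    def scan_count(self) -> int:
        return len(self._scans)

    def spectrum(self, scan: int) -> list:
        return self._scans[scan - 1]  # scan numbers are conventionally 1-based
```

Once every vendor format sits behind the same interface, a server layer can translate an incoming URL query into these calls without caring which instrument produced the file.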
How could easier access to raw data affect proteomics research?
We as a field have less mindshare. If you are a computational person and you want to do something with life sciences, you go to [the Gene Expression Omnibus] and you have thousands of data sets to play with. If all you need is a browser to start looking at [mass spec] datasets – there's still a learning curve, but I think it might lower the barrier to entry for people who might want to take a look at this, and I think we might get more insights this way.
I'm interested in mindshare for the topic, and I'm particularly interested in eyes on data coming from tumors. For example, I'm thinking of putting together a bounty system around a site like OncoSlice where, if you find a mutation [in a dataset], you e-mail it to the site that is running the bounty project and you get a prize. And if over a certain period you got the most of those, you would get the remainder of the bounty.
What I'm trying to say is that one of the unexplored benefits of having a repository is that it's also a place around which you can create events and competitions and things. A lot of times in the discussion [about mass spec data repositories] people say, "Oh, this isn't sexy but we need to fund it." There's a notion that repositories are boring. But not only does it not have to be boring, it's a potential avenue for getting more people looking at this stuff, because you can say: You go here; you can explore the data or write programs that explore the data; and if you find something there could be a competition mechanism or even a prize mechanism. I would love to set up a prize for finding mutations in samples, especially [very valuable] samples like primary tumors. So I'm looking especially with journals to find some sort of venue where I could announce something like that and also maybe get a sponsor so I could increase the value of the prize.
So by combining repositories with a higher ease of access and more extensive querying ability you think you can encourage more analysis of data, more eyes on given datasets?
Yes, exactly. A repository doesn't have to be boring. Once you have a repository and every spectrum has a URL, then you can pass them around more easily, because you can then say, "You know that spectrum? This is what I think it is. What do you think it is?" And you can e-mail it around easily, text someone a link, and it makes the discussion more fluid.
It would be a lot harder to say, "If you download this [entire] dataset, I think I've seen something interesting in it." I don't want to minimize the fact that even when you have the ability to slice through the data [in this way], the primary way you're going to look at the data is to run it through a pipeline and learn what's in there on a large scale. But then for reanalysis, [a system like Slice] is very useful.
The other thing is, this is just a tech demo. I'm personally not nearly done. I want to add capabilities for other more intelligent queries. So the more intelligent this thing becomes at querying, the less coding is needed to make intelligent tools that live on top of it.
Would increased access mean an increase in users for raw mass spec data repositories and therefore better funding chances?
Yes, I think so. There are two things. One is that when we make repositories available, they should have the ability to serve up slices of information in a very targeted way, rather than just allowing everything to be downloaded, and those slices should be not just scans but also [extracted ion chromatograms] – quant slices. Because it's relatively straightforward to make it so that if you want an individual scan you can get the scan. It's more complicated to be able to say, "I want to know about this precursor. I want to know how intense it is over time."
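The "quant slice" just described – how intense a precursor is over time – amounts to an extracted ion chromatogram. A toy sketch, with assumed data shapes and names, would sum the intensity within a mass tolerance in each scan:

```python
def extracted_ion_chromatogram(scans, mz, tol):
    # scans: list of (retention_time, [(m/z, intensity), ...]) per scan.
    # Returns (retention_time, summed_intensity) pairs: the signal at the
    # requested precursor mass traced over time.
    xic = []
    for rt, peaks in scans:
        total = sum(i for m, i in peaks if abs(m - mz) <= tol)
        xic.append((rt, total))
    return xic
```

Serving this from a repository is harder than serving a single scan because it requires reading across every scan in a run, which is exactly why an index or server-side support helps.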
The other thing is an observation: There are fields where it isn't possible to just move the data anywhere because of the amount of it, so you have to use some of your grant to enable in situ analysis, effectively. That's exactly the case in astronomy. That analogy came from Marto. I always thought that any large, data-rich topic was a good analogy to what we're running into, but he specifically pointed to astronomy. And he's right in that they ended up having proposals like the WorldWide Telescope, which recognizes that you have to bring computation to the data and not the data in bulk to people who are curious about it. Astronomy is one of the fields that has realized this the most, and we're going to need something like that. We're going to need repositories that are not just a parking lot where you can get stuff, but that are actively supporting the queries.
I think you saw in the discussion, some people were saying, "But at a bare minimum we need a parking lot." And that doesn't have to be fancy. It can just be an FTP site. Because there is a scientific emergency aspect to all this – we're going to lose information and then we're doing hearsay, not science – like, "I once saw this peak," you know? So there's an emergency aspect to what's going on, but this is also an opportunity to rethink what these systems should provide. And [Slice] is perhaps an extreme on the scale of how much the repository should support the user, because we're saying the data system should be smart enough to answer very specific queries, like, "Give me the files that have an intense signal at this mass." That's much more than just: "There was an experiment; it was done; here are the files; here are the peptides that were ID'ed."
Are other groups also working on this problem of improving repository interfaces to make the data easier to access and manipulate?
Different people emphasize different aspects. I have a particular interest in designing interfaces that are easy to use, but it's not obvious that's the first thing you should spend resources on when you are setting up a repository. For instance, because I'm emphasizing the ability to access the raw data in sort of a fluid fashion, I haven't been worrying about what is a good general schema for keeping the metadata around. Different people who approach this problem are working on different aspects.
How integrated are these efforts? Are the various developers working across the range of repository issues talking to each other?
I think the workshop was a start – having a discussion saying, "What can we do?" We obviously all meet and could potentially put the various pieces together. It would be great if a repository came about that used an indexing mechanism like Slice to allow arbitrary queries but also had metadata capabilities – if there were a mix of these things.
How much data does Slice enable access to right now?
It's a tech demo, so I took the human build from PeptideAtlas, which contains about 2,000 accurate-mass raw files. And then OncoSlice uses tumor data that Cell Signaling Technology generated for a publication several years ago and recently made available.
So what is a likely path forward for the tool? Do you hope to get large repositories to start indexing files so that they can be accessed by Slice, or are you aiming to get funding to store more and more data on servers of your own?
If [large repositories] get requests to access data in this fashion, they'll do it. In terms of my funding the future development of this, it's a matter of whether there is a repository that wants to transfer the technology or whether there are just large research sites [that] have a lot of data [and] want to access it this way. [In the latter case] I'll just develop it because there is individual demand.
So you could do this on just a site-by-site basis if people were interested?
Yes. I have gotten requests. I'm just trying to figure out the mechanism that maximizes the chance of getting a good-quality product. If a government-scale site said they wanted to fund the ability to do this, that's the maximum chance of the technology being able to [expand to] include new kinds of queries and all that sort of thing. If it's individual sites that want it, you would naturally have to pursue the particular features that make sense for that site.
The simplest path forward is to have datasets that people want to experiment with loaded onto a subdomain of Slice. So, give me the data; I'll put it up; I'll host it; and that way you get a sense of what additional abilities it gives you over what you already have. And then I'm also looking to see if there are sites where people want an instance [of the tool] running entirely internally just as a local server. I'm not experienced at dealing with grants from large government entities, but I have friends that are, and so we're working on that, as well.