Name: Jayson Falkner
Position: PhD candidate in bioinformatics, University of Michigan, 2004 to present
Background: PhD candidate doing work in Philip Andrews' laboratory, focusing on proteomics, including open-source data and software collaboration, mass spectrometry, and peptide chemistry; BS in information technology at the University of Miami, 2003.
The lack of raw proteomics data has been cited as one of the most pressing problems in proteomics research. Without such data, researchers have not been able to reproduce and independently validate experiments from fellow researchers.
Recently, Nature Biotechnology joined a growing list of journals and funding agencies that are asking researchers for raw data from their proteomics experiments.
A year ago, Jayson Falkner, a PhD student at the University of Michigan, set out to address the issue by creating the Tranche project, a database into which researchers can deposit raw data.
Tranche uses "peer-to-peer concepts mixed with modern encryption to make a secure distributed file system that is well-suited for proteomics research data and independent of any particular centralized authority," according to Tranche's website.
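Tranche's actual protocol is not spelled out here, but the general idea behind a hash-addressed peer-to-peer file system can be sketched: files are split into chunks, each chunk is encrypted, and the chunk is stored under the hash of its ciphertext, so any peer holding a chunk with a matching hash can serve it and no central authority is required. The sketch below is an illustration of that pattern only; the chunk size is arbitrary and the XOR "cipher" is a toy stand-in for the real encryption Tranche would use.

```python
import hashlib
import os

CHUNK_SIZE = 1024 * 1024  # illustrative chunk size, not Tranche's actual value


def xor_keystream(data: bytes, key: bytes) -> bytes:
    """Toy stand-in for a real cipher; a real system would use e.g. AES."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def store(data: bytes, key: bytes, storage: dict) -> list:
    """Encrypt each chunk and file it under the SHA-256 hash of its ciphertext.

    The returned hash list acts as the file's address: any peer that can
    produce a chunk matching one of these hashes can serve that chunk,
    so retrieval needs no central server."""
    hashes = []
    for off in range(0, len(data), CHUNK_SIZE):
        ciphertext = xor_keystream(data[off:off + CHUNK_SIZE], key)
        digest = hashlib.sha256(ciphertext).hexdigest()
        storage[digest] = ciphertext
        hashes.append(digest)
    return hashes


def retrieve(hashes: list, key: bytes, storage: dict) -> bytes:
    """Fetch, verify, and decrypt each chunk, then reassemble the file."""
    out = bytearray()
    for digest in hashes:
        ciphertext = storage[digest]
        # Content addressing gives integrity for free: the hash is the address.
        assert hashlib.sha256(ciphertext).hexdigest() == digest
        out += xor_keystream(ciphertext, key)
    return bytes(out)
```

A design note on why this suits large proteomics datasets: because chunks are addressed by content rather than location, a multi-gigabyte raw-data file can be mirrored and downloaded from many peers at once, and a corrupted or tampered chunk is detected automatically when its hash fails to match.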
The Tranche project is now part of ProteomeCommons, a free and open-source public repository for digital content related to proteomics. Falkner, Philip Andrews, and Peter Ulintz, all affiliated with the University of Michigan, created ProteomeCommons.
Since going online, the Tranche project has amassed about 350 datasets, or about one terabyte of data. An additional two terabytes of data have yet to be annotated and put online. In total, Tranche has about 63 terabytes of storage space.
Below is an edited version of a conversation ProteoMonitor had recently with Falkner.
Describe your Tranche project.
The goal is to make it as easy as possible for people to share their scientific data, and specifically to share their proteomics data. Technically, the system's designed to handle a virtually unlimited amount of data and users. The intention is that people should be able to use it for free, basically.
The motivations behind it are the journal requirements. Most journals are asking for people to publish their data along with their articles … so if I see a study and I see search results, I can repeat them. And not only can you repeat them, but possibly improve on them, or find different answers, or compare them against different studies.
We’re doing that ourselves [in our lab], and we were frustrated that we could not easily access datasets we wanted to get ahold of that were supposedly public. And not only that, but we had to share our own datasets. The Tranche [was] our lab’s solution to doing that, and then we decided to go ahead and just put it up on ProteomeCommons and make it more of a resource for all data for proteomics that we could get online.
Why are researchers so hesitant to make their raw data publicly available?
There seem to be two big reasons. One is just that previously people said that it was too hard. It was just too much data. We’re talking about peak lists and search result files [that] are in the megabytes of size. So if you actually go back to the machine and take the raw data, you’re looking in the gigabyte range. And for the larger studies, it’s several hundred gigabytes worth of data.
Often it's just not practical for people to share that much data. They did not have an IT department, or they couldn't figure out a way to move that much data across an Internet connection.
The other reason has nothing to do with the logistics of sharing the data: a lot of people are concerned that if they share their data, someone might possibly scoop them. This comes up all the time with biomarkers, where most people [believe] it's fair that they should share all the data they're publishing. So if they've interpreted some results, found some peptide [or] protein identifications, they should have to share the data that helped them support those claims.
But the data that wasn’t necessarily identified, there’s a bit of a gray area there. Some people believe you should share it because you already had your first go at it, and other people might be able to infer more information. But other people think that until you actually use it, you shouldn’t have to publish it. What if it contains your biomarker, what if it contains some critical knowledge that someone else can scoop from you?
A lot of these research labs, their funding comes primarily from the research they're doing. A lot of the time, being able to actually generate the data is something unique, so a lot of labs are trying to protect that while still playing fair with everybody else by sharing as much data as they can.
What needs to be done to change that?
The first big step is [for] the journals to start saying, 'Look, we're going to have these guidelines coming out.' They didn't put a hard date on it, but the message was clear that more data needs to be shared.
In addition to that, the funding sources like the [National Cancer Institute] and the [National Institutes of Health] made it clear that they want the data shared. But they’ve yet to really enforce it strictly so that people who are funded by them have to share data in a specific way.
Those are the two big things that would really help get the data out there. But also, just having a tool that makes it easy enough to share the data is a huge social component. Such a tool was more or less lacking, at least in the field of proteomics.
These journals and the NCI and NIH, they’ve just recommended making the data publicly available. They’re not requiring it.
At the moment, as far as I know, that’s correct. All of them have [provided] nothing but recommendations.
Do you get any sense that they’re moving toward making it a hard rule?
Absolutely. I think the problem is you can’t just make one rule that’s going to apply to every case. [But] most everyone, at least in casual conversation, really wants to see some sort of framework or stricter guidelines [that say] ‘If you want to publish data, here’s what you have to put online with it.’
Figuring out exactly how that can apply to the many different research labs and research situations is something that still needs to be fleshed out.
What has been the reaction from the research community to the Tranche project?
I guess I'm biased, but we've seen tremendous success. So far, we've had no problems working with the other websites that have hosted significant amounts of proteomics data, such as Peptide Atlas, the Open Proteomics Database, and the Global Proteome Machine Database, so it's been great to have support from those people who are handling the more practical situation of what to do with existing data.
We’ve also had tons of researchers just start uploading data. Several organizations [such as] the NCI now are starting to use us for their mouse model studies. The ABRF group has been using us for some of their studies to hold all of their raw data.
How does your project fit in with other efforts such as the Peptide Atlas and the European Bioinformatics Institute’s PRIDE?
Before Tranche, there really wasn't a site dedicated to sharing as much of the raw data in proteomics as possible, broadly meaning everything down to the data coming directly off the machine.
The Peptide Atlas is sponsored by the [Institute for Systems Biology] primarily, and they work on the Trans Proteomic Pipeline, which is a very popular software package for analyzing mass spec data, and they were hosting data because no one else really was.
Their intention was to get the data so they could analyze it, put it through TPP, and then share the results with the community, so people could see how the TPP works and see kind of a third-party data analysis. And their intention was also to help support their Peptide Atlas database.
So they’re not really in the business of wanting to share data, but they kind of filled the need for lack of something else [being available]. So how Tranche fits in with them is we’re now taking on a large part of the responsibility of actively collecting datasets and making sure people can put large datasets, especially very large datasets, in the Tranche, and then Peptide Atlas can access them as they please. So when they want to download the data and process it for TPP, we’re there and feeding them data. They don’t have to worry about that part of the analysis.
The EBI is a slightly different story. The EBI system, PRIDE, is much more of a … I won't say a LIMS [laboratory information management system], but the intention is to take search results and annotations from publications and put them online and make them easily accessible and organized in a logical manner.
Normally, data going into PRIDE does not include the raw data at all. It includes just inferred identifications and sometimes the peak lists. That's because PRIDE and the EBI have been great supporters of these proteomic standards, the [Human Proteome Organization's] PSI initiatives. So PRIDE's probably the best reference implementation that you can find of the standards being proposed by the PSI, and the best chance of actually seeing them in action and working.
If you look at those standards, they're very verbose, and they're not targeted to shuffling around big globs of raw data. A short way to say it is that PRIDE's a nice system that does not handle raw data.
Tranche is a nice system that handles lots of raw data and makes it easily accessible to systems that don’t want to handle raw data.
What’s the next step in the evolution of Tranche?
First, we wanted to get it out and working, and then show people it works, and then obviously publish a paper, so we can get some credit for our work and for the sponsorship from the National Center for Research Resources. After that, it's a full-blown open-source project, just like the rest that we've put up on ProteomeCommons.org.
At that point in time, which is right about now, our goal is to build as much of a community around it as we can and to get as many coders both from our groups and from the other groups actually developing and using the code base, so that we move from a system where there’s one grad student or one person in the lab supporting and developing the project to a state where there are many people that can maintain and support the project. And then the project’s also open to anyone else in the community.
Do you have any idea how people downloading the data are using it?
I can’t quite say what everyone is using it for, but I know several people are using it to re-analyze larger datasets. Some good examples are some current biomarker-type studies … particularly the larger ones like what ABRF is doing and [the NCI’s Clinical Proteomic Technology Assessment for Cancer] is doing.
I know people want to have a fair estimate of what others have seen. What they're doing is grabbing as much raw human serum and plasma data off Tranche as they can and re-analyzing it using their own pipelines, comparing their pipelines against what the published results were and seeing if they've improved things, or seeing if things have changed, and then also using that to establish a baseline for their future studies.