Whether we're working at the bench or in front of a computer, most of us have experienced the twists and turns of research, not knowing the best way to address a research question until we've tried answering it with a variety of methods and techniques. After that circuitous path, however, we need to be able to tell others how we arrived at our conclusions in a complete but concise manner, probably not including all the detours and wrong turns. We should be able to reconstruct our steps at the bench with our trusty lab notebook, and detailing our final analysis at the computer seems like it should be much easier.
Nevertheless, documenting our computational pipelines and sharing them effectively with others when the work is published is generally not trivial. At the other end, we've all read papers where it's not very clear how the investigators got from A to Z. Here we share some issues and solutions for reproducible research — some easy and some much harder to implement — that we've come across in our bioinformatics work.
The main idea behind reproducible research for computational projects is that publishing a paper is only part of the story; we should also be sharing the data, software, and methods used to produce all of the analysis, statistics, and figures. This implies that a knowledgeable reader, given this information, could then reproduce the findings of the author. Many publications may achieve part of this goal, but very few go far enough, often still requiring some leaps of faith (or trial and error) for one to replicate a computational analysis. Electronic publications that include supplementary information make it easier to implement reproducible research, since one can attach detailed data, methods, code, and other sorts of files to a publication. So why is published computational biology research still so much of a black box? It's easy to see why this happens: it takes good organization and a lot of work to clearly present every step of an analysis from start to finish. We aren't any better at doing this right than others, but we've thought about some ways that seem to move us in the right direction.
Where's the data?
To reproduce any analysis, we need the data. The growth of biological data repositories is a major step in the right direction for big data sets. Requirements for submission of DNA and protein sequences to GenBank started things off, and more recent requirements for submission of microarray data to GEO or ArrayExpress, for example, make it much easier to find data from recent publications. Small data sets, however, have somehow gotten lost in the shuffle. Where is the data behind that bar chart in the paper you just read? If we want to try another statistical test on the data from that figure, we need all the data points, not just summary measures. It would be great if all the numbers behind each figure were somewhere nearby, such as the website of a repository, publisher, or author. Whichever is chosen, we'd like to have a Web address that will still be around years from now.
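One way to keep every data point behind a small figure is to write the points out next to the figure itself. The following is only a minimal sketch in Python: the function names, the column layout, and the measurements are all invented for illustration and do not come from any real study.

```python
import csv
import statistics


def write_figure_data(rows, handle):
    """Write one line per measurement (group, replicate, value) as CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["group", "replicate", "value"])
    for group, replicate, value in rows:
        writer.writerow([group, replicate, value])


def summarize(rows):
    """Return {group: (mean, sample standard deviation)}, as a bar chart would show."""
    by_group = {}
    for group, _replicate, value in rows:
        by_group.setdefault(group, []).append(value)
    return {
        group: (statistics.mean(values), statistics.stdev(values))
        for group, values in by_group.items()
    }


# Hypothetical measurements, invented for illustration only.
example_rows = [
    ("control", 1, 4.0), ("control", 2, 5.0), ("control", 3, 6.0),
    ("treated", 1, 8.0), ("treated", 2, 9.0), ("treated", 3, 10.0),
]
```

Sharing the CSV written by `write_figure_data` alongside the chart lets a reader recompute the bars with `summarize` or run a different statistical test on the same points.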
Since most people would like to be able to start at the beginning, we also have to decide what we're calling raw data. For microarrays and high-throughput sequencing, most people seem to agree that a numerical matrix or a set of sequences, respectively, is more useful than the images behind them. Also, lots of popular databases are updated frequently (as often as daily), so if our analysis started with all mouse RefSeq transcripts, either we or NCBI has to provide a snapshot of the database reflecting its status at that specific date.
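If we keep our own copy of a database download, it helps to record exactly which file we used and when we retrieved it. Here is a rough sketch of such a record in Python; the function names, the manifest fields, and the choice of a SHA-256 checksum are our own assumptions, not a convention of NCBI or any other repository.

```python
import datetime
import hashlib
import json
import os


def describe_snapshot(path, source, retrieved=None):
    """Return a dict recording which downloaded file we used and when we got it."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MB chunks so large sequence files need not fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    if retrieved is None:
        retrieved = datetime.date.today()
    return {
        "file": os.path.basename(path),
        "source": source,  # free-text note on where the file came from
        "retrieved": retrieved.isoformat(),
        "bytes": os.path.getsize(path),
        "sha256": digest.hexdigest(),
    }


def write_manifest(records, manifest_path):
    """Save the snapshot records as JSON so they can travel with the analysis."""
    with open(manifest_path, "w") as handle:
        json.dump(records, handle, indent=2, sort_keys=True)
        handle.write("\n")
```

A reader who later downloads the same database can compare checksums to see at a glance whether their copy matches the one behind the published analysis.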
How did they do the analysis?
Effectively documenting and sharing every step of an analysis pipeline is another key part of reproducible research. How did we get from the raw to the processed data, and could we quickly re-run the whole analysis with a modified set of data? Even if we only used publicly available bioinformatics and Linux tools, what order of processing did we follow, and which parameters did we use for each command? We've been getting more successful at documenting all projects at a high level with a shell script of commands, commented to describe why each step was performed. This works fine for our group, since we're all working on the same computer system, but it's not quite enough to share with others.
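The same idea as a commented script of commands can be sketched in Python: each step carries the reason it was run, the exact command, and its parameters, and everything is logged in order. This is an illustration under our own assumptions, not the script our group uses; the two steps are placeholders that only call the Python interpreter, and a real pipeline would list its own tools.

```python
import shlex
import subprocess
import sys

# Each step pairs the reason for running it with the exact command used.
# These steps are placeholders standing in for real analysis tools.
STEPS = [
    ("Confirm which interpreter runs the pipeline",
     [sys.executable, "--version"]),
    ("Stand-in for an analysis step with explicit parameters",
     [sys.executable, "-c", "print('normalized 3 samples')"]),
]


def run_pipeline(steps, log):
    """Run steps in order, writing each reason, command, and output to log."""
    for number, (reason, command) in enumerate(steps, start=1):
        log.write(f"# Step {number}: {reason}\n")
        log.write(shlex.join(command) + "\n")
        # check=True stops the pipeline at the first failing step.
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        log.write(result.stdout)
```

Calling `run_pipeline(STEPS, sys.stdout)` prints the annotated record; passing an open file instead keeps the record with the results, so re-running with a modified data set means changing one list rather than retyping commands.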
If we really want to do this seriously, we'd have to include information about the programs we're calling (Where did we get 'cluster'? What version?) and what computer we're using (Linux? Windows? Does it work the same on both?). Also, if we want to do this right, we'd probably want a system that deletes the old files before creating new ones. We can really run into trouble when we use programs from a graphical interface. The easy-to-say solution would be "use only command-line tools," but we (and especially our bench colleagues) happen to like some GUI packages. Even though this prevents us from ever having one-click reproducibility, as long as we document what we do, every step can still be reproduced. Another potential wrench in the works is the use of software requiring a license. Is it reasonable to expect others in your field to have access to Matlab? How about Transfac?
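For the command-line part of a pipeline, the questions above about program versions, operating system, and stale output files can be answered by the pipeline itself. The sketch below is one possible approach in Python; the function names and report layout are our own invention, and it assumes each tool prints its version when given a flag that the caller supplies, which is not true of every program.

```python
import os
import platform
import shutil
import subprocess


def environment_report(tools):
    """Describe the machine and the tools a pipeline calls.

    `tools` maps a label to the command that prints that tool's version.
    """
    report = {
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    for label, command in tools.items():
        location = shutil.which(command[0])
        if location is None:
            report[label] = "not found on PATH"
            continue
        result = subprocess.run(command, capture_output=True, text=True)
        # Some tools print their version to stderr rather than stdout.
        lines = (result.stdout or result.stderr).strip().splitlines()
        version = lines[0] if lines else "unknown"
        report[label] = f"{location}: {version}"
    return report


def fresh_output_dir(path):
    """Delete any earlier results at `path`, then recreate it empty."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path)
    return path
```

Saving the report next to the results records where each program came from and which version ran, and starting from `fresh_output_dir` ensures that no file left over from an earlier run is mistaken for new output.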
Where's the custom code?
A lot more work is involved in reproducibility if we're writing our own analysis routines. Even assuming we've shown the details used to run each command, we'll still probably forget the significance of 'T' as the third argument. As we get more disciplined, we've been writing code so that running the command with no arguments prints a message describing how to use it and (if we're trying extra hard) a description of the format of each input file. Is there a block diagram or pseudocode that summarizes the algorithm? Maybe all someone wants is a detailed description of what's going on. Hopefully we can release our software as open source, but this may mean that we want to spend extra time cleaning up the ugly bits.
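A self-describing command of this kind might look like the following sketch, which uses Python's standard argparse module. The tool name `filter_genes`, its options, and the input format are hypothetical, chosen only to show a named flag replacing a cryptic positional 'T'.

```python
import argparse


def build_parser():
    """Build a parser whose help text documents options and input format."""
    parser = argparse.ArgumentParser(
        prog="filter_genes",  # hypothetical tool name
        description="Keep genes whose expression passes a threshold.",
        epilog="Input format: tab-delimited text, one gene per line: "
               "gene ID, then one expression value per sample.",
    )
    parser.add_argument("infile", help="tab-delimited expression matrix")
    parser.add_argument("--min-expression", type=float, default=1.0,
                        help="lowest value to keep (default: %(default)s)")
    parser.add_argument("--log-transform", action="store_true",
                        help="log2-transform values before filtering; a named "
                             "flag instead of a bare 'T' as the third argument")
    return parser


def main(argv):
    """Print usage when run with no arguments; otherwise echo the settings."""
    parser = build_parser()
    if not argv:
        # Run by itself: explain usage rather than fail cryptically.
        parser.print_help()
        return 1
    args = parser.parse_args(argv)
    print(f"infile={args.infile} min_expression={args.min_expression} "
          f"log_transform={args.log_transform}")
    return 0
```

Saved as a script, this would typically end by passing `sys.argv[1:]` to `main` under the usual `if __name__ == "__main__":` guard. The real filtering is left out; the point is that the help message and the echoed settings document every parameter of a run.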
Even if we share our code, few will want to wade through it unless there's good enough commenting to guide them. If our cryptic code lacks comments, we could argue that others can still reproduce the analysis, but we're not being very helpful. Meanwhile, should we be sharing source code or executables? Hopefully both, so others can get the program up and running without compile-time troubles — and look at the source in more detail if they want. If we really want to write robust, easy-to-run code that others will use and cite, we may have to spend some time reading that software engineering book after all. Lastly, many hesitate to release code simply because they don't want to be contacted with questions or problems. We'll need someone who can answer e-mails or correct bugs years after publication.
How did they make those figures?
A final part of reproducible research is creating reproducible figures. Because a single figure can often involve several rounds of processing, including spreadsheet, statistical, and/or graphics programs, this area of reproducible research is generally the hardest for us to achieve. The R statistics package can generate beautiful, complex figures, but learning how to write the code to do so has taken us a long time. As a result, our only practical compromise for partially reproducible figure generation in many projects is sharing the file from the spreadsheet or statistical application that turned our numbers into graphics.
We're convinced that doing reproducible bioinformatics research — even if we don't advertise it in publications quite yet — is a good thing for us and our group. It's much easier to do if it's set as a priority at the beginning of a project, and at the end we'll be that much closer to making the publication reproducible. Some have even found that reproducible research tends to be cited more than non-reproducible publications. If we can increase the level of reproducibility in the bioinformatics literature, it'll make every publication even more valuable to the community.
Fran Lewitter, PhD, is director of bioinformatics and research computing at the Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a senior bioinformatics scientist in Fran's group.