In 2003 and 2004, Google published papers describing the Google File System and its MapReduce programming model for handling big data sets. Impressed by what he had read, open-source software developer Doug Cutting soon began work on an open-source distributed computing platform that includes an implementation of MapReduce. He named it Hadoop, after his four-year-old son's stuffed elephant.
The list of organizations and companies currently using Hadoop as the foundation of their services reads like a who's who of Web stalwarts: Amazon, Facebook, LinkedIn, and Twitter all use Hadoop. Cutting's platform enables applications to work with petabytes of data distributed across thousands of nodes. That capability, along with the fact that it's free, runs well on Linux clusters, and requires no specialized hardware, is the primary reason for its increasing adoption in the bioinformatics community.
The University of Maryland's Michael Schatz helped introduce the life sciences community to Hadoop when he released his CloudBurst software in 2009. CloudBurst is a parallel read-mapping algorithm optimized for mapping next-generation sequencing data to reference genomes, and is one of a slew of bioinformatics programs populating the ecosystem that has grown up around Hadoop, which is maintained by the Apache Software Foundation. This ever-growing collection includes applications, databases, application programming interfaces, and data-flow languages. Like Schatz's other Hadoop-based applications, CloudBurst can run not only on Linux clusters, but also on the cloud — which is another reason for its popularity among bioinformaticians.
"We've definitely seen an uptake in adopting Hadoop in the life sciences community, mostly targeting next-generation sequencing, and simple read mapping because what [developers] discovered was that a number of bioinformatics problems transferred very well to Hadoop, especially at scale," says Deepak Singh, principal product manager of Amazon EC2 at Amazon Web Services. "The other area where we're starting to see lots of interest is pharmaceutical companies using Hadoop for exploratory analysis because one of the nice things that it does is it allows you to forget about data format and just collect all of this data without necessarily knowing what you're looking for. Then you can start building hypotheses."
Another reason for Hadoop's growing adoption rate in the research community is that it is relatively easy for novices in parallel programming or petabyte-scale databases to develop programs for their specific needs that run seamlessly on the framework. While the initial implementations of Hadoop required users to have sharp Java programming chops, newer releases and high-level abstraction layers developed within the ecosystem have made writing programs more accessible. One such tool is the recently released Pydoop, which enables non-experts in distributed programming to write Python code that runs on Hadoop.
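To give a flavor of what Python-on-Hadoop code looks like without reproducing any particular library's API, the sketch below follows the Hadoop Streaming convention, in which the mapper and reducer are ordinary scripts that read lines and write tab-separated key-value pairs. Everything here is illustrative: the SAM-like input, the function names, and the in-memory "shuffle" all stand in for a real cluster.

```python
from io import StringIO

def mapper(stream, out):
    """Streaming-style mapper: one SAM-like line in, 'chrom<TAB>1' out."""
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        # Column 3 of a SAM record holds the reference (chromosome) name.
        out.write(f"{fields[2]}\t1\n")

def reducer(stream, out):
    """Streaming-style reducer: Hadoop delivers the reducer's input
    sorted by key, so counting runs of identical keys is sufficient."""
    current, count = None, 0
    for line in stream:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                out.write(f"{current}\t{count}\n")
            current, count = key, 0
        count += int(value)
    if current is not None:
        out.write(f"{current}\t{count}\n")

# Emulate `cat reads.sam | mapper | sort | reducer` locally:
sam = "r1\t0\tchr1\t100\nr2\t0\tchr2\t200\nr3\t0\tchr1\t300\n"
mapped = StringIO()
mapper(StringIO(sam), mapped)
shuffled = "".join(sorted(mapped.getvalue().splitlines(keepends=True)))
reduced = StringIO()
reducer(StringIO(shuffled), reduced)
print(reduced.getvalue())
```

On a real cluster the framework, not the script, handles the sorting and the distribution of keys across reducer nodes; the scripts themselves stay this simple.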
"A lot of parallel code tends to break down at scale when you have lots and lots of data, but Hadoop takes care of that for you — it's built into the framework and you can take away the little bits you worry about, so that's why it's becoming popular in bioinformatics," Singh says. "As data sets become bigger and bigger, the bioinformatics community is going to look at Hadoop as more of an end-to-end data management platform."
One example of such an end-to-end solution is Fluxion's IonFlux drug discovery platform combined with the Amazon cloud, which uses the Hadoop framework. Data comes straight off the instrument into a pipeline to the cloud, where users can manage everything from assembly to analysis to variant calling, as well as data warehousing.
Luca Pireddu, a researcher in the Distributed Computing Group at Italy's Center for Advanced Studies, Research and Development in Sardinia, added yet another bioinformatics tool to the Hadoop community in June. Called SEAL, Pireddu's new application is a scalable tool for mapping short read pairs and removing duplicates. On a 16-node Hadoop cluster, SEAL processes about 13 gigabytes per hour in map+rmdup mode, which maps reads and removes potential PCR duplicates, and reaches a throughput of about 19 gigabytes per hour in mapping-only mode.
"The volume of data that bioinformatics applications have to analyze is steadily growing such that it has become desirable, and often necessary, to build distributed bioinformatics applications to be able to compute results in reasonable times," Pireddu says. "Hadoop and MapReduce are well suited to many bioinformatics applications, and in particular, the ones that handle large data sets [because] they can easily handle entire genomes and even greater amounts of data when given a reasonable number of nodes on which to run. … On the contrary, many common tools cannot take advantage of additional hardware, even if it is made available."
The initial challenge developers face is deciding whether Hadoop and the MapReduce model are the best solutions for their problems. Essentially, developers must determine where the bottleneck is in their particular workflow — for example, whether processing power, memory size, or disk bandwidth is causing the holdup. Next, potential adopters need to learn to design algorithms within the MapReduce model, Pireddu says. "This model imposes that algorithms be composed of two distinct steps: map and reduce," he says. "The map function takes the input records and transforms them into intermediate key-value pairs; then, for each unique key, the reduce step is invoked with all the values associated with that particular key. … Most developers may not be used to seeing algorithms in this way, but the ones who have had experience with functional programming will surely feel more comfortable in this environment."
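The two-step decomposition Pireddu describes can be made concrete with a small, self-contained sketch. Here the `run_mapreduce` helper is a toy stand-in for Hadoop's shuffle-and-sort phase, and the task, counting k-mer occurrences across reads, is merely one illustrative example of a problem that fits the model naturally; the function names are hypothetical, not any framework's API.

```python
from itertools import groupby
from operator import itemgetter

K = 3  # k-mer length; real applications typically use much larger k

def map_fn(read):
    # Map step: transform one input record (a read) into
    # intermediate key-value pairs, one (k-mer, 1) per position.
    for i in range(len(read) - K + 1):
        yield read[i:i + K], 1

def reduce_fn(kmer, counts):
    # Reduce step: invoked once per unique key with all of its values.
    yield kmer, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Toy stand-in for Hadoop's map -> shuffle/sort -> reduce phases.
    pairs = [kv for rec in records for kv in map_fn(rec)]
    pairs.sort(key=itemgetter(0))  # the "shuffle and sort" by key
    out = {}
    for key, group in groupby(pairs, key=itemgetter(0)):
        for k, v in reduce_fn(key, (v for _, v in group)):
            out[k] = v
    return out

counts = run_mapreduce(["GATTACA", "TTACA"], map_fn, reduce_fn)
print(counts)  # {'ACA': 2, 'ATT': 1, 'GAT': 1, 'TAC': 2, 'TTA': 2}
```

Because each map call sees only one record and each reduce call sees only one key's values, the framework is free to scatter both steps across thousands of nodes, which is exactly why the model holds up at scale.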
[Sidebar] Hadoop Tools
In addition to read mapping, researchers have also developed Hadoop tools for phylogenetic analysis, data-intensive bioinformatics workflows, and sequence alignment — for which there have been a number of efforts porting Blast to Hadoop clusters and the cloud. The Distributed Computing Group at the Center for Advanced Studies, Research and Development in Sardinia has implemented Blast and Gene Set Enrichment Analysis in Hadoop using a Python wrapper for the National Center for Biotechnology Information C++ Toolkit. That same group developed Biodoop, a publicly available suite of parallel bioinformatics applications based upon Hadoop.
Andrea Matsunaga at the University of Florida has also used Hadoop to create CloudBLAST, a parallelized version of the NCBI Blast2 algorithm. Matsunaga tested CloudBLAST against the publicly available version of mpiBlast — arguably the most popular parallel version of Blast — and found that CloudBLAST not only exhibited better performance, but was also simpler to develop and maintain.
Here are just a few current Hadoop-based bioinformatics applications:
Crossbow: Whole genome resequencing analysis; SNP genotyping from short reads
Contrail: De novo assembly from short sequencing reads
Myrna: Ultrafast short read alignment and differential gene expression from large RNA-seq data sets
PeakRanger: Cloud-enabled peak caller for ChIP-seq data
Quake: Quality-aware detection and correction of sequencing errors
BlastReduce: High-performance short read mapping
CloudBLAST: Hadoop implementation of NCBI's Blast
MrsRF: Algorithm for analyzing large evolutionary trees