Pacific Northwest National Laboratory's (PNNL) Ronald Taylor has published an overview of Hadoop, the popular open-source software framework that supports data-intensive distributed applications. Taylor's paper in BMC Bioinformatics looks at how Hadoop has been adopted by the bioinformatics community, with a specific focus on next-generation sequencing.
Hadoop, an open-source implementation of MapReduce, the programming paradigm Google developed for processing huge datasets, offers a cost-effective way to analyze data on commodity Linux clusters and in the cloud. Taylor also discusses some of the major open-source projects built on top of Hadoop, including Hive, a framework for ad hoc querying with an SQL-like query language, and Pig, a high-level data-flow language for batch processing of data.
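The MapReduce paradigm the paper surveys boils down to two user-supplied functions: a map step that emits key-value pairs and a reduce step that aggregates all values sharing a key, with the framework handling the grouping (shuffle) in between. A minimal sketch in plain Python, using the classic word-count example (the function names here are illustrative, not part of any Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group emitted pairs by key, as the framework does between phases."""
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(key, values):
    """Aggregate all counts observed for one word."""
    return key, sum(values)

def mapreduce(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))

counts = mapreduce(["the cat sat", "the dog sat"])
```

In a real Hadoop job the map and reduce functions run in parallel across the cluster and the shuffle moves data between nodes; the sequential sketch only shows the contract each function must satisfy.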
The Magellan project, a joint research effort of the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory and the Leadership Computing Facility at Argonne National Laboratory (ANL), runs Hadoop and HBase, a non-relational distributed database, on a cluster at NERSC; the cluster has been used to run BLAST computations with Hadoop in streaming mode. NERSC is also evaluating Hadoop together with solid-state storage, a low-energy memory technology being explored by the HPC community.
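Streaming mode is what lets a job like BLAST plug into Hadoop without being rewritten in Java: the mapper and reducer are arbitrary executables that read lines on stdin and write tab-separated key-value lines on stdout, with Hadoop sorting the mapper output by key before feeding it to the reducer. A hypothetical sketch of that stdin/stdout contract, simulated in one process with a word-count pipeline (in a real job the two functions would be separate scripts passed via the streaming jar's -mapper and -reducer options):

```python
from io import StringIO

def mapper(stdin, stdout):
    """Streaming mapper: emit one 'key<TAB>value' line per word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    """Streaming reducer: input arrives sorted by key, so equal keys
    are adjacent and can be summed with a single running total."""
    current, total = None, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = key, 0
        total += int(value)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")

# Simulate the job: map, then sort (Hadoop's shuffle), then reduce.
map_out = StringIO()
mapper(StringIO("the cat sat\nthe dog sat\n"), map_out)
shuffled = "".join(sorted(map_out.getvalue().splitlines(keepends=True)))
result = StringIO()
reducer(StringIO(shuffled), result)
```

Because the protocol is just lines of text, the same pattern works for wrapping an existing binary such as BLAST: the mapper partitions query sequences and invokes the tool, and the reducer collects its output.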
Taylor concludes that "for much bioinformatics work not only is the scalability permitted by Hadoop and HBase important, but also of consequence is the ease of integrating and analyzing various large, disparate data sources into one data warehouse under Hadoop, in relatively few HBase tables."
For a good breakdown of Hadoop and the history of MapReduce, check out this video: