NEW YORK (GenomeWeb) – Electrical engineering and computer science researchers at the University of California, Berkeley have developed ADAM, an open source resource that offers a data format, application programming interfaces, and command line tools for analyzing and processing genomic data in a distributed environment.
Frank Nothaft, a graduate student in computer science at UC Berkeley and one of ADAM's developers, presented the tool during one of the sessions at this year's Bioinformatics Open Source Conference, one of the special interest group meetings held prior to the start of this week's Intelligent Systems for Molecular Biology conference in Boston. He explained during his talk that he and his colleagues developed ADAM to provide a processing pipeline that could make efficient use of both cluster and cloud infrastructure; provide a data format that has efficient parallel/distributed access across platforms; and support more flexible data access patterns.
ADAM is one part of a larger project called Big Data Genomics, which focuses on tools for analyzing genomic data. In addition to ADAM, developers involved in the project are working on so-called BDG formats, which provide data schemas used to represent reads, variant calls, assembly contigs, and annotation. They are also developing software for sequence assembly, variant calling, and RNA analysis.
About six months ago, Nothaft and colleagues published a technical report on ADAM. Some of the information it contains is now out of date, Nothaft told BioInform in a conversation after his presentation, but the document still provides a good technical description of ADAM, its components, and the need that it's meant to address.
Specifically, ADAM addresses some of the limitations of current genomics data formats and processing pipelines, which are not designed to scale to large datasets, the report stated. The SAM/BAM formats "were intended for single node processing," they wrote. Although "there have been attempts to adapt BAM to distributed computing environments … they see limited scalability past eight nodes." Additionally, in the absence of an explicit data schema, "there are well known incompatibilities between libraries that implement SAM/BAM/VCF data access."
Some researchers have tried to adapt current file formats like BAM to work with big data processing systems, "but the style of compression that they use and the way they are saved on disks fundamentally limits the performance of the system," Nothaft explained. "We've defined a schema-based format which allows us to operate at a higher level." ADAM stores data using an efficient columnar format called Parquet, a storage format created by Twitter and Cloudera that is designed for "distribution across multiple computers with high compression," according to the UC Berkeley report.
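To illustrate why a columnar layout tends to compress well, the following minimal Python sketch pivots a handful of row-oriented records into columns and run-length encodes the repetitive ones. It does not use Parquet itself; the record fields (`contig`, `start`, `mapq`) and the helper names are hypothetical and chosen only for illustration.

```python
from itertools import groupby

# Hypothetical, simplified read records; field names are illustrative only
# and are not taken from ADAM's actual schema.
rows = [
    {"contig": "chr1", "start": 100, "mapq": 60},
    {"contig": "chr1", "start": 105, "mapq": 60},
    {"contig": "chr1", "start": 230, "mapq": 60},
    {"contig": "chr2", "start": 40, "mapq": 30},
    {"contig": "chr2", "start": 77, "mapq": 30},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per field (column-oriented)."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Values of the same field sit next to each other, so repeats collapse.
encoded_contigs = run_length_encode(columns["contig"])
encoded_mapq = run_length_encode(columns["mapq"])
```

Here the five `contig` values collapse to two pairs, `("chr1", 3)` and `("chr2", 2)`. A second effect of the layout is that a query touching only one field, such as `columns["start"]`, does not have to read the others.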
"The big benefit there is that it provides very high-performance file access and … good compression," Nothaft said. "We are able with this file format to achieve a 5 to 25 percent reduction of size in the [data] files [and] that’s attractive to people who want to archive data." However, for users who prefer the old data formats, ADAM is compatible with those as well, he added.
ADAM also offers two APIs. The first, according to the report, is a data format/access API implemented on top of Parquet and Apache Avro, a cross-platform, cross-language serialization format. The second is a data transformation API implemented on top of Apache Spark, an open source, high-performance, in-memory computing framework that provides data schema access in multiple programming languages and improves performance through in-memory caching and by reducing disk I/O, among other benefits.
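As a rough illustration of what an explicit schema buys, the sketch below writes a record schema in Avro's JSON style as a plain Python dict and checks records against it with a small hand-rolled function. It does not use the Avro library, it handles only a few primitive types, and the record name `ExampleRead` and its fields are hypothetical rather than ADAM's actual schema.

```python
# Avro-style record schema written as a plain dict. The record and field
# names are hypothetical and are not ADAM's actual schema.
READ_SCHEMA = {
    "type": "record",
    "name": "ExampleRead",
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "sequence", "type": "string"},
        {"name": "mapq", "type": ["null", "int"]},  # union: value may be absent
    ],
}

# Mapping from a few Avro primitive type names to Python types.
PRIMITIVES = {"string": str, "int": int, "long": int, "null": type(None)}


def conforms(record, schema):
    """Return True if record has exactly the schema's fields with allowed types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    for f in fields:
        # A list of types is a union; a single name is treated as a union of one.
        allowed = f["type"] if isinstance(f["type"], list) else [f["type"]]
        if not any(isinstance(record[f["name"]], PRIMITIVES[t]) for t in allowed):
            return False
    return True


good = {"contig": "chr1", "start": 100, "sequence": "ACGT", "mapq": None}
bad = {"contig": "chr1", "start": "100", "sequence": "ACGT", "mapq": 60}
```

With the schema stated up front, `conforms(good, READ_SCHEMA)` passes while `bad` is rejected because `start` is a string. That is the kind of disagreement that otherwise surfaces as an incompatibility between libraries reading the same file.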
In developing the API, the goal was to provide "some of the canonical transformations that people are using in human genetics processing pipelines on top of a scalable computing platform," Nothaft explained. "We started by building out some of the pipeline stages that are commonly used in the [Genome Analysis Toolkit's] best practices on top of Apache Spark [and] at this point, we have several of those implemented."
So far, he said, "we've demonstrated full concordance for base quality score recalibration, as well as duplicate marking, and we are developing an indel aligner, as well as downstream variant calling tasks." The team is also working on operations to validate data, run data quality checks, and analyze variants, he said.
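For readers unfamiliar with duplicate marking, the Python sketch below shows the general idea in a heavily simplified form: reads that map to the same contig, start position, and strand are grouped, the one with the highest summed base quality is kept, and the rest are flagged. This is not ADAM's or the GATK's implementation; production tools also account for clipping, read pairs, and library information. The function and field names are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Flag all but the best-scoring read at each (contig, start, strand).

    Simplified sketch: the score is the sum of base qualities, and ties are
    broken in favor of the read that appears first in the input.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["contig"], read["start"], read["strand"])].append(read)

    marked = []
    for group in groups.values():
        best = max(group, key=lambda r: sum(r["quals"]))
        for r in group:
            # Copy the read and add the flag rather than mutating the input.
            marked.append({**r, "duplicate": r is not best})
    return marked


# Hypothetical example reads; "a" and "b" share a position and strand.
reads = [
    {"name": "a", "contig": "chr1", "start": 100, "strand": "+", "quals": [30, 30]},
    {"name": "b", "contig": "chr1", "start": 100, "strand": "+", "quals": [40, 40]},
    {"name": "c", "contig": "chr1", "start": 200, "strand": "+", "quals": [10, 10]},
]
flags = {r["name"]: r["duplicate"] for r in mark_duplicates(reads)}
```

In this example only read `a` is flagged, since `b` has the higher quality sum at the shared position and `c` has no competitor. Because each group is processed independently, a step like this lends itself to being partitioned by position key across machines.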
Finally, command line access gives more expert users the option to directly "apply all of the transformations that we have, as well as all of the data conversion and sanity checking steps that we've built" without going through the API, Nothaft said.
ADAM's developers have been working on the resource for just over a year. So far, besides the UC Berkeley team, researchers at eight institutions are contributing to ADAM's development, Nothaft told BioInform. Some researchers are also preparing to use it in a pilot project.
At present, the development team is gearing up to release a production-ready version of ADAM by the end of this year that will offer a complete pipeline from alignment through to variant calling. The code is available right now from the Maven repository, but it's very much in a "power user condition," Nothaft said. "We are hoping to release packages for the software so that it will be easy to install." The developers are also working to benchmark the system against best practices pipelines such as the GATK and FreeBayes, he said, and to make ADAM interoperable with the API provided by the Global Alliance for Genomics and Health.
The developers plan to publish a peer-reviewed paper about ADAM by the end of the year.