Broad Aims for Portable Pipeline as It Readies AWS-Compatible Version of Cromwell

CHICAGO (GenomeWeb) – A forthcoming update of the Broad Institute's Cromwell genomics workflow execution engine will allow full-fledged installations of the software to run natively on the Amazon Web Services cloud.

The upgrade, which should be out before the end of May, was accelerated by a one-day hackathon last month, featuring participants from as far away as Australia. Representatives from the Broad, Amazon, Australia's Melbourne Genomics Health Alliance, and a few other interested parties gathered in Cambridge, Massachusetts, to write new code aimed at making Cromwell fully compatible with the AWS environment.

The main goal of the hackathon was to get workflows from the Broad-developed Genome Analysis Toolkit (GATK) running on an AWS installation of Cromwell, according to Jeff Gentry, software engineering manager at the Broad. Participants also wanted to pay attention to cost control and what Gentry called "user sanity" by adding the ability for Cromwell to reuse results from earlier analyses, a feature known as "call caching," he said.

"If I run something once and spent the money and I want to run it again, I just want Cromwell to be smart enough to reuse those results," Gentry explained.
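The idea Gentry describes can be sketched in a few lines of Python. This is an illustration of the general technique only, not Cromwell's actual implementation: the names `cache_key`, `CallCache`, and `fake_execute` are hypothetical, and the cache here lives in memory, whereas a real engine would persist results and would likely hash file contents as well.

```python
import hashlib
import json


def cache_key(command, inputs):
    """Build a deterministic key from a command and its input values (hypothetical helper)."""
    # sort_keys makes the key independent of the order the inputs were given in.
    payload = json.dumps({"command": command, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CallCache:
    """In-memory stand-in for a persistent store of earlier call results."""

    def __init__(self):
        self._results = {}
        self.hits = 0
        self.misses = 0

    def run(self, command, inputs, execute):
        """Return a stored result if this exact call ran before; otherwise execute it."""
        key = cache_key(command, inputs)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        self.misses += 1
        result = execute(command, inputs)
        self._results[key] = result
        return result


def fake_execute(command, inputs):
    """Placeholder for an expensive cloud job (assumption for illustration)."""
    return f"{command} on {inputs['sample']}"


cache = CallCache()
first = cache.run("align", {"sample": "s1"}, fake_execute)   # executes the job
second = cache.run("align", {"sample": "s1"}, fake_execute)  # reuses the stored result
```

In this sketch the second call never reaches `fake_execute`, which is the cost saving Gentry refers to: identical work is paid for once.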

The participants also worked on cleaning up various bugs in previous, piecemeal attempts to move Cromwell to AWS, including support for Amazon Simple Storage Service (S3) files in the cloud environment.

Cromwell compatibility with AWS will, among other things, help some of the world's largest clinical genomics projects bring pipelines closer to data stores to inform patient care, according to participants. Those include Melbourne Genomics' GenoVic clinical informatics platform and the genomics programs at Melbourne Genomics member Peter MacCallum Cancer Centre and the Seattle-based Fred Hutchinson Cancer Research Center.

Melbourne Genomics and Peter MacCallum sent representatives all the way from Melbourne, Australia. Fred Hutch was unable to have anyone present in Cambridge, according to Gentry, but was one of the original drivers of the hackathon.

Fred Hutch had a lot of Cromwell use cases it was trying to make work on AWS. The Seattle institution actually held a hackathon of its own in December. "That led to them building out a Cromwell system using Amazon for researchers there, and that's when they quickly ran into some of these problems," Gentry said.

Melbourne Genomics had a similar experience: Michael Milton, a bioinformatician there, found several incompatibility issues with Cromwell on AWS, according to Natalie Thorne, lead specialist in clinical genetics at the Australian genomics alliance. "Enabling full functionality of Cromwell on AWS is a challenge of global importance, both for clinical and research genomics initiatives, so that we can harness the efficiency and scale of cloud computing for bioinformatics and complex systems needs," Thorne said.

The Broad and Amazon also did a lot of preplanning for the hackathon, according to Gentry, with significant help from Melbourne Genomics. "I looked through the Cromwell code to try to pre-identify the likely cause of some of these issues," Gentry said.

A couple of months before the hackathon and half a world away, Milton and a team at Peter MacCallum started working on getting the Broad's Workflow Description Language (WDL) — one of two computer languages that Cromwell supports — running on its internal system, Thorne said.

The Broad developed WDL in the Google Cloud, but Melbourne Genomics runs its informatics mostly in AWS. So, Melbourne Genomics told both Amazon and the Broad that Cromwell needed to work on AWS.

"We're in some sense quite a unique, leading example because we have a clinical system for genomics in GenoVic, which multiple laboratories and hospitals are utilizing. We were really driving hard that this is something that needs to work on AWS," Thorne said. "If we were just researchers, we would probably work around it or try something else, but we wanted something that was going to work for all of our users."

Thorne said that the Broad did make Cromwell work on AWS with some "fairly simple examples," but Melbourne Genomics ran into compatibility problems when it started applying Cromwell to its diagnostic and clinical pipelines, so it reached out for help. That, plus similar activities at Fred Hutch, led to the Cambridge gathering.

Cromwell originally was built for on-premises installation, and Gentry said that it is "very unopinionated as to where you run things" in an on-premises environment. There also is support for the Alibaba Cloud, from Chinese internet giant Alibaba Group, and for Google Cloud. Though the Broad itself runs Cromwell in the Google Cloud, it has partnered with AWS on GATK since 2016.

However, the Broad found that Cromwell simply did not work optimally in AWS. "What if we collaborated with some of those customers, folks from AWS and the Cromwell developers, and tried to close that gap as quickly as possible? Could we get to a place where a lot of these customers were able to really get up and running on AWS?" Gentry said.

The parties had some specific goals for the hackathon. Melbourne Genomics refers to GATK as the "best practice" pipeline for genomics. "We wanted to get to the point where a fully functioning, complex version of GATK would be running on AWS via Cromwell," Thorne said.

The hackathon addressed some of the "important little blockers in the way," according to Thorne.

"What we're trying to do is make what we call a portable pipeline," said Thorne, who oversees innovation and adoption at Melbourne Genomics. "The idea is that [our members] can make a pipeline in a workflow language and then have that run on their local system or run in AWS," she explained.

"We want to make sure that if someone writes a little pipeline, they can run it on their own service, and then when their organizations are comfortable for them to move to the cloud, they can then start using that pipeline in the same form on the cloud," Thorne said. "The problem is that we were having issues running it on people's local clusters as well."

The hackathon allowed the parties to understand one another's skills and develop a path forward, according to Thorne. Since the event on April 18, the Broad has been busy tying up loose ends and trying to merge in all of the independent Cromwell development by users like Melbourne Genomics and Fred Hutch, Gentry said. Work also is continuing on documentation for the new code, following fairly well-defined timeframes and roadmaps.

"The devil is in the details, but at the high level, we got through our stretch goals and started finding new things to work on," Gentry said.

It is still a little premature to know if the partners met the goal of running the "best practices" workflow of GATK in an AWS build of Cromwell.

"We're still in the process of folding all the work into the Cromwell code base. We don't have an example of 'everything's all in one place,' but even if the answer is no, it's not going to be very far away," Gentry said. Prior to the hackathon, that goal seemed "light years away," he added.
