CHICAGO – The recently released Amazon Genomics CLI, an open-source software tool for managing and processing large-scale genomics data on the Amazon Web Services cloud, is part of the internet giant's previously stated goal of alleviating computational "heavy lifting" for its customers in genomics and life sciences.
Pat Combes, AWS worldwide technical leader for healthcare and life sciences, described Amazon Genomics CLI as another "on ramp" to AWS for those looking to build or grow large-scale sequencing programs quickly without in-house bioinformatics infrastructure or expertise.
CLI stands for command-line interface, a common term in computer programming to describe text interfaces between users and operating systems.
AWS announced a preview of the Genomics CLI in July and released it to the open-source community on Sept. 27.
Amazon Genomics CLI specifically is meant to improve integration of a wide range of popular bioinformatics software with AWS services, including the Amazon Elastic Container Service, Amazon Elastic Compute Cloud (EC2), Amazon DynamoDB, and the Amazon Simple Storage Service. Conceptually, Combes likened Genomics CLI to AWS ParallelCluster, an open-source application for managing high-performance computing clusters on the Amazon cloud.
"You are using it to drive the creation of a lot of resources on AWS or whatever is necessary in terms of EC2 storage allocation and so on," Combes said. "Then you bring your workflows into it … and it just runs."
Combes said that Genomics CLI lets users bring tools such as the popular Nextflow workflow management system onto the AWS cloud without needing deep informatics infrastructure.
The CLI is free, but users pay for other AWS resources they might need. Combes said that it is not so much a loss leader, but a way for users to improve their management and consumption of those other products.
While other AWS resources and products seem to work well across industries, the company saw a need to develop the Genomics CLI specifically to support petabyte-scale data processing, according to Combes.
"We're really hoping that it helps take our customers a little farther than they were before, and … helping them set up those at-scale programs that they need to run," Combes said.
He also said that life sciences customers tend to fall into two camps.
"We have a lot of customers who have built out significant, quite large-scale genomic sequencing programs on AWS and done so successfully. And then we've had a number of customers who were having trouble just getting started with that."
An early adopter is the University of California, Santa Cruz, Genomics Institute. UC-Santa Cruz is not a small organization, and it has a well-established genomics program, but it is the type of entity Combes wants to serve because the university has a lot of data but needs help discovering and running genomics workflows.
"In order to get them to execute [their genomics program] successfully, they need to leverage a lot of easily accessed and expandable resources, which we provide," Combes said.
"They have a good idea of what they want to do, but they don't know how to effectively get started without consuming too many resources," Combes said. Such users often do not know how to manage their genomics computing needs well and end up with what he called a "real inconsistent experience."
Added Combes, "The CLI is really meant to bring some uniformity and consistency and governance to that approach."
The UC-Santa Cruz Genomics Institute, which builds computational platforms for genomics, runs most of its workflows on AWS.
The UCSC Genomics Institute led the development of Dockstore, an open platform that allows bioinformatics researchers to share Docker-based genomics workflows. It was funded by the National Institutes of Health, so it must follow the FAIR principles of being findable, accessible, interoperable, and reusable.
Timothy Harris, director of the computational genomics platform within the UC-Santa Cruz Genomics Institute, said that his organization has been talking with AWS for perhaps nine months about integrating Dockstore with the CLI. With the CLI, "you can use Dockstore to find any existing workflow that bioinformatics engineers have put into the repository and then execute them directly on the CLI," he explained.
Harris said that the Amazon Genomics CLI specifically makes it easy for organizations like his institute to run the Global Alliance for Genomics and Health (GA4GH) Workflow Execution Service. "Since we are interested in building repositories for bioinformatics workflows, we need ways to execute them, and [the CLI] allows us to do that from a perspective that's pretty lightweight," he said.
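For readers unfamiliar with the Workflow Execution Service, the following is a minimal sketch of what a client request to a WES endpoint might look like. It is not taken from AWS or UCSC code. The form-field names and the `/runs` paths follow the GA4GH WES 1.0 specification as commonly documented and should be checked against the spec; the function names, the `example.org` URLs, and the workflow parameters are hypothetical. The sketch only builds the request and sends nothing over the network.

```python
import json


def build_wes_run_request(base_url, workflow_url, workflow_type, type_version, params):
    """Return (method, url, form_fields) for a WES "run workflow" call.

    Nothing is sent; a caller would POST the fields as multipart/form-data
    using an HTTP client of their choice.
    """
    url = base_url.rstrip("/") + "/runs"
    form_fields = {
        # Field names assumed from the GA4GH WES 1.0 RunWorkflow request.
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": type_version,
        # WES expects the workflow inputs as a JSON-encoded string.
        "workflow_params": json.dumps(params, sort_keys=True),
    }
    return "POST", url, form_fields


def wes_status_url(base_url, run_id):
    """Return the URL a client would GET to poll the state of a run."""
    return f"{base_url.rstrip('/')}/runs/{run_id}/status"


if __name__ == "__main__":
    # Hypothetical endpoint and workflow, for illustration only.
    base = "https://wes.example.org/ga4gh/wes/v1"
    method, url, fields = build_wes_run_request(
        base, "https://example.org/workflows/hello.wdl", "WDL", "1.0", {"name": "world"}
    )
    print(method, url)
    print(fields)
    print(wes_status_url(base, "abc123"))
```

Because the request shape is the same for any WES-compliant service, a workflow found in a repository such as Dockstore could in principle be submitted to different back ends by changing only the base URL.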
Harris noted that users pay for computing resources whenever they access cloud space, so the AWS model is not unique. Under a series of agreements with NIH, AWS stores copies of public genomic datasets that researchers can use for free. They just pay for the computing resources to run analyses.
"Our goal is generally to provide computational platforms that can accelerate bioinformatics research or genomics research across the globe," Harris said. "We use [the CLI] internally because it's really easy to spin up and access."
The UCSC Genomics Institute works with both microbiologists and computer scientists, and each constituency tends to lack skills the other has.
"For us, working with CLIs is just kind of a natural process and how we do most of our work already, so we find it really useful to have something [like Amazon's CLI] that is really easy to spin up and access," Harris said. "We appreciate the lightweight nature of a CLI-based approach."
Beyond integrating Dockstore, Harris and his team have not decided on any future uses for Amazon Genomics CLI.
"We're still getting used to the platform and using it internally," he said. "But yeah, it's the beginning of a collaboration with Amazon that we hope to continue."
Combes said that he envisions the CLI as providing customers with more resources to use within EC2, such as extending AWS to the full extent of the GA4GH Workflow Execution Service application programming interface.
Combes said that the CLI fully implements the Workflow Execution Service and will help GA4GH improve the standard by providing feedback from AWS' own experience with the API.
Combes said that the CLI is also meant to help genomics users manage other AWS costs. For example, the service might drive customers to the AWS spot market for EC2, where users "can acquire high-powered resources at really low cost for containerization," he explained.
"All these things are meant to help them structure their workloads so they can take advantage of the widest set of resources at the lowest possible cost," Combes added.