New Standard Seeks to Alleviate Challenges of Sharing Bioinformatics Tools and Workflows

NEW YORK (GenomeWeb) – Hoping to alleviate some of the difficulties associated with accessing and running bioinformatics resources, a group of researchers within the computational community came together last year, volunteering their time and expertise to develop standardized language for describing tools and workflows.

The Common Workflow Language (CWL) provides standard specifications for describing analysis tools and workflows, freeing them from ties to single platforms and to systems based on more restrictive formats. If the standard is adopted by the community, it could make it easier and less tedious for tool authors to pass their pipelines around as well as to install and run them on multiple workflow platforms.

The CWL working group, which spans both industry and academia, met for the first time last year during Codefest, an informal two-day coding meeting that takes place annually prior to the start of the Bioinformatics Open Source Conference (BOSC), one of several special interest group meetings that are part of the Intelligent Systems for Molecular Biology (ISMB) conference.

In the last 12 months, the group has written two drafts of the CWL and is now opening up the most recent draft for public testing by interested tool authors. A full launch of version 1.0 is possible within the next six to 12 months, Michael Crusoe, lead software engineer in the laboratory of C. Titus Brown at the University of California, Davis and de facto project manager of the CWL working group, told GenomeWeb. The idea is to have the community test the specification, identify any gaps or areas for improvement, and hopefully begin using the standard within their platforms.

The CWL grew out of discussions between researchers about the difficulties of getting tools to run on large, well-used workflow systems such as Galaxy, iPlant, and Taverna. Tool developers often have to describe their tools using languages and formats that work well on one system but fail to translate to other platforms where researchers might want their tools to run, Crusoe said.

That leaves bioinformatics developers stuck in the middle, Peter Amstutz, a developer at bioinformatics company Curoverse and one of the contributors to the CWL drafts, said. "They have to either build their own [system], or they have to do [tool descriptions] over and over again" to work with different platforms.

The need for standards is a recurring topic of discussion in the bioinformatics community, and there have been multiple efforts to standardize tools and methods of analyzing data. Although there was general agreement on the need for standardizing tool and workflow descriptions, "in practice, no one did anything about it ... until BOSC codefest last year" when as it turned out "all the right people were in the room," Crusoe said. Contributions to the CWL have come from developers involved in the Galaxy project and from companies like Seven Bridges Genomics and Curoverse, among others, some of whom were already working on standardization efforts within their own institutions and firms.

One resource-sharing option that has generated some interest within the bioinformatics community, and that initially attracted some CWL members, is Docker, an open-source platform for building, sharing, and running distributed applications inside software containers. It is used, for example, in Curoverse's Arvados platform and specifically in its data processing system, which uses Docker containers to hold the components of given pipelines and to define run environments for each component. Also, Michael Barton, a researcher at the Joint Genome Institute who is developing a public repository of genome assemblers and associated benchmarks, makes the assemblers available as Docker images.

In addition, Nebojsa Tijanic, a software engineer at Seven Bridges Genomics and one of the CWL contributors, told GenomeWeb that his company had begun experimenting with Docker internally and was mulling standardizing ways of creating and sharing pipelines using the containers. 

Docker images are available as an optional feature for CWL users. The group offers a reference implementation that provides CWL-described resources as Docker images that can be installed and run locally. However, Docker has limitations, according to some of the CWL working group members that GenomeWeb spoke to. Containers provide pipelines without formal descriptions of how to use the tools packaged inside, Tijanic explained. Missing are details on the commands used for different tasks, the outputs that particular processes yield when executed and their formats, and the data flow from one tool to the next. Those are additional layers of information that can be described with the CWL.

Also, Docker images are snapshots of tools in time and may not provide access to the most current iterations of the tools, and they can have some security issues, Crusoe said. Moreover, he doesn't believe that Docker has gained widespread acceptance within the bioinformatics community, at least not enough to make it the most suitable medium for sharing resources. It's not currently implemented on many academic computing clusters, he told GenomeWeb, nor is it being talked about much.

In coming up with the specifications for the CWL, the working group borrowed ideas from existing projects such as Workflow Forever and Apache Taverna, ensuring that the language actually captures sufficient detail to make workflows properly portable, Amstutz said. "A lot of the work ... focused not necessarily on adding lots of features but on really closely, carefully defining how things work."

The CWL builds on technologies such as JSON and Avro, and it uses Docker to provide portable runtime environments for the described tools and workflows. Specifically, it uses YAML, a superset of the JavaScript Object Notation format, to define tools and workflows, Crusoe said. YAML, a data serialization format, is more readable than the traditionally used Extensible Markup Language format and requires far fewer characters than its counterpart, he said. The CWL includes vocabulary for describing data flow through the workflow; mechanisms for accessing data in files on disk or by streaming; and ways to distribute tasks to run in parallel on compute clusters.
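
To make the idea of a machine-readable tool description concrete, here is a minimal Python sketch. It is not the CWL schema: the field names (`base_command`, `inputs`, `outputs`, `position`, `capture`), the example tool, and the `build_command` helper are all hypothetical and chosen only for illustration. It shows a description written once as plain data, serialized as JSON (which a YAML 1.2 parser can also read), and then turned into a concrete command line by whichever platform consumes it.

```python
import json
import shlex

# Hypothetical description of a command-line tool. The field names are
# illustrative assumptions and are NOT taken from the CWL drafts.
tool_description = {
    "name": "line-counter",
    "base_command": ["wc", "-l"],
    "inputs": [
        {"id": "reads", "type": "File", "position": 1},
    ],
    "outputs": [
        {"id": "line_count", "type": "File", "capture": "stdout"},
    ],
}


def build_command(description, job):
    """Combine a description with concrete input values into an argv list."""
    argv = list(description["base_command"])
    # Append input values in the order declared by their "position" field.
    for spec in sorted(description["inputs"], key=lambda s: s["position"]):
        argv.append(str(job[spec["id"]]))
    return argv


# Serialize once; any platform can load the same text back into plain data.
serialized = json.dumps(tool_description, indent=2, sort_keys=True)
restored = json.loads(serialized)

# A consuming platform supplies the concrete inputs for one run.
command = build_command(restored, {"reads": "sample.fastq"})
print(shlex.join(command))
```

The point of the sketch is the separation of concerns: the description is data rather than platform-specific code, so the same text could in principle be handed to several different execution systems.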

There is some work involved in actually using the CWL to describe workflows, but it shouldn't pose too much of a problem for tool authors who are already required to make their code available as their papers are published. In fact, the extra bit of upfront work will prove worthwhile in the long run, according to Crusoe.

"Right now, as a tool author if I want my tool to be in all these [bioinformatics] systems, I have to describe my tool separately to all of them," he pointed out. "But if you just write the description once using CWL, then you don't have to wait for a dozen communities to come find your tool, decide if it's useful, and [then] write the description for their preferred environment.  It'll be ready for all the compatible environments."

Although it's still a work in progress, CWL's creators have begun adopting it into their platforms. Curoverse, for example, is working on implementing the language in its commercial platform, and hopes to make it the primary language for defining workflows within Arvados, Amstutz said. Meanwhile, Seven Bridges is implementing the CWL within its commercial platform and plans to make it the primary method of describing workflows in the system. It also is implementing the CWL on a separate open-source platform called Rabix, which the company created for developing and running open-source pipelines and workflows, Tijanic told GenomeWeb. The system lets users wrap their tools and combine them into CWL-compliant workflows using a pipeline editor tool created for the purpose, he said.

Seven Bridges is also working on a tool that will pull text from the help documents provided with tools and convert it into the CWL, helping to at least partially automate the process of generating descriptions, Tijanic said. There's also an option to wrap and share pipelines using Docker containers within Rabix.

The CWL working group will meet again at this year's Codefest, and Crusoe will give a talk on the language during BOSC. Both events will be held prior to this year's ISMB conference later this month in Dublin.

"We are really excited to be coming back to BOSC [and] holding our conformance tests for Draft 2," Crusoe said. "Past experience [tells me] we'll learn a lot about where the standard does and does not hold up and we'll really mature it."

The group is already considering improvements, such as providing language for users to add extensible features or extra data or semantic markup to their descriptions, he said. They are also open to suggestions and ideas from the broader community and invite collaborators to connect via the group mailing list or its GitHub page.

In the meantime, "We are trying to do our due diligence and learn [from] what has come before us," Crusoe said. "If we skipped some great technology or description or other language that we should have known from [and] it solves all our problems, I'd be totally happy to tell the world, 'ignore us and go use this other thing.' This is what we think works right now, but we are not beholden to this."

So far, the group has not received any funding for its efforts, and it's not entirely clear if it will seek some form of financial support. It's a tricky question, Crusoe said. With projects such as these, "you want things to be only as big and complicated as they need to be and no larger," he said. "We'd rather err on the side of being a little bit too small than being overly complicated."