26 October 2018
The GA4GH Cloud Work Stream has announced version 1 of its Workflow Execution Service (WES) API — a protocol for running the same genomic data analysis in multiple cloud environments. The announcement was made at the GA4GH 6th Plenary Meeting in Basel, Switzerland earlier this month.
“WES enables users to define workflows in a standard way, package them up, and then hand them to workflow engines that live in many different places,” said David Glazer, Engineering Director at Verily Life Sciences and co-lead of the Cloud Work Stream. “You should be able to run your workflow wherever you want on whatever data you have and be confident you’ll get the same answer — WES allows you to do just that.”
WES is part of a larger framework to seamlessly bring algorithms to genomic data rather than attempting to transfer that data across national and institutional boundaries, which is cumbersome, resource-intensive, and limited by regulatory constraints. “Our task as a work stream is to come up with the standards to facilitate the definition of these portable workflows, the sharing of them, and the execution of them,” said Glazer.
Taken as a whole, this framework will have positive impacts for both tool developers and tool users — developers will only need to package their tools once to make them available to the broad community, and researchers will have ready access to more tools as well as the ability to run compatible analyses across data in many places. WES addresses that last step in the framework — the execution of portable workflows in a preferred compute environment using data in a preferred storage environment.
In addition to developing WES, the Cloud Work Stream has spent the twelve months since its founding building an interoperability testbed to show that its framework works. “We want not only to demonstrate that the workflows are interoperable, but also to confirm they are actually successful,” said Brian O’Connor, Consulting Director at the UCSC Computational Genomics Platform and co-lead of the Cloud Work Stream.
The testbed depends on an “orchestrator” that sits between a library of workflows and a series of cloud environments where those workflows can be run. In this role, it is responsible for directly testing workflows in different WES environments. It selects a workflow and then selects a cloud environment that has implemented WES, where the analysis will be run. The orchestrator monitors the workflow as it runs and reports details that the Work Stream uses to confirm WES is working in each environment.
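The monitoring role described above can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the Work Stream's orchestrator: `monitor_run` and `get_state` are hypothetical names, the environment is simulated with a scripted list of states rather than a real WES endpoint, and the terminal state names are given as recalled from the WES specification and should be checked against it.

```python
from typing import Callable, List

# Terminal run states as recalled from the WES v1 specification (assumption);
# verify against the specification before relying on this set.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def monitor_run(get_state: Callable[[], str], max_polls: int = 100) -> List[str]:
    """Poll a run until it reaches a terminal state.

    Returns the sequence of state changes in the order they were observed.
    `get_state` is a hypothetical callable standing in for a status request
    to one WES environment.
    """
    observed: List[str] = []
    for _ in range(max_polls):
        state = get_state()
        if not observed or observed[-1] != state:
            observed.append(state)  # record only changes of state
        if state in TERMINAL_STATES:
            return observed
    raise TimeoutError(f"run did not finish within {max_polls} polls")


# Simulated environment: each call returns the next scripted state.
_scripted = iter(["QUEUED", "RUNNING", "RUNNING", "COMPLETE"])
history = monitor_run(lambda: next(_scripted))
print(history)
```

A real orchestrator would wait between polls and issue the status request over HTTP; both are left out here to keep the sketch self-contained.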
To date, four organizations have made their implementations of WES available to the testbed: Veritas Genetics (running on the Microsoft Azure cloud), the Human Cell Atlas (running on the Amazon Web Services cloud), the Broad Institute of MIT and Harvard (running on the Google Cloud Platform), and Illumina. The interoperability testbed shows that WES is working in all four cases by running test workflows from three GA4GH Driver Projects and collaborators (Human Cell Atlas, Australian Genomics, and TOPMed), along with another contributed by the PCAWG project. These workflows were paired with test data for which the results of the analysis are already known, making it easy to verify workflow portability across sites.
“I can say, ‘Here’s a test set of inputs to run the workflow on; if successful, you should get these outputs’,” said O’Connor. “The orchestrator runs every test against every environment and we can see how well each combination fares.”
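The "every test against every environment" approach O’Connor describes amounts to a pass/fail matrix over workflow and environment pairs. The Python sketch below shows that idea under stated assumptions: `run_matrix`, the two stand-in environments, and the toy workflow are all hypothetical, and the environments are plain functions rather than real WES implementations.

```python
from itertools import product
from typing import Callable, Dict, Tuple

TestCase = Tuple[dict, dict]          # (inputs, expected outputs)
Runner = Callable[[str, dict], dict]  # hypothetical: runs a workflow, returns outputs


def run_matrix(
    tests: Dict[str, TestCase], environments: Dict[str, Runner]
) -> Dict[Tuple[str, str], bool]:
    """Run every test workflow in every environment and compare with known outputs."""
    results: Dict[Tuple[str, str], bool] = {}
    for (wf, (inputs, expected)), (env, run) in product(
        tests.items(), environments.items()
    ):
        try:
            results[(wf, env)] = run(wf, inputs) == expected
        except Exception:
            # A crash in an environment counts as a failed combination.
            results[(wf, env)] = False
    return results


# Hypothetical stand-ins for two environments: one correct, one faulty.
def good_env(workflow: str, inputs: dict) -> dict:
    return {"total": sum(inputs["values"])}


def faulty_env(workflow: str, inputs: dict) -> dict:
    return {"total": sum(inputs["values"]) + 1}


tests = {"sum-workflow": ({"values": [1, 2, 3]}, {"total": 6})}
environments = {"env-a": good_env, "env-b": faulty_env}

matrix = run_matrix(tests, environments)
for (wf, env), passed in sorted(matrix.items()):
    print(f"{wf} on {env}: {'PASS' if passed else 'FAIL'}")
```

Because the expected outputs are fixed in advance, a failing cell points at a specific workflow and environment combination rather than at the workflow alone.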
Over the coming year, the Cloud Work Stream will work to publish standards that complement WES in making it possible to run any tool on any data: an API for selecting tools (Tool Registry Service, or TRS) and an API for accessing data across multiple clouds (Data Repository Service, or DRS).