GA4GH 2020-2021 Roadmap

Cloud Work Stream

Product Roadmap

Motivation and Mandate

The GA4GH Cloud Work Stream (CWS) helps the genomics and health communities take full advantage of modern cloud environments. Its initial focus is on ‘bringing the algorithms to the data,’ by creating standards for defining, sharing, and executing portable workflows. Standards under discussion include workflow definition languages, tool encapsulation, cloud-based task and workflow execution, and cloud-agnostic abstraction of secure data access.

Existing Standards

The CWS will build heavily on Docker for packaging of executables, and on existing text-based orchestration languages such as CWL and WDL for stitching those executables together.

Proposed Solution

The CWS will work with a variety of GA4GH Driver Projects including the NIH Genomic Data Commons, Genomics England, and other large-scale data processing efforts. These Driver Projects provide clear use cases for new standards, and deployment environments for specific implementations. Through this collaboration, the Driver Projects will be able to use these standards to share tools, workflows, and data more effectively with the larger community. The standards we advance will address the following needs:

  • Defining portable workflows: tool builders need to be able to package their tools for reuse. The CWS will build on existing standards to allow workflows built by one researcher to be used by many others.
  • Sharing portable workflows: tool builders need to be able to offer their tools for others to use, and tool consumers need to be able to discover the tools they need. The CWS will support app-store-like functionality, including support for controlling access if tool builders choose.
  • Executing portable workflows: once a tool consumer has selected a tool, they need to be able to run it in their preferred compute environment, pointing at input and output data in their preferred storage environment. The CWS will define execution standards that will be easy for developers of existing workflow runners (e.g. Toil, Cromwell, Rabix) to support.

To ensure these standards meet the needs of GA4GH Driver Projects, the CWS will build a workflow portability testbed environment. Driver Projects will contribute one or two packaged workflows they care about, together with test input data and an output verifier. The CWS will then ask each Driver Project to run an instance of the testbed in their local environment, contributing back patches to make it portable if needed, and to run all of the test workflows in all of their environments. Success will demonstrate the real-world usability and utility of Cloud Work Stream standards.
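The testbed loop described above can be sketched roughly as follows. This is a minimal illustration, not the CWS testbed itself: `PackagedWorkflow`, `checksum_verifier`, `run_testbed`, and the toy runners are hypothetical names, and the verifier assumes that a workflow's output can be checked by comparing a SHA-256 digest.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class PackagedWorkflow:
    """A workflow contributed by a Driver Project (hypothetical shape)."""
    name: str
    test_input: bytes
    verify: Callable[[bytes], bool]  # output verifier supplied by the contributor


def checksum_verifier(expected_sha256: str) -> Callable[[bytes], bool]:
    """Build a verifier that accepts output whose SHA-256 digest matches."""
    def verify(output: bytes) -> bool:
        return hashlib.sha256(output).hexdigest() == expected_sha256
    return verify


# A runner stands in for one Driver Project environment: it takes the
# workflow's test input and returns the bytes the workflow produced there.
Runner = Callable[[bytes], bytes]


def run_testbed(
    workflows: List[PackagedWorkflow], environments: Dict[str, Runner]
) -> Dict[Tuple[str, str], bool]:
    """Run every test workflow in every environment and record pass/fail."""
    results: Dict[Tuple[str, str], bool] = {}
    for env_name, runner in environments.items():
        for wf in workflows:
            results[(env_name, wf.name)] = wf.verify(runner(wf.test_input))
    return results
```

In this sketch a workflow counts as portable only if its verifier passes in every environment; any `False` entry points at the environment that needs a patch.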

Planned Deliverables

Cloud Testbed 

  • Type: Technical Toolkit
  • Expected Submission Date: Q3 2021
  • Requesting Driver Projects: {add}

GA4GH web services require compliance testing to ensure they fully conform with GA4GH API specifications. The GA4GH Testbed infrastructure will serve as a common platform to configure, schedule, and launch compliance tests against any web service implementing a GA4GH API. Test results will be loaded into a reporting service, which will display which test scenarios each implementation passed or failed. The testbed infrastructure will cover not only compliance tests but also system performance benchmarks and end-to-end tests that simulate researcher use cases, such as those of the Federated Analysis Systems Project (FASP). The testbed will provide a comprehensive picture of how well our specifications are being implemented and identify when certain features are not implemented correctly. This will drive uptake by making it easier for developers to debug, maintain, and update their deployments, all while ensuring they remain conformant with GA4GH standards.
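As a rough illustration of a compliance test and its report, the sketch below checks recorded JSON responses against a list of required fields and records a result per implementation and scenario. The function names, scenario names, and required fields are invented for the example and are not taken from any GA4GH specification.

```python
from typing import Dict, Iterable, List


def missing_fields(response: dict, required: Iterable[str]) -> List[str]:
    """Return the required top-level fields absent from a JSON response."""
    return [field for field in required if field not in response]


def compliance_report(
    responses: Dict[str, Dict[str, dict]], scenarios: Dict[str, List[str]]
) -> Dict[str, Dict[str, str]]:
    """Map implementation -> scenario -> "PASS" or a "FAIL: ..." message.

    `responses[impl][scenario]` is the JSON body an implementation returned
    for that scenario; a scenario with no recorded response fails.
    """
    report: Dict[str, Dict[str, str]] = {}
    for impl, by_scenario in responses.items():
        report[impl] = {}
        for scenario, required in scenarios.items():
            body = by_scenario.get(scenario)
            if body is None:
                report[impl][scenario] = "FAIL: no response"
                continue
            missing = missing_fields(body, required)
            report[impl][scenario] = (
                "PASS" if not missing else "FAIL: missing " + ", ".join(missing)
            )
    return report
```

A real compliance suite would also validate types, status codes, and error behavior; the point here is only the shape of a per-implementation, per-scenario report.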

Task Execution Service API

  • Type: API
  • V1 Expected Submission Date: Q2 2021
  • Requesting Driver Projects: ELIXIR Cloud, GeL, AGHA

Every compute environment has a different API for the batch execution of tasks. For example, each of the three major cloud vendors provides this service, but through a completely different API. By providing a common interface that abstracts over their differences, the Task Execution Service (TES) API lets compute engines move quickly from one compute system to the next.
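The idea of a common interface over differing batch APIs can be sketched as below. This is not the TES schema or any vendor's API: `Task`, `TaskBackend`, `InMemoryBackend`, and `run_everywhere` are hypothetical names, and the in-memory backend merely stands in for a vendor-specific adapter.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Task:
    """Backend-neutral description of one batch task (hypothetical shape)."""
    image: str           # container image that packages the tool
    command: List[str]   # command to run inside the container


class TaskBackend(Protocol):
    """The common interface every compute backend adapter implements."""
    def submit(self, task: Task) -> str: ...
    def state(self, task_id: str) -> str: ...


class InMemoryBackend:
    """Toy backend standing in for one vendor's batch API."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.tasks: Dict[str, Task] = {}

    def submit(self, task: Task) -> str:
        # Assign a backend-specific ID and remember the task.
        task_id = f"{self.prefix}-{len(self.tasks) + 1}"
        self.tasks[task_id] = task
        return task_id

    def state(self, task_id: str) -> str:
        return "QUEUED" if task_id in self.tasks else "UNKNOWN"


def run_everywhere(task: Task, backends: Dict[str, TaskBackend]) -> Dict[str, str]:
    """Submit the same task to each backend through the shared interface."""
    return {name: backend.submit(task) for name, backend in backends.items()}
```

Because the caller depends only on `TaskBackend`, switching compute systems means swapping the adapter rather than rewriting the engine that submits tasks.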

Approved Deliverables 

Data Repository Service API

  • Type: API
  • V1 Approval Date: 2019
  • Known V1 Implementations and Deployments: Cavatica, iRODS Consortium, Terra, Cancer Genomics Cloud (CGC)

The Data Repository Service (DRS) API is a standard for building data repositories and adapting access tools to work with those repositories. It works with other approved APIs from the GA4GH Cloud Work Stream to allow researchers to discover algorithms across different cloud environments and send them to the datasets they wish to analyze. The API allows data consumers to access datasets regardless of the repository in which they are stored or managed.
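A minimal sketch of repository-independent access might look like the following. The helper names, the `drs.example.org` host, and the sample object are invented for illustration; the `/ga4gh/drs/v1/objects/` path and the `access_methods` / `access_url` field names reflect one reading of the DRS v1 schema and should be checked against the specification before use.

```python
from typing import Optional, Sequence
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Translate drs://<host>/<id> into the HTTPS URL of the object record."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


def pick_access_url(drs_object: dict, preferred: Sequence[str]) -> Optional[str]:
    """Return an inline access URL, trying access types in preference order.

    Returns None when no listed access method of a preferred type carries
    an inline URL.
    """
    methods = drs_object.get("access_methods", [])
    for wanted in preferred:
        for method in methods:
            url = method.get("access_url", {}).get("url")
            if method.get("type") == wanted and url:
                return url
    return None
```

The consumer states which transports it can handle (for example `["https", "s3"]`) and never needs to know which repository holds the bytes.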

Tool Registry Service API

  • Type: API
  • V1 Approval Date: 2019
  • Known V1 Implementations and Deployments: DNAstack, Terra

The Tool Registry Service (TRS) is a standard API for exchanging tools and workflows to analyze, read, and manipulate genomic data. The TRS API is one of a series of technical standards from the Cloud Work Stream that together allow genomics researchers to bring algorithms to datasets in disparate cloud environments, rather than moving data around. TRS gives researchers access to far more tools than they can presently use, and allows developers to register their products so that they are visible on a multitude of sites, expanding their audience reach. The API also provides a set of requirements for tool and workflow registries to implement TRS.
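To make the exchange concrete, the sketch below builds the URL a client might use to fetch a workflow descriptor from a registry. The function name, the registry host, and the tool ID are hypothetical, and the `/tools/{id}/versions/{version_id}/{type}/descriptor` layout is an assumption based on one reading of the TRS API that should be confirmed against the specification. Tool IDs can contain `/` and `#`, so they are percent-encoded.

```python
from urllib.parse import quote


def descriptor_url(
    base_url: str, tool_id: str, version_id: str, descriptor_type: str
) -> str:
    """Build the descriptor URL for one version of a registered tool.

    `base_url` is the registry's API root, including any version prefix.
    """
    tool = quote(tool_id, safe="")        # encode "/" and "#" inside the ID
    version = quote(version_id, safe="")
    root = base_url.rstrip("/")
    return f"{root}/tools/{tool}/versions/{version}/{descriptor_type}/descriptor"
```

For example, `descriptor_url("https://registry.example.org/ga4gh/trs/v2", "#workflow/github.com/org/repo", "1.0.0", "WDL")` yields a single path segment for the tool ID, so the same client code can point at any registry that follows this layout.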

Workflow Execution Service API

  • Type: API
  • V1 Approval Date: 2018
  • Known V1 Implementations and Deployments: Broad Institute, TOPMed, Human Cell Atlas, All of Us, Australian Genomics, Genomics England, ELIXIR

Portable tools, which can execute a single analysis in a variety of environments, allow researchers to work with more data from more sources, and tool builders to support more researchers and more use cases. The Workflow Execution Service (WES) API provides a standard for exactly that. This API lets users run a single workflow (defined using CWL or WDL) on multiple different platforms, clouds, and environments, and be confident that it will work the same way. The API provides methods to request that a workflow be run, pass parameters to that workflow, get information about running workflows, and cancel a running workflow.
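The four operations listed above can be sketched as plain request descriptions, without sending anything over the network. The helper names and the `wes.example.org` host are hypothetical; the `/runs` paths and form field names (`workflow_url`, `workflow_type`, `workflow_type_version`, `workflow_params`) follow one reading of the WES v1 API and should be verified against the specification.

```python
import json
from typing import Dict, Tuple


def run_request_form(
    workflow_url: str, workflow_type: str, workflow_type_version: str, params: dict
) -> Dict[str, str]:
    """Form fields for a run request; parameters travel as a JSON string."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,                  # e.g. "CWL" or "WDL"
        "workflow_type_version": workflow_type_version,
        "workflow_params": json.dumps(params, sort_keys=True),
    }


def wes_calls(base_url: str, run_id: str) -> Dict[str, Tuple[str, str]]:
    """Map each operation to the (HTTP method, URL) pair it would use."""
    base = base_url.rstrip("/")
    return {
        "run_workflow": ("POST", f"{base}/runs"),
        "get_run": ("GET", f"{base}/runs/{run_id}"),
        "get_status": ("GET", f"{base}/runs/{run_id}/status"),
        "cancel_run": ("POST", f"{base}/runs/{run_id}/cancel"),
    }
```

Only `base_url` changes between platforms in this sketch, which is the sense in which the same workflow request can be pointed at different clouds and environments.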