9 March 2021
Federated analysis of data distributed across the world can make genomics research more powerful by connecting multiple large-scale datasets for simultaneous analysis.
Such investigations utilize complex methods, such as aligning multiple sequences to the human reference genome to identify potentially pathogenic variants. These analyses often involve up to hundreds of thousands of computational tasks, which can take considerable time and compute power to execute. Recently approved by the GA4GH Standards Steering Committee, the Task Execution Service (TES) API v1 provides a standard mechanism for orchestrating these complex analyses across different compute environments.
To support large-scale federated analysis, institutions and organizations employ queueing systems that send tasks out to high-performance computing (HPC) systems or cloud environments. But each compute system is unique, and each cloud vendor uses incompatible APIs for running batch tasks. Because of these discrepancies, researchers carrying out federated analysis must write and maintain separate code for each environment.
The TES API adds to the suite of standards produced by the GA4GH Cloud Work Stream, whose mission is to help the genomics and health community take full advantage of modern cloud environments by bringing algorithms to data that cannot be moved due to various regulatory limitations.
“By building analysis software with the TES API, researchers can quickly move from a university cluster, to Amazon, to Microsoft Azure, without changing their code,” said Kyle Ellrott, Assistant Professor of Computational Biology at Oregon Health & Science University and co-lead of the TES API development team. “With the TES API, moving large-scale batch computing between private computers and the cloud becomes seamless.”
On the backend, the TES API wraps around an institution’s HPC system or cloud environment, and then manages the deployment, scheduling, running, and clean-up of tasks while providing status updates and logging information back to the researcher.
For example, if a researcher is running genomic analysis pipelines, they may send out a thousand task requests, usually with the help of a workflow engine. The workflow engine, which may be custom-made or from an existing software project, needs a way to talk to the local compute resources. The TES server accepts the requests, communicates with the local job queueing system, and tracks progress and output. This is all done through a single API that looks the same no matter what infrastructure manages the computational resources. Thus, the TES API provides a flexible and standardized way to connect complex workflow engines to new compute systems, saving time and resources.
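The submit-and-track exchange described above can be illustrated with a minimal Python sketch. It is not an official client. The `/v1/tasks` path, the `view=MINIMAL` query parameter, the task fields (`executors`, `resources`) and the state names reflect the TES v1 schema as understood here and should be checked against the published specification. The helper names (`build_task`, `http_json`, `run_task`) and any server URL are hypothetical.

```python
import json
import time
import urllib.request

# States after which a task will not change again (assumed from the TES v1 schema).
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def build_task(name, image, command, cpu_cores=1, ram_gb=1.0):
    """Return a task document shaped like a TES v1 task message."""
    return {
        "name": name,
        "executors": [{"image": image, "command": list(command)}],
        "resources": {"cpu_cores": cpu_cores, "ram_gb": ram_gb},
    }


def http_json(method, url, body=None):
    """Send a JSON request with the standard library and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_task(base_url, task, send=http_json, poll_seconds=5.0, sleep=time.sleep):
    """Submit one task, then poll its state until it reaches a terminal state."""
    task_id = send("POST", f"{base_url}/v1/tasks", task)["id"]
    while True:
        reply = send("GET", f"{base_url}/v1/tasks/{task_id}?view=MINIMAL")
        if reply["state"] in TERMINAL_STATES:
            return task_id, reply["state"]
        sleep(poll_seconds)
```

A workflow engine would call `run_task` once per step, pointing `base_url` at whichever TES server fronts the local cluster or cloud account. The `send` and `sleep` parameters exist only so the transport can be swapped out, for example in tests.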
Furthermore, the TES API can help extend systems that provide the Workflow Execution Service (WES) API, another GA4GH Cloud standard. While the WES API orchestrates a series of steps in a workflow, the TES API can connect the workflow to a compute backend to execute specific steps. So when a researcher takes their WES-enabled workflow engine to a new computational environment, they can plug into the local TES API without having to write new adaptors.
“This concept of pluggable compute backends is key to the TES API,” said Ania Niewielska, Lead Software Engineer at EMBL-EBI and co-lead of the TES API development team. “Since many existing workflow engines have already implemented the TES API, adding support for a new compute backend, such as a new cloud provider, can be achieved through a single TES implementation—instead of writing separate implementations for each workflow engine. Additionally, TES backends can be implemented in the technology of choice, independent from the tech stack used for the workflow engines.”
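The idea of pluggable backends can be shown with a toy Python sketch. It does not use the TES API itself: `ComputeBackend`, the two backend classes and `run_pipeline` are all hypothetical names, and the shared `submit` method merely stands in for the common interface a TES server would expose. The point is that the engine-side function is written once and works unchanged with either backend.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class ComputeBackend(ABC):
    """Hypothetical stand-in for the common interface a TES server exposes."""

    @abstractmethod
    def submit(self, image: str, command: Sequence[str]) -> str:
        """Queue one containerized command and return a task id."""


class LocalClusterBackend(ComputeBackend):
    """Pretends to hand tasks to an on-site job queue."""

    def __init__(self) -> None:
        self.queue: List[Tuple[str, List[str]]] = []

    def submit(self, image: str, command: Sequence[str]) -> str:
        self.queue.append((image, list(command)))
        return f"cluster-{len(self.queue)}"


class CloudBatchBackend(ComputeBackend):
    """Pretends to hand tasks to a cloud vendor's batch service."""

    def __init__(self) -> None:
        self.jobs: List[Tuple[str, List[str]]] = []

    def submit(self, image: str, command: Sequence[str]) -> str:
        self.jobs.append((image, list(command)))
        return f"cloud-{len(self.jobs)}"


def run_pipeline(backend: ComputeBackend, steps) -> List[str]:
    """Engine-side code: identical no matter which backend is plugged in."""
    return [backend.submit(image, command) for image, command in steps]
```

Supporting a new environment in this sketch means adding one more `ComputeBackend` subclass, while `run_pipeline` and any other engine written against the interface stay untouched.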
“The European life science infrastructure is very fragmented,” said Alexander Kanitz, co-lead of the ELIXIR Cloud project, a GA4GH Driver Project. “The TES API offers a means to abstract over different compute backends in an effort to federate the execution of computational workflows across the various nodes, from hospitals to research centers. This is one of the key reasons why we chose to implement the TES API.”
Joris Vankerschaver, Manager of Strategic Technologies and Life Science Solutions at Enthought, said, “The TES API allows us to ‘code against TES’ rather than against a particular environment. This is important when working with clients who may have a mixture of on-premise servers and cloud resources, or who are interested in moving to the cloud.”
The TES API was also designed with real-world constraints in mind. “The complexity of moving health-generated data for secondary research purposes is immense, due to patient privacy, security, and legal considerations,” said Leslie Glass, Project Manager at EMBL-EBI where she leads the CINECA project. “We chose to implement the TES API to help ensure that these data do not become siloed and inaccessible for research.”
Many workflow engines, including Cromwell, Nextflow, and Snakemake, have already begun to support the TES API. In the future, the team plans to expand support for the API and to focus on compatibility with other GA4GH standards, including the Data Repository Service (DRS) API and the GA4GH Passports Specification to manage authentication and authorization.