17 December 2019
The GA4GH Steering Committee has approved the Data Repository Service (DRS) API, a standardized set of access methods that are agnostic to cloud infrastructure. The DRS API completes the suite of approved APIs from the GA4GH Cloud Work Stream, which work together to allow researchers to discover algorithms across different cloud environments and send them to datasets they wish to analyze.
Currently, the process for retrieving data from a repository is complex and inefficient. Repositories have become crowded with data files. To analyze remote genomic data, consumers must first retrieve files using the available “access tools,” which port data to an environment suitable for conducting analyses. However, the access tools available to consumers are not guaranteed to work with the data they want, so a desired dataset may never reach its intended recipient or be used as input for downstream analyses.
“With this process, data providers drop their data into a sort of ‘ocean of files’ that float around aimlessly,” said Cloud Work Stream Co-Lead, Brian O’Connor, who is Director of the Computational Genomics Laboratory at the University of California, Santa Cruz.
The DRS API addresses this problem, allowing consumers to access data regardless of the underlying architecture of the repository in which it is stored. DRS gives a dataset a unique ID mapping to one or more access methods for retrieving it. Since DRS clients can access data from any DRS web service, the API reduces the need to create new access tools.
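As a rough illustration of the ID-to-access-method mapping described above, the following Python sketch works through a single DRS object record. It is not taken from any of the implementations named in this announcement. The host `drs.example.org`, the object ID, the bucket name, and the helper functions are hypothetical. The field names (`access_methods`, `access_url`, `access_id`) and the `/ga4gh/drs/v1/objects/{object_id}` path are assumed to follow the DRS v1 specification, and no network request is made.

```python
from urllib.parse import urlparse

# Hypothetical record, shaped like the JSON a DRS v1 server is assumed to
# return for GET /ga4gh/drs/v1/objects/{object_id}. All values are invented.
SAMPLE_OBJECT = {
    "id": "example-object-1",
    "self_uri": "drs://drs.example.org/example-object-1",
    "size": 1024,
    "access_methods": [
        # An access method may carry a directly usable URL ...
        {"type": "s3", "access_url": {"url": "s3://example-bucket/sample.bam"}},
        # ... or an access_id that must be exchanged for a URL in a second call.
        {"type": "https", "access_id": "https-1"},
    ],
}


def object_endpoint(drs_uri):
    """Translate a hostname-based drs:// URI into the HTTPS object endpoint."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri}")
    object_id = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


def choose_access_method(drs_object, preferred_types):
    """Return the first access method matching the caller's preference order."""
    for wanted in preferred_types:
        for method in drs_object.get("access_methods", []):
            if method.get("type") == wanted:
                return method
    return None


def fetch_location(method, endpoint):
    """Return either the direct URL or the endpoint that would issue one."""
    if "access_url" in method:
        return method["access_url"]["url"]
    return f"{endpoint}/access/{method['access_id']}"


if __name__ == "__main__":
    endpoint = object_endpoint(SAMPLE_OBJECT["self_uri"])
    method = choose_access_method(SAMPLE_OBJECT, ["https", "s3"])
    print(fetch_location(method, endpoint))
```

In this sketch the client only states which transport it prefers. The same three helper functions would apply unchanged to a record from any other repository exposing the same interface, which is the sense in which a single client can replace repository-specific access tools.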
“DRS provides a generic interface for data repositories that enables access to data in a single, standard way, so the ocean of data files becomes an organized file cabinet,” said Cloud Work Stream Co-Lead, David Glazer, who is the Engineering Director at Verily.
The API has been implemented in the Broad Institute Terra Data Repository, Seven Bridges Cancer Genomics Cloud, Seven Bridges Cavatica, and the iRODS Consortium—cloud platforms where researchers can access, share, and analyze biomedical data. In the coming months, the Cloud Work Stream plans to generate a DRS registry on its GitHub workspace, and improve the integration features between DRS and the existing Workflow Execution Service (WES) API. This would allow users to first access a dataset via DRS, then use it as input for a workflow.
Overall, the DRS API serves as a bridge between access tools and otherwise siloed data, saving data consumers time and effort. The Cloud Work Stream suite of APIs works together to ease the process of sharing and accessing data and to increase the volume of data that can be analyzed.
“We developed these technical standards that make it easier for researchers to use and reuse data that has already been collected,” said Glazer. “In doing so, the research community will be less wasteful with the deluge of genomic data that exists today, as we will finally be able to harness formerly inaccessible datasets through interoperable access tools and platforms.”