Data Connect: a come-as-you-are approach to data sharing

10 Jan 2022

The GA4GH Standards Steering Committee approved the Data Connect API, a standard to support federated search of disparate datasets.

No one project, organization, or even country has the volume or diversity of data required to understand or treat many complex diseases. However, the healthcare and biomedical research ecosystems are powered by a large number of data generators who store their data in many locations, formats, and structures, depending on their intended purpose, workflows, and analyses. 

“Biomedical data covers so many disciplines, technologies and diseases, that even as we standardize in one area, there remains huge diversity between fields,” said Ian Fore, Senior Biomedical Informatics Program Manager at the National Cancer Institute’s Center for Biomedical Informatics & Information Technology and a member of the Data Connect Product Review Committee. “Addressing a complex scientific problem may require working with data from an unfamiliar field of science. Data Connect’s ability to describe unfamiliar data helps scientists bring relevant data into their analysis.”

The independent production of biomedical data without specific guidelines and procedures has resulted in inconsistent and incomparable datasets, which need to be engineered to work together.

“Researchers who need to combine data from different systems are faced with two challenges: lack of data harmonization and technical differences in how to pull data from each system,” said Jonathan Fuerth, Director of Engineering at DNAstack and a contributor on the Data Connect development team. “Data custodians could work together to transform the data into a common model, but this can take significant time and resources. As a parallel option, if people who share data can describe the concepts already in their data, this information will be more accessible to a broader community and have great impact.” The broader community also brings more hands to help with harmonization.

Enter the new Data Connect API. Developed by the GA4GH Discovery Work Stream, it provides a simple and flexible mechanism to connect researchers to information about otherwise disparate datasets, and it adds to the group of standards that facilitate the discovery and use of varied data sources and services. The Data Connect API keeps a lack of harmonization from being a barrier to data sharing. Data Connect works best when data are harmonized to specific terminologies, models, or data elements. Where harmonization is absent, Data Connect enables description of the data “as is” in a way that makes it easier for a scientist to apply it to their specific purpose.

To see what Data Connect does, take the example of an autism researcher looking to better understand the genetic underpinnings of a phenotypic trait, such as repetitive behavior, across three different datasets. For the purposes of this example, we will assume each dataset uses a different data type and format: one may be in a relational database such as PostgreSQL, another behind a proprietary API like Google Sheets, and a third in a structured data file such as an EMR extract. Each of these datasets contains similar data: genotypic, phenotypic, and physical attributes. The Data Connect API allows researchers looking for relationships across these datasets to find individuals relevant to their research question and co-analyze the data.

As Dean Hartley, Senior Director of Discovery and Translational Science at Autism Speaks, said, “Most autism data resides on many independent servers and there isn’t a good way for researchers to combine these data, let alone find this data. Data Connect is a critical part for researchers to search for the data and combine them to get the statistical power they need, or at least the preliminary data, to support their hypothesis. I believe that the Data Connect API will help fulfill this need and support their discovery.”

How it Works

Data Connect consists of two components. The first piece is a common JSON schema for describing and organizing data and its underlying model, allowing data providers to describe the form of their data. Many standards prescribe data models, meaning the data providers need to transform their data to conform to the model. The Data Connect API, however, does not prescribe a data model; it asks data providers to describe their data using the common JSON schema. “This allows data providers to expose more of their data in the way it currently sits, without needing to go through a heavy lift to harmonize all their data,” said Aaron Kemp, Senior Staff Engineer at Verily and co-lead of the Data Connect development team.
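As a rough illustration of the first component, the sketch below shows what a table description with a JSON Schema data model might look like, plus a small helper that checks rows against it. The table name, the field names, and the `name` / `description` / `data_model` layout are assumptions made for this example, not details taken from the article, and the checker covers only a few primitive types.

```python
# Hypothetical table description; the table name and every field are
# invented for illustration and do not come from a real dataset.
TABLE_INFO = {
    "name": "autism_cohort.participants",
    "description": "Participant phenotypes, described as they are stored",
    "data_model": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "participant_id": {"type": "string", "description": "Local study identifier"},
            "age_at_enrollment": {"type": "integer", "description": "Age in years"},
            "repetitive_behavior_score": {"type": "number", "description": "Site-specific scale"},
        },
        "required": ["participant_id"],
    },
}

# Minimal mapping from JSON Schema primitive types to Python types.
JSON_TO_PYTHON = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def value_matches(value, json_type):
    """Check one value against one JSON Schema primitive type name."""
    if json_type not in JSON_TO_PYTHON:
        return True  # types this sketch does not know about are not checked
    if json_type in ("integer", "number") and isinstance(value, bool):
        return False  # bool is a subclass of int in Python; exclude it explicitly
    return isinstance(value, JSON_TO_PYTHON[json_type])


def check_row(row, data_model):
    """Return a list of mismatches between a row and the described data model."""
    problems = []
    properties = data_model.get("properties", {})
    for field in data_model.get("required", []):
        if field not in row:
            problems.append(f"missing required field: {field}")
    for field, value in row.items():
        if field not in properties:
            problems.append(f"field not described: {field}")
        elif not value_matches(value, properties[field].get("type")):
            problems.append(f"unexpected type for field: {field}")
    return problems


print(check_row({"participant_id": "P001", "age_at_enrollment": 7}, TABLE_INFO["data_model"]))
```

The point of the sketch is that the provider keeps its own column names and simply publishes a description of them; a consumer can then read that description to decide how to interpret each field.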

The second component of Data Connect is a method for writing flexible, custom queries that allow users to learn more about a dataset of interest in a standardized, actionable format. This aspect of the API allows a researcher to ask a variety of questions. If they want to study a cohort, they could ask, “find me samples that have a specific mutation in this specific gene, that have been diagnosed with this specific condition.” If the data is more file-oriented, they could ask, “find me data objects that represent VCF files and are related to my study.” One could even ask, “find me nodes that I can execute this task on.”
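The cohort question above could be expressed as a query request along the following lines. This is only a sketch under stated assumptions: it supposes an SQL-style query string with positional `?` placeholders sent as a JSON body with `query` and `parameters` keys, and the table name, column names, and example values are hypothetical.

```python
import json


def build_search_request(gene, condition):
    """Build a request body for a cohort question about a gene and a diagnosis.

    Assumptions (not stated in this article): the query is SQL-like, values
    are passed as positional parameters matching `?` placeholders, and the
    table and column names below are hypothetical.
    """
    query = (
        "SELECT sample_id, gene_symbol, diagnosis "
        "FROM example_catalog.cohort.samples "
        "WHERE gene_symbol = ? AND diagnosis = ?"
    )
    return {"query": query, "parameters": [gene, condition]}


# Example values chosen purely for illustration.
body = build_search_request("SHANK3", "autism spectrum disorder")
print(json.dumps(body, indent=2))
```

Keeping values in a separate parameter list, rather than pasting them into the query text, avoids quoting mistakes when the same question is reused with different genes or conditions.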

The two components of Data Connect are independent of one another, and one can implement the first without the second. “Regardless of whether you want to add support for the query component, just exposing a Data Connect endpoint is useful to describe how your data is stored and what they look like, and enable a researcher to retrieve them,” said Kemp.

“Data Connect allows data custodians to describe the semantics of their data without prescribing a particular model, and data consumers to interrogate it through complex questions. This makes it uniquely suited to be a way to access and combine data. Thanks to this standard, a client application using multiple types of data from multiple systems only needs to be able to talk to one API,” said Miro Cupak, Co-founder and Chief Technology Officer at DNAstack and co-lead of the Data Connect development team. “If I want to pull up information such as subject ID, date of birth, and cancer type for all patients in a Phenopacket collection and all patients in a separate FHIR collection, I can do that. Data Connect opens up an ecosystem of tools for performing simple, reusable computations on various datasets, enabling federated search and analysis in a way not previously possible.”
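The idea of one client pulling the same three attributes from two differently shaped collections can be sketched as follows. Everything here is illustrative: the URLs, table names, rows, and column mappings are invented, the canned dictionary stands in for a real server, and the `pagination` / `next_page_url` response shape is an assumption rather than something stated in this article.

```python
# Canned responses standing in for a server; URLs, table names, and rows are
# all hypothetical. The "pagination"/"next_page_url" shape is an assumption.
BASE = "https://data.example.org/table"
FAKE_RESPONSES = {
    f"{BASE}/phenopackets.patients/data": {
        "data": [{"subject_id": "PP-1", "date_of_birth": "1970-01-01", "cancer_type": "melanoma"}],
        "pagination": {"next_page_url": f"{BASE}/phenopackets.patients/data?page=2"},
    },
    f"{BASE}/phenopackets.patients/data?page=2": {
        "data": [{"subject_id": "PP-2", "date_of_birth": "1982-03-15", "cancer_type": "glioma"}],
        "pagination": None,
    },
    f"{BASE}/fhir.patients/data": {
        "data": [{"id": "FH-9", "birthDate": "1985-06-30", "condition": "melanoma"}],
    },
}

# Each source keeps its own column names; the client maps them after retrieval.
COLUMN_MAPS = {
    "phenopackets.patients": {
        "subject_id": "subject_id",
        "date_of_birth": "date_of_birth",
        "cancer_type": "cancer_type",
    },
    "fhir.patients": {
        "id": "subject_id",
        "birthDate": "date_of_birth",
        "condition": "cancer_type",
    },
}


def fetch_all_rows(first_url, get_page):
    """Collect rows from every page, following next-page links until none remain."""
    rows, url = [], first_url
    while url:
        page = get_page(url)
        rows.extend(page.get("data", []))
        url = (page.get("pagination") or {}).get("next_page_url")
    return rows


def combined_patients(get_page=FAKE_RESPONSES.__getitem__):
    """Read both collections through the same code path and align their columns."""
    combined = []
    for table, column_map in COLUMN_MAPS.items():
        for row in fetch_all_rows(f"{BASE}/{table}/data", get_page):
            combined.append({target: row[source] for source, target in column_map.items()})
    return combined


for patient in combined_patients():
    print(patient)
```

The retrieval function is identical for both collections; only the small column mapping differs per source. Swapping `get_page` for a function that performs real HTTP requests would leave the rest of the sketch unchanged.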

The Data Connect development team discusses the Data Connect API on the OmicsXchange Podcast.

Impact on the Community

The community has built a variety of applications on top of the Data Connect API, allowing researchers to use platforms such as data explorers, Jupyter Notebooks, command-line interfaces, and R data frames to interrogate data that were originally generated across multiple formats, such as FHIR, Phenopackets, VCF, CSV, and more. 

A number of driver projects and other consortia have implemented the API to enable sharing of their data, including the National Cancer Institute Cancer Research Data Commons, the Autism Sharing Initiative, and the COVID Cloud consortium. Other initiatives, such as VariantMatcher, have started the implementation process as well.

The Data Connect API is also interoperable with other GA4GH standards, particularly the cloud and data access standards used in the GA4GH Connection Demos developed by the Federated Analysis Systems Project (FASP). Additionally, other specialized APIs, such as the GA4GH Beacon API, can be implemented on top of Data Connect to pull in more types of data and leverage an ecosystem of tools around Data Connect.
