10 Jan 2022
The GA4GH Standards Steering Committee approved the Data Connect API, a standard to support federated search of disparate datasets.
No one project, organization, or even country has the volume or diversity of data required to understand or treat many complex diseases. However, the healthcare and biomedical research ecosystems are powered by a large number of data generators who store their data in many locations, formats, and structures, depending on their intended purpose, workflows, and analyses.
“Biomedical data covers so many disciplines, technologies and diseases, that even as we standardize in one area, there remains huge diversity between fields,” said Ian Fore, Senior Biomedical Informatics Program Manager at the National Cancer Institute’s Center for Biomedical Informatics & Information Technology and a member of the Data Connect Product Review Committee. “Addressing a complex scientific problem may require working with data from an unfamiliar field of science. Data Connect’s ability to describe unfamiliar data helps scientists bring relevant data into their analysis.”
The independent production of biomedical data without specific guidelines and procedures has resulted in inconsistent and incomparable datasets, which need to be engineered to work together.
“Researchers who need to combine data from different systems are faced with two challenges: lack of data harmonization and technical differences in how to pull data from each system,” said Jonathan Fuerth, Director of Engineering at DNAstack and a contributor on the Data Connect development team. “Data custodians could work together to transform the data into a common model, but this can take significant time and resources. As a parallel option, if people who share data can describe the concepts already in their data, this information will be more accessible to a broader community and have great impact.” The broader community also brings more hands to help with harmonization.
Enter the new Data Connect API, developed by the GA4GH Discovery Work Stream. It provides a simple, flexible mechanism for connecting researchers to information about otherwise disparate datasets, and it adds to the group of standards that facilitate the discovery and use of varied data sources and services. Data Connect keeps a lack of harmonization from becoming a barrier to data sharing. It works best when data are harmonized to specific terminologies, models, or data elements; where harmonization is absent, it enables description of the data “as is” in a way that makes it easier for scientists to apply the data to their specific purpose.
To see what Data Connect does, take the example of an autism researcher who wants to better understand the genetic underpinnings of a phenotypic trait, such as repetitive behavior, across three different datasets. For the purposes of this example, assume each dataset uses a different data type and format: one may be in a relational database such as PostgreSQL, another behind a proprietary API such as Google Sheets, and a third in a structured data file such as an EMR extract. Each of these datasets contains similar kinds of data: genotypic, phenotypic, and physical attributes. The Data Connect API allows researchers looking for relationships across these datasets to find individuals relevant to their research question and co-analyze the data.
As Dean Hartley, Senior Director of Discovery and Translational Science at Autism Speaks, said, “Most autism data resides on many independent servers and there isn’t a good way for researchers to combine these data, let alone find this data. Data Connect is a critical part for researchers to search for the data and combine them to get the statistical power they need, or at least the preliminary data, to support their hypothesis. I believe that the Data Connect API will help fulfill this need and support their discovery.”
How it Works
Data Connect consists of two components. The first piece is a common JSON schema for describing and organizing data and its underlying model, allowing data providers to describe the form of their data. Many standards prescribe data models, meaning the data providers need to transform their data to conform to the model. The Data Connect API, however, does not prescribe a data model; it asks data providers to describe their data using the common JSON schema. “This allows data providers to expose more of their data in the way it currently sits, without needing to go through a heavy lift to harmonize all their data,” said Aaron Kemp, Senior Staff Engineer at Verily and co-lead of the Data Connect development team.
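The idea of describing data "as is" can be sketched in a few lines of Python. This is a minimal illustration, not the specification itself: the table name, the column names, and the `row_matches_model` helper are all hypothetical, and the `name` / `description` / `data_model` layout is an assumption that should be checked against the published Data Connect specification.

```python
# Hypothetical table description: the provider lists the columns it already
# has, in JSON Schema form, instead of reshaping them to a prescribed model.
TABLE_INFO = {
    "name": "autism_cohort.subjects",  # made-up table name
    "description": "Subjects as stored by the provider; not harmonized.",
    "data_model": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "subject_id": {"type": "string"},
            "age_at_enrollment": {"type": "integer", "description": "Years"},
            "repetitive_behavior_score": {"type": "number"},
        },
        "required": ["subject_id"],
    },
}

# Mapping from the JSON Schema type names used above to Python types.
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}


def row_matches_model(row, data_model):
    """Return True if `row` has all required columns and every value has the
    type its column declares. A tiny stand-in for a full JSON Schema validator."""
    properties = data_model.get("properties", {})
    for name in data_model.get("required", []):
        if name not in row:
            return False
    for name, value in row.items():
        if name not in properties:
            return False
        declared = properties[name].get("type")
        expected = _JSON_TYPES.get(declared)
        if expected is None:
            return False
        # bool is a subclass of int in Python, so reject it for numeric columns.
        if isinstance(value, bool) and declared != "boolean":
            return False
        if not isinstance(value, expected):
            return False
    return True
```

A consumer that has never seen this dataset can read `TABLE_INFO["data_model"]` to learn which columns exist and what they hold, then use a check like `row_matches_model` to confirm that returned rows agree with the description.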
The second component of Data Connect is a method for writing flexible, custom queries that allow users to learn more about a dataset of interest in a standardized, actionable format. This aspect of the API allows a researcher to ask a variety of questions. If they want to study a cohort, they could ask, “find me samples that have a specific mutation in this specific gene, that have been diagnosed with this specific condition.” If the data is more file-oriented, they could ask, “find me data objects that represent VCF files and are related to my study.” One could even ask, “find me nodes that I can execute this task on.”
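The cohort question above can be sketched as a parameterized query. The example below assumes that queries are written in SQL and sent as a JSON body with `query` and `parameters` fields; both points reflect one reading of the published specification and should be verified there. The in-memory SQLite database is only a local stand-in for a server, and the table, gene, mutation, and condition values are invented placeholders.

```python
import sqlite3


def build_search_request(query, parameters=()):
    """Return a JSON-serializable body for a search call (assumed shape)."""
    return {"query": query, "parameters": list(parameters)}


def run_locally(request, connection):
    """Execute the request against a local SQLite database.

    SQLite is not part of Data Connect; it only lets the sketch run offline.
    """
    cursor = connection.execute(request["query"], request["parameters"])
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def demo_connection():
    """Create an in-memory table with placeholder sample records."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE samples "
        "(sample_id TEXT, gene TEXT, mutation TEXT, diagnosis TEXT)"
    )
    connection.executemany(
        "INSERT INTO samples VALUES (?, ?, ?, ?)",
        [
            ("S1", "GENE1", "MUT_A", "condition X"),
            ("S2", "GENE1", "MUT_B", "condition X"),
            ("S3", "GENE1", "MUT_A", "condition X"),
            ("S4", "GENE2", "MUT_A", "condition Y"),
        ],
    )
    return connection


# "Find me samples that have a specific mutation in this specific gene,
# that have been diagnosed with this specific condition."
REQUEST = build_search_request(
    "SELECT sample_id FROM samples "
    "WHERE gene = ? AND mutation = ? AND diagnosis = ? "
    "ORDER BY sample_id",
    ["GENE1", "MUT_A", "condition X"],
)
```

Calling `run_locally(REQUEST, demo_connection())` returns one dictionary per matching sample. Against a real deployment, the same `REQUEST` body would be sent over HTTP instead, and the SQL dialect accepted by the server may differ from SQLite's.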
The two components of Data Connect are independent of one another, and one can implement the first without the second. “Regardless of whether you want to add support for the query component, just exposing a Data Connect endpoint is useful to describe how your data is stored and what they look like, and enable a researcher to retrieve them,” said Kemp.
“Data Connect allows data custodians to describe the semantics of their data without prescribing a particular model, and data consumers to interrogate it through complex questions. This makes it uniquely suited to be a way to access and combine data. Thanks to this standard, a client application using multiple types of data from multiple systems only needs to be able to talk to one API,” said Miro Cupak, Co-founder and Chief Technology Officer at DNAstack and co-lead of the Data Connect development team. “If I want to pull up information such as subject ID, date of birth, and cancer type for all patients in a Phenopacket collection and all patients in a separate FHIR collection, I can do that. Data Connect opens up an ecosystem of tools for performing simple, reusable computations on various datasets, enabling federated search and analysis in a way not previously possible.”
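The "one API" point in the quote above can be illustrated with a small client sketch. It assumes each table is read from a `/table/{name}/data` path and that responses carry rows in `data` plus an optional `pagination.next_page_url` link; these names follow one reading of the specification and are assumptions here. The table names, the `FAKE_SERVER` dictionary, and the `get_json` callable are hypothetical stand-ins for real collections and a real HTTP client.

```python
def fetch_all_rows(get_json, first_url):
    """Collect rows from every page, following next-page links until none remain."""
    rows = []
    url = first_url
    while url:
        page = get_json(url)
        rows.extend(page.get("data", []))
        url = (page.get("pagination") or {}).get("next_page_url")
    return rows


def combined_patients(get_json, table_names):
    """Read several tables through the same interface and tag each row
    with the table it came from."""
    combined = []
    for name in table_names:
        for row in fetch_all_rows(get_json, f"/table/{name}/data"):
            combined.append({"source": name, **row})
    return combined


# In-memory stand-in for a server: maps a URL to the JSON it would return.
FAKE_SERVER = {
    "/table/phenopackets.patients/data": {
        "data": [{"subject_id": "P1", "cancer_type": "type A"}],
        "pagination": {
            "next_page_url": "/table/phenopackets.patients/data?page=2"
        },
    },
    "/table/phenopackets.patients/data?page=2": {
        "data": [{"subject_id": "P2", "cancer_type": "type B"}],
        "pagination": None,
    },
    "/table/fhir.patients/data": {
        "data": [{"subject_id": "F1", "cancer_type": "type A"}],
    },
}

PATIENTS = combined_patients(
    FAKE_SERVER.__getitem__,
    ["phenopackets.patients", "fhir.patients"],
)
```

Because `get_json` is passed in, the same two functions would work unchanged with a real HTTP client that performs a GET request and returns the parsed JSON body.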
The Data Connect development team discusses the Data Connect API on the OmicsXchange podcast.
Impact on the Community
The community has built a variety of applications on top of the Data Connect API, allowing researchers to use platforms such as data explorers, Jupyter Notebooks, command-line interfaces, and R data frames to interrogate data that were originally generated across multiple formats, such as FHIR, Phenopackets, VCF, CSV, and more.
A number of driver projects and other consortia have implemented the API to enable sharing of their data, including the National Cancer Institute Cancer Research Data Commons, the Autism Sharing Initiative, and the COVID Cloud consortium. Other initiatives, such as VariantMatcher, have also started the implementation process.
The Data Connect API is also interoperable with other GA4GH standards, particularly the cloud and data access standards used in the GA4GH Connection Demos developed by the Federated Analysis Systems Project (FASP). Additionally, other specialized APIs, such as the GA4GH Beacon API, can be implemented on top of Data Connect to pull in more types of data and leverage an ecosystem of tools around Data Connect.