Learn how GA4GH helps expand responsible genomic data use to benefit human health.
Our Strategic Road Map defines strategies, standards, and policy frameworks to support responsible global use of genomic and related health data.
Discover how a meeting of 50 leaders in genomics and medicine led to an alliance uniting more than 5,000 individuals and organisations to benefit human health.
GA4GH Inc. is a not-for-profit organisation that supports the global GA4GH community.
To guide our collaborative, globe-spanning alliance, GA4GH relies on a Standards Steering Committee and an Executive Committee.
The Funders Forum brings together organisations that offer both financial support and strategic guidance.
The EDI Advisory Group responds to issues raised in the GA4GH community, finding equitable, inclusive ways to build products that benefit diverse groups.
Distributed across four Host Institutions, our staff team supports the mission and operations of GA4GH.
Curious who we are? Meet the people and organisations across six continents who make up GA4GH.
More than 500 organisations connected to genomics — in healthcare, research, patient advocacy, industry, and beyond — have signed onto the mission and vision of GA4GH as Organisational Members.
These core Organisational Members are genomic data initiatives that have committed resources to guide GA4GH work and pilot our products.
This subset of Organisational Members whose networks or infrastructure align with GA4GH priorities has made a long-term commitment to engaging with our community.
Local and national organisations assign experts to spend at least 30% of their time building GA4GH products.
Anyone working in genomics and related fields is invited to participate in our inclusive community by creating and using new products.
Wondering what GA4GH does? Learn how we find and overcome challenges to expanding responsible genomic data use for the benefit of human health.
Study Groups define needs. Participants survey the landscape of the genomics and health community and determine whether GA4GH can help.
Work Streams create products. Community members join together to develop technical standards, policy frameworks, and policy tools that overcome hurdles to international genomic data use.
GIF solves problems. Organisations in the forum pilot GA4GH products in real-world situations. Along the way, they troubleshoot products, suggest updates, and flag additional needs.
NIF finds challenges and opportunities in genomics at a global scale. National programmes meet to share best practices, avoid incompatibilities, and help translate genomics into benefits for human health.
Communities of Interest find challenges and opportunities in areas such as rare disease, cancer, and infectious disease. Participants pinpoint real-world problems that would benefit from broad data use.
See all our products — always free and open-source. Do you work on cloud genomics, data discovery, user access, data security or regulatory policy and ethics? Need to represent genomic, phenotypic, or clinical data? We’ve got a solution for you.
All GA4GH standards, frameworks, and tools follow the Product Development and Approval Process before being officially adopted.
Learn how other organisations have implemented GA4GH products to solve real-world problems.
Help us transform the future of genomic data use! See how GA4GH can benefit you — whether you’re using our products, writing our standards, subscribing to a newsletter, or more.
Help create new global standards and frameworks for responsible genomic data use.
Align your organisation with the GA4GH mission and vision.
Solve your real-world data problems with support from this valuable network of global institutions.
Work with like-minded groups committed to better data use in areas like rare disease, cancer, and infectious disease.
Share your thoughts on all GA4GH products currently open for public comment.
Solve real problems by aligning your organisation with the world’s genomics standards. We offer software developers both customisable and out-of-the-box solutions to help you get started.
Learn more about upcoming GA4GH events. See reports and recordings from our past events.
Speak directly to the global genomics and health community while supporting GA4GH strategy.
Be the first to hear about the latest GA4GH products, upcoming meetings, new initiatives, and more.
Questions? We would love to hear from you.
Read news, stories, and insights from the forefront of genomic and clinical data use.
Attend an upcoming GA4GH event, or view meeting reports from past events.
See new projects, updates, and calls for support from the Work Streams.
Read academic papers coauthored by GA4GH contributors.
Listen to our podcast OmicsXchange, featuring discussions from leaders in the world of genomics, health, and data sharing.
Check out our videos, then subscribe to our YouTube channel for more content.
View the latest GA4GH updates, Genomics and Health News, Implementation Notes, GDPR Briefs, and more.
Discover all things GA4GH: explore our news, events, videos, podcasts, announcements, publications, and newsletters.
11 Jan 2022
This is episode 13 of the OmicsXchange where we will be discussing the Data Connect API produced by the GA4GH Discovery Work Stream.
Angela Page 00:06
Welcome to the OmicsXchange, I’m Angela Page. Earlier in 2021, the GA4GH Steering Committee approved a new standard, the Data Connect API produced by the GA4GH Discovery Work Stream. Today, I’m here to talk with the Data Connect technical team to learn more about this newly approved standard and how the community can leverage it to efficiently publish, explore, and search for biomedical data. I am joined today by Miro Cupak and Jonathan Fuerth from DNAstack, and by Aaron Kemp from Verily. So to get us started, what is the Data Connect API? Is it a specification, a protocol, a model? Tell us more.
Miro Cupak 00:46
Yeah, I think it makes sense to call it a specification; it makes sense to call it an API, and probably a protocol as well, the same way Beacon is referred to as a protocol. But on the data model side, the basic difference for me is that pretty much every other API in GA4GH that I came across comes with a data model. It tells you: this is how I represent a variant. This is what it looks like: it has a chromosome field, it has a position field, and so on. This is what an organization looks like: it has a name, it has a URL, and a couple of other fields. So with other standards, you would always have to do some sort of transformation on your data to get it into the model that the API requires. Whereas in our case, the only thing we ask you to do is describe to us what you have, rather than transform it into something we’ve picked. That’s where the description part comes in.
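Miro’s point, that Data Connect asks a provider to describe the data it already has rather than reshape it, can be illustrated with a table description. In the published Data Connect specification a table’s schema is expressed as JSON Schema; the table and field names below are hypothetical examples, not part of the standard. A minimal sketch in Python:

```python
# Sketch of a Data Connect table description: instead of reshaping data
# into a prescribed model, the provider publishes a JSON Schema that
# says what the table already contains. Table and field names here are
# hypothetical examples, not defined by the standard.
import json

table_info = {
    "name": "my_project.variants",       # hypothetical table name
    "description": "Observed variants, as stored today",
    "data_model": {                      # JSON Schema describing one row
        "$id": "https://example.org/variants.schema.json",
        "type": "object",
        "properties": {
            "chrom": {"type": "string", "description": "Chromosome"},
            "pos": {"type": "integer", "description": "1-based position"},
            "ref": {"type": "string"},
            "alt": {"type": "string"},
        },
    },
}

print(json.dumps(table_info, indent=2))
```

Nothing about the rows themselves changes; only this self-description is added on top of whatever storage the provider already uses.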
Angela Page 01:42
Is that part of the benefit of it, part of the beauty of it: that you’re able to connect things together that you otherwise wouldn’t be able to, because other APIs are more prescriptive? So if you go in at this higher level, you can bring in things that don’t necessarily meet the specifications of Beacon or other APIs.
Aaron Kemp 02:01
Yeah, I would describe that as the primary advantage here, right. Like the whole point of this effort is to give people a way to expose their data the way it sits right now, without having to go through a heavy lift to harmonize it. It’s really about providing a common way to expose that data, right? It’s not as sophisticated as you might think. It’s really a fairly straightforward statement: this is a way that you can expose your data, and other implementations of the standard can query it in a way that is understandable by both machines and people.
Angela Page 02:33
I guess that seems really awesome and necessary, but it also seems too good to be true. How is it that you can take something as complex as all of these different ways that people are describing data, or whatever it is they’re doing out there in this crazy technical world of yours, and make it all talk to each other? It seems like that’s kind of what GA4GH’s mission is, all together. So how can it be that simple?
Aaron Kemp 03:02
Yeah, I think part of it is that we started from a premise of: let’s use the tools that exist, right? So instead of inventing a query language, we said we’ll use SQL. Instead of inventing a new way of presenting a schema, we said we’ll use JSON Schema. So it’s not like there’s no work to be done. When people expose their data, there are a couple of options. One option is to let the machines do the work and expose the data according to the underlying logical schema; that’s an option, but it doesn’t get you very far. A better option is to start with that, and then layer on top of it the actual semantic meaning of the underlying data. And that’s where we prescribe things like: let’s go and try to use existing things like SchemaBlocks where appropriate, and other things like that. So it’s not a magic bullet, in any sense. It’s more about trying to get a couple of things in place so that it’s feasible to start doing this sort of thing, where I can run a query across three implementations of Data Connect and get some data back, and then I can take the next step of figuring out what it might mean to harmonize those result sets.
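Aaron’s “use the tools that exist” point maps onto how a search request looks on the wire. In my reading of the Data Connect specification, a query is ordinary SQL submitted as a JSON body (with a `query` string and a `parameters` list for placeholders) to a `/search` endpoint; treat that request shape, and the table name below, as assumptions. The sketch only builds the body locally and makes no network call:

```python
# Build (but do not send) a hypothetical Data Connect search request.
# The {"query": ..., "parameters": ...} body shape follows my reading of
# the spec; the table and column names are invented for illustration.
import json

def build_search_request(sql: str, parameters: list) -> str:
    """Serialize a /search request body with positional parameters."""
    return json.dumps({"query": sql, "parameters": parameters})

body = build_search_request(
    "SELECT sample_id FROM my_project.variants WHERE chrom = ? AND pos = ?",
    ["chr7", 117559590],
)
print(body)
```

Because the query language is plain SQL and the body is plain JSON, any HTTP client and any SQL tooling a group already has can be reused, which is exactly the “don’t invent new tools” trade-off being described.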
Miro Cupak 04:10
Yeah, and the harmonization work is going to keep going. When we’re really talking about the benefits of the API, I totally get you, Angela, it kind of makes it sound like, “Okay, this is the silver bullet that just solves all the problems.” But it comes with trade-offs, right? We specifically made the decision to make it as easy and as flexible as possible for data providers to expose their data. But the trade-off is that we’re not explicitly harmonizing this data into a common model. So you could argue that on the data consumer side there’s a little bit more work you have to do to figure out how to consume the data. Whereas with something like Beacon, where they give you a specific model, that’s what you get when you come to the API, and it already comes prepackaged in that form. So we just chose different things to optimize towards than maybe most other APIs.
Angela Page 04:58
Could that also be considered a criticism of it, that it’s too high level?
Aaron Kemp 05:03
Yes, it is a criticism we’ve heard multiple times. And we’re kind of okay with that. It’s like, we’re not trying to fix the entire world, we’re trying to get some incremental progress along this axis.
Jonathan Fuerth 05:16
It’s kind of like… I’ve always avoided this analogy because it brings biblical analogies into things, but it’s like the Tower of Babel story. Everyone at GA4GH is trying to build the tallest tower, and they’re definitely all speaking different languages, and that is definitely a problem. And really what we’re doing with Data Connect is saying: hey, how about if we all just speak the same language? We’re not designing the tower, we’re not proposing what materials it should be made of. It’s like, well, at least if we all talk about things using the same words, that would help. So it is a really basic thing. It doesn’t sound as ambitious as building the tallest tower, because it isn’t. I think where we’re aligned in the Data Connect design is that it’s necessary to at least start using the same words for things as each other, and this is a way of doing that.
Aaron Kemp 06:03
Yeah, I would almost back up your analogy one step further and say this isn’t even about using the same words; it’s about deciding that we should speak, as opposed to flashing cards at each other, or jumping up and down, or whatever, right? We’re basically saying, “This is how we’re going to communicate. We are not solving the what-we’re-going-to-talk-about part.”
Angela Page 06:22
Okay, so a couple of questions here. What are some really concrete examples or questions that a data consumer could ask to learn more about the data? And what can the data user now do with that response?
Miro Cupak 06:34
Yeah, so we list a few example queries in the specification as well. But basically, because we’re not prescribing the data or the form of the data, the queries largely depend on what data a dataset exposes. So you could in principle ask queries around cohort selection: if your data contains samples, you can ask questions like, find me samples that have a specific mutation in this specific gene and have been diagnosed with this specific condition. But if your data is more file-oriented, maybe you’re using DRS as a representation of your files, that’s a completely different type of data than sample data. Maybe you’re asking questions like, find me DRS objects that represent VCF files and are related to this study. So it’s a really large variety of very different questions, and they depend on the underlying data in your particular implementation and your particular dataset. Does that make sense?
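The two kinds of questions Miro describes can both be phrased as ordinary SQL against whatever tables a site happens to expose. Every table name, column name, and value below is invented for illustration, since the standard deliberately does not prescribe a data model:

```python
# Two illustrative Data Connect-style queries. All identifiers are
# hypothetical: the standard prescribes SQL as the query language but
# not the tables or columns behind it.

# Cohort selection: samples with a given mutation and diagnosis.
cohort_query = """
    SELECT s.sample_id
    FROM samples s
    JOIN variants v ON v.sample_id = s.sample_id
    WHERE v.gene = 'BRCA2'
      AND v.hgvs = 'c.68-7T>A'
      AND s.condition = 'breast carcinoma'
"""

# File-oriented data: DRS objects that are VCFs tied to one study.
# The mime-type string is a placeholder, not an official registration.
drs_query = """
    SELECT d.drs_id, d.name
    FROM drs_objects d
    WHERE d.mime_type = 'text/vcf'
      AND d.study_id = 'HYPOTHETICAL-STUDY-1'
"""

for q in (cohort_query, drs_query):
    print(" ".join(q.split()))
```

The point is that both questions travel over the same endpoint in the same query language; only the schemas they run against differ.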
Angela Page 07:31
It does. Is it specific to data, though? Like, you mentioned DRS, but not TRS. Could you be searching for tool types using Data Connect, if there was metadata around the tools? Is that true?
Miro Cupak 07:44
Yeah, you could be searching for jobs. In fact, that is a request somebody has raised. It was actually something like: find me nodes that I can execute this task on.
Aaron Kemp 07:58
On that point, I would just like to say, I think an important thing to drive home is that you could implement the half of Data Connect that doesn’t have queries, and it’s still really useful. Even if you didn’t want to do query support at all, just exposing a Data Connect endpoint that describes the data stores you have and what they look like, and lets you retrieve that data, is lightyears ahead of where we are. Being able to query on top of that is like icing on the cake. If we can even just get people exposing the first part, it will be a big win.
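The “half without queries” Aaron mentions is just table listing and row retrieval. In my reading of the Data Connect specification, a table’s data endpoint returns a page of rows plus a pagination pointer (`data` and `pagination.next_page_url` fields; treat those names as assumptions). The sketch below takes any fetch function, so it can be exercised with a stub instead of a live endpoint:

```python
# Drain every row from a paginated table endpoint. `fetch` stands in
# for an HTTP GET returning parsed JSON; the "data" / "pagination" /
# "next_page_url" field names follow my reading of the spec and should
# be treated as assumptions.
from typing import Callable, Iterator, Optional

def iter_rows(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        yield from page.get("data", [])
        url = page.get("pagination", {}).get("next_page_url")

# Stub "server": two pages of rows, then no next link.
pages = {
    "/table/t/data": {"data": [{"id": 1}],
                      "pagination": {"next_page_url": "/p2"}},
    "/p2": {"data": [{"id": 2}], "pagination": {}},
}
rows = list(iter_rows("/table/t/data", pages.__getitem__))
print(rows)  # [{'id': 1}, {'id': 2}]
```

A client like this needs no query support at all on the server side, which is the incremental first step being argued for.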
Angela Page 08:30
So can you actually describe the use case there? So like, why would people want to do that without being able to search?
Aaron Kemp 08:39
Just because you could do a sort of limited search on top of it, even without using the standard, right? Imagine that I’ve exposed the 1,000-sample variant database that I have somewhere, and I’ve exposed the table that has metadata about those samples. 1,000 is a small enough number that I could actually just get that data and then explore on my own using whatever tooling I want. But at least it’s in a standard format that it can be retrieved in, as opposed to right now, where it’s sitting in some CSV or some database in a way that I can’t even discover it.
Jonathan Fuerth 09:17
Yeah, for example, the NIH dbGaP database has a whole bunch of open data. And it’s even described, but it’s described in XML files that have only ever been used to describe dbGaP datasets. So there aren’t any generic tools out there, no open-source or commercial tools, that understand that format and can help you find datasets of interest based on what they contain, even though all those data dictionaries are sitting out there. But if those datasets were indexed, or published, using the Data Connect standard, then a standard tool that can work with datasets from other places would also be able to help you find datasets in dbGaP that have data of interest to whatever you’re studying.
Angela Page 10:06
But there’s a lot of data in dbGaP. Does that mean it all has to be retrofitted? Does every one of those datasets have to be re-described in some way in order for dbGaP to implement this API?
Jonathan Fuerth 10:20
Yes, but it’s a mechanical transformation. Those XML files they already have that describe the data can be transformed into JSON Schema easily. We wrote a script just as a proof of concept to try that on a few datasets, and it took less than a day. To make it work well across all of them might be a two- or three-day project, but it’s not an expensive thing to do. And that would be true of many, many datasets: they all have a way of describing themselves, but each dataset does it differently. So the gap for the community is that we can’t write a tool that goes out and comprehensively figures out where all the data is, what it contains, and what it means. But if they were all published in a format where they talk about their concepts using Data Connect, saying this is what I have, these are the columns in this dataset, and so on, then it becomes possible to build all sorts of useful tooling for the community that can help people find and understand data.
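The “mechanical transformation” Jonathan describes can be sketched in a few lines: walk a data-dictionary XML file and emit JSON Schema properties. The XML layout below (`variable`, `name`, `description`, `type` elements) is a simplified stand-in, not the actual dbGaP format, but real dictionaries would be transformed in the same mechanical way:

```python
# Convert a simplified, dbGaP-style data dictionary into JSON Schema.
# The element names in SAMPLE_XML are illustrative only; the real dbGaP
# format differs, but the transformation is equally mechanical.
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<data_table id="pht000001">
  <variable id="phv00000001"><name>AGE</name>
    <description>Age at enrollment</description><type>integer</type></variable>
  <variable id="phv00000002"><name>SEX</name>
    <description>Reported sex</description><type>string</type></variable>
</data_table>
"""

def dict_to_schema(xml_text: str) -> dict:
    """Map each <variable> element to a JSON Schema property."""
    root = ET.fromstring(xml_text)
    props = {}
    for var in root.findall("variable"):
        props[var.findtext("name")] = {
            "type": var.findtext("type"),
            "description": var.findtext("description"),
        }
    return {"type": "object", "properties": props}

schema = dict_to_schema(SAMPLE_XML)
print(schema["properties"]["AGE"]["type"])  # integer
```

Once every dictionary is emitted as JSON Schema like this, the same discovery tooling can read them all, which is the gap being described.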
Angela Page 11:22
This has been fantastic. I’m so excited. Thank you, all of you for your time.
Aaron Kemp 11:26
Thank you so much, Angela.
Thank you for listening to the OmicsXchange, a podcast of the Global Alliance for Genomics and Health. The OmicsXchange podcast is produced by Alexa Frieberg and Stephanie Li, with music created by Rishi Nag. GA4GH is the international standards organization for genomics, aimed at accelerating human health through data sharing. I’m Angela Page, and this is the OmicsXchange.