11 January 2022
Angela Page 00:06
Welcome to the OmicsXchange, I’m Angela Page. Earlier in 2021, the GA4GH Steering Committee approved a new standard, the Data Connect API produced by the GA4GH Discovery Work Stream. Today, I’m here to talk with the Data Connect technical team to learn more about this newly approved standard and how the community can leverage it to efficiently publish, explore, and search for biomedical data. I am joined today by Miro Cupak and Jonathan Fuerth from DNAstack, and by Aaron Kemp from Verily. So to get us started, what is the Data Connect API? Is it a specification, a protocol, a model? Tell us more.
Miro Cupak 00:46
Yeah, I think it makes sense for me to call it a specification, makes sense to call it an API, a protocol probably as well, the same way Beacon is being referred to as a protocol. But yeah, certainly on the data model, the basic difference for me is that, you know, pretty much any other API, probably actually every other API in GA4GH that I came across, comes with a data model. It tells you, you know, this is how I represent a variant. This is what it looks like: it has a chromosome field, it has a position field, and whatnot. This is what an organization looks like: it has a name, it has a URL and, you know, a couple of other fields. So in the case of other standards, you would always have to do some sort of a transformation on your data to get it to the model that the API requires. Whereas in our case, the only thing that we ask you to do is to describe to us what you have, rather than transform it to something that we’ve picked. That’s where the description part comes in.
Angela Page 01:42
Is that part of the benefit of it, part of the beauty of it, that you’re able to connect things together that you otherwise wouldn’t be able to, because the others are more prescriptive? And so if you kind of go in through this sort of higher state, you can bring things in that aren’t necessarily meeting the specifications of Beacon or other APIs?
Aaron Kemp 02:01
Yeah, I would describe that as the primary advantage here, right. Like, the whole point of this effort is to give people a way to expose their data the way it sits right now, without having to go through a heavy lift to harmonize it. It’s really about providing a common way to expose that data, right? It’s not as sophisticated as you might think, right? Like, it’s really a fairly straightforward statement that, like, this is a way that you can expose your data, and other implementations of the standard can query it in a way that is understandable, both by machines and people.
Angela Page 02:33
I guess that seems really awesome and necessary, but it seems like too good to be true. How is it that you can take something so complex as all of these different ways that people are describing data, or whatever it is that they’re doing out there in this crazy technical world of yours, and, you know, make it talk to each other? It seems like that’s kind of what GA4GH’s mission is, you know, altogether. So how is it that it sounds so simple?
Aaron Kemp 03:02
Yeah, I think part of it is that we started from a premise of, like, let’s use the tools that exist, right? So instead of inventing a query language, we said, we’ll use SQL. Instead of inventing a new way of presenting a schema, we said, we’ll use JSON Schema. So it’s not like there’s no work to be done, right? I mean, when people expose their data, there’s a couple of options, right? One option is to let the machines do the work and expose the data according to the underlying logical schema. That’s an option, but it doesn’t get you very far. A better option is to sort of start with that, and then layer on top of that the actual semantic meaning of the underlying data. And that’s where we prescribe things like, let’s go and try and use existing things like SchemaBlocks, where appropriate, you know, other things like that. So it’s not a magic bullet, by any sense. It’s more about trying to get a couple of things in place so that it’s feasible to start doing this sort of thing where it’s like, I will run a query across three implementations of Data Connect, and I will get some data back, then I can take the next step of figuring out what it might mean to harmonize those result sets.
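As a rough illustration of the two building blocks Aaron mentions, the sketch below describes one table with JSON Schema and pairs it with a plain SQL query. The table name `samples`, its columns, and the helper `column_names` are hypothetical, made up for this example. The `name` / `description` / `data_model` layout reflects our understanding of how Data Connect describes a table, so treat it as an assumption and check the specification before relying on it.

```python
import json

# Hypothetical table description; the "data_model" value is ordinary JSON Schema.
SAMPLES_TABLE = {
    "name": "samples",
    "description": "One row per sample (illustrative only)",
    "data_model": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "sample_id": {"type": "string", "description": "Local sample identifier"},
            "gene": {"type": "string", "description": "Gene symbol"},
            "diagnosis": {"type": "string", "description": "Diagnosed condition"},
        },
    },
}


def column_names(table):
    """Return the column names declared in a table's JSON Schema data model."""
    return list(table["data_model"]["properties"])


# The query language is plain SQL; the table and columns are the hypothetical ones above.
QUERY = "SELECT sample_id FROM samples WHERE gene = 'BRCA1'"

if __name__ == "__main__":
    print(json.dumps(column_names(SAMPLES_TABLE)))
    print(QUERY)
```

The point of the sketch is that neither piece is new: a consumer who already knows JSON Schema and SQL can read the description and write the query without learning a bespoke model.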
Miro Cupak 04:10
Yeah, and the harmonization is gonna keep going. Like, you know, when we’re really talking about the benefits of the API, I totally get you, Angela, it kind of makes it sound like, “Okay, this is the silver bullet that just solves all the problems.” But it comes with trade-offs, right? Like, we’ve specifically made the decision to make it as easy and as flexible as possible for data providers to expose their data. But the trade-off is that we’re not explicitly harmonizing this data to a common data model. So you could argue that on the data consumer side, you know, there’s a little bit more work that you have to do to, like, figure out how to consume the data. Whereas, you know, with something like Beacon, where they give you a specific model, that’s what you get when you come to the API, and it already comes prepackaged in that form. So we just chose different things to optimize towards than maybe most other APIs.
Angela Page 04:58
Could that also be considered a criticism of it, that it’s too high level?
Aaron Kemp 05:03
Yes, it is a criticism we’ve heard multiple times. And we’re kind of okay with that. It’s like, we’re not trying to fix the entire world, we’re trying to get some incremental progress along this axis.
Jonathan Fuerth 05:16
It’s kind of like…I’ve always avoided this analogy because of biblical analogies into things. But it’s like the Tower of Babel story. It’s like everyone at the GA4GH is trying to build the tallest tower. And they’re definitely all speaking different languages. And that is definitely a problem. And really what we’re doing with Data Connect is saying, “Hey, how about if we all just speak the same language?” We’re not designing the tower, we’re not proposing what materials it should be made of. It’s like, well, at least if we all, like, talk about things using the same words, that would help. So it is a really basic thing. It doesn’t sound as ambitious as, like, building the tallest tower, because it isn’t. I think where we’re aligned in the Data Connect design is that it’s necessary to at least start, like, using the same words for things as each other. And this is a way of doing that.
Aaron Kemp 06:03
Yeah, I almost would back up your analogy one step further, and say, this isn’t even about using the same words, it’s about deciding that we should speak as opposed to like, flashing cards at each other or, you know, jumping up and down or whatever, right? Like we’re, we’re basically saying, “This is how we’re going to communicate, we are not solving the what we’re going to talk about part.”
Angela Page 06:22
Okay, so a couple of questions here. What are some really concrete examples or questions that a data consumer could ask to learn more about the data? And what can the data user now do with that response?
Miro Cupak 06:34
Yeah, so we are listing a few queries in, like, the specification as well. But basically, you know, because we’re not prescribing the data or the form of the data, the queries largely depend on what data is being exposed. So you could in principle ask queries around, like, cohort selection. You know, if your data contains samples, you can ask questions like, find me samples that have a specific mutation in this specific gene, and have been diagnosed with this specific condition. But also, if your data is maybe more file-oriented, maybe you’re using DRS as a representation of your files, you know, that’s a completely different type of data than the sample data. Maybe you’re asking questions like, find me DRS objects that represent VCF files and are related to this study. So it’s a really large variety of very different questions that depends on the underlying data in your particular implementation or this particular dataset. Does that make sense?
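The two kinds of question Miro describes could be written as SQL and wrapped in a search request roughly as follows. This is only a sketch: the table names (`samples`, `drs_objects`), every column name, the study identifier, and the server URL are invented for illustration, and the `/search` path with a JSON body containing `query` is our reading of the Data Connect specification rather than something stated in this conversation. The request is built but never sent.

```python
import json
import urllib.request


def build_search_request(base_url, sql):
    """Build (but do not send) a POST request carrying a SQL query as JSON."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Cohort-style question over a hypothetical sample table.
COHORT_QUERY = (
    "SELECT sample_id FROM samples "
    "WHERE gene = 'BRCA1' AND diagnosis = 'breast cancer'"
)

# File-style question over a hypothetical table of DRS objects.
FILE_QUERY = (
    "SELECT drs_id FROM drs_objects "
    "WHERE mime_type = 'text/vcf' AND study_id = 'STUDY-1'"
)

# example.org is a placeholder host, not a real Data Connect server.
request = build_search_request("https://data-connect.example.org", COHORT_QUERY)
```

Both questions go through the same request shape; only the SQL text changes with the data a given implementation exposes.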
Angela Page 07:31
It does. Is it specific to data though? Like, you mentioned DRS, but not TRS. Could you be searching for tool types, if there was metadata around the tools, using Data Connect? Is that true?
Miro Cupak 07:44
Yeah, you could be searching for jobs. In fact, that is a request that somebody has raised. It was actually a little bit of things like, “find me nodes that I can execute this task on.”
Aaron Kemp 07:58
On that point, I would just like to say, I think an important thing to try and drive home is that you could implement the half of Data Connect that doesn’t have queries, and it’s still really useful. Like, even if you didn’t want to do the query support at all, just exposing a Data Connect endpoint that describes the data stores that you have and what they look like, and lets you retrieve that data, is lightyears ahead of where we are. Being able to query on top of that is like sugar, like icing on the cake. If we can even just get people exposing the first part, it will be a big win.
Angela Page 08:30
So can you actually describe the use case there? So like, why would people want to do that without being able to search?
Aaron Kemp 08:39
Just because, you know, you could do sort of limited search on top of it, even not using the standard, right? Like, imagine that I’ve exposed the 1000-sample variant database that I have somewhere, I’ve exposed the table that has metadata about these things. 1000 is a small enough number that I could actually just get that data, and then sort of explore on my own using whatever tooling I might want to use. But at least it’s in a standard format that can be retrieved, as opposed to right now, where it’s sitting in some CSV or in some database or whatever, in a way that I can’t even discover it.
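The "just get the data and explore it yourself" case could look something like the sketch below, which pulls every row of a table by following page links. It assumes, based on our reading of the Data Connect specification, that each page is a JSON object with a `data` array and an optional `pagination.next_page_url`; the URLs and rows are made up, and `get_json` is a hypothetical stand-in for whatever HTTP client you use, so nothing here touches the network.

```python
def fetch_all_rows(first_url, get_json):
    """Collect rows from every page, following next_page_url until it is absent."""
    rows = []
    url = first_url
    while url:
        page = get_json(url)
        rows.extend(page.get("data", []))
        # A missing or null pagination block means this was the last page.
        url = (page.get("pagination") or {}).get("next_page_url")
    return rows


# Fake two-page response standing in for a real server (illustrative values only).
FAKE_PAGES = {
    "https://example.org/table/samples/data": {
        "data": [{"sample_id": "S1"}, {"sample_id": "S2"}],
        "pagination": {"next_page_url": "https://example.org/table/samples/data?page=2"},
    },
    "https://example.org/table/samples/data?page=2": {
        "data": [{"sample_id": "S3"}],
    },
}

all_rows = fetch_all_rows("https://example.org/table/samples/data", FAKE_PAGES.__getitem__)
```

Once the rows are in memory, a thousand of them is small enough to filter with ordinary local tooling, which is the limited, do-it-yourself search Aaron is describing.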
Jonathan Fuerth 09:17
Yeah, like, for example, the NIH dbGaP database has a whole bunch of open data. And it’s even described, but it’s described in these XML files that have only ever been used to describe dbGaP datasets. So there aren’t any generic tools out there, like no open-source tools or commercial tools, that understand that format and can help you find datasets of interest based on what they contain, even though all those data dictionaries are sitting out there. But if those datasets were indexed using, or published using, the Data Connect standard, then a standard tool that can work with datasets from other places would also be able to help you find datasets in dbGaP that have data that’s of interest to whatever you’re studying.
Angela Page 10:06
But there’s a lot of data in dbGaP. Does that mean that it all has to be, like, retrofitted? Like, every one of those datasets has to be re-described in some way in order for dbGaP to implement this API?
Jonathan Fuerth 10:20
Yes, it’s a mechanical transformation of those. But those XML files they already have that describe the data, those can be transformed into JSON Schema easily. Like, we wrote a script just as a proof of concept to try that on a few datasets, and it took less than a day. I mean, to make it work well across all of them might be like a two or three-day project. But it’s not an expensive thing to do. And that would be true of many, many datasets: they all have a way of describing themselves, but each dataset does it differently. And so the gap for the community is, we can’t write a tool that goes out and comprehensively figures out where is all the data, and what does it contain, what does it mean? But if they were all published in a format where they talk about their concepts using Data Connect, say, “This is what I have, these are the columns in this dataset,” and so on, then it becomes possible, right? All sorts of useful tooling for the community that can help people find and understand.
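To give a feel for what such a mechanical transformation might involve, here is a minimal sketch that turns a small XML data dictionary into a JSON Schema. It is not the proof-of-concept script Jonathan mentions. The XML layout below is a simplified, hypothetical one (a `data_table` element with `variable` children carrying `name`, `description`, and `type`) and should not be taken as the actual dbGaP format; the type mapping is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical data dictionary; not the real dbGaP XML layout.
EXAMPLE_XML = """
<data_table id="example_table">
  <variable id="v1">
    <name>AGE</name>
    <description>Age at enrollment</description>
    <type>integer</type>
  </variable>
  <variable id="v2">
    <name>SEX</name>
    <description>Reported sex</description>
    <type>encoded value</type>
  </variable>
</data_table>
"""

# Assumed mapping from dictionary types to JSON Schema types; unknown types become strings.
TYPE_MAP = {"integer": "integer", "decimal": "number", "string": "string"}


def dictionary_to_json_schema(xml_text):
    """Convert each <variable> in the dictionary into a JSON Schema property."""
    root = ET.fromstring(xml_text)
    properties = {}
    for variable in root.findall("variable"):
        name = variable.findtext("name")
        properties[name] = {
            "type": TYPE_MAP.get(variable.findtext("type", default=""), "string"),
            "description": variable.findtext("description", default=""),
        }
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": properties,
    }


schema = dictionary_to_json_schema(EXAMPLE_XML)
```

The output has the same shape regardless of how the source dictionary was written, which is what would let one generic tool work across datasets from different places.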
Angela Page 11:22
This has been fantastic. I’m so excited. Thank you, all of you for your time.
Aaron Kemp 11:26
Thank you so much, Angela.
Thank you for listening to the OmicsXchange, a podcast of the Global Alliance for Genomics and Health. The OmicsXchange podcast is produced by Alexa Frieberg and Stephanie Li with music created by Rishi Nag. GA4GH is the International Standards Organization for genomics aimed at accelerating human health through data sharing. I’m Angela Page, and this is the OmicsXchange.