23 October 2020
Angela Page: Welcome to the OmicsXchange, I’m Angela Page. In today’s episode, I’m speaking with Max Barkley, software developer and team lead at DNAStack, on the GA4GH Connection Demos. This initiative aims to demonstrate interoperability through real-world implementations of GA4GH standards, across multiple institutions. Max initially got involved in GA4GH several years ago through the DURI Work Stream, and he now co-leads the Federated Analysis System Project, or FASP, which is responsible for setting up the GA4GH Connection Demos. Welcome Max.
Max Barkley: Thank you for having me.
Angela Page: So to start us off, what is the main goal of the FASP?
Max Barkley: The main goal of FASP, at a high level, is to take all of these individual standards from different Work Streams that are solving really important, but very specific problems, and to show implementers, Driver Projects, data custodians, people who could benefit from them, how they work together in a cohesive system, as part of a real world solution. A common bit of feedback that I heard came from a lot of people who were very enthusiastic about GA4GH but didn’t know what the entry point was, and it was hard for them to see how some of these standards would fit into existing systems that they had. So that’s really the main goal of FASP: building demos that showcase how these things can work together. We’re trying to pave the road for these Driver Projects, implementers, and data custodians so that they see a solution that could work for them.
Angela Page: So the FASP initiative really started after the GA4GH 7th Plenary in 2019, where you showcased the “Golden Demo,” which is the predecessor to the GA4GH Connection Demos. What was the Golden Demo, and what was it trying to accomplish?
Max Barkley: So the Golden Demo was a demo originally co-developed by DNAStack and Google Cloud. It’s a demo to show GA4GH standards being used together in a full vertical solution. So it ties together a bunch of different standards at the entry level. It uses the emerging Discovery Search standard, and that’s for Data Discovery to find which samples or patients would be relevant for your research. It also uses search to do cohort selection to find a subset of samples that you want to run an analysis on. Then it uses the WES standard for actually executing a workflow and the DRS standard for obtaining the sample data to run inside that workflow. So really, the demo is showing how you can tie all these individual things together into a full vertical solution that can solve what we think is a pretty realistic use case of going from cohort selection to compute in the cloud.
Angela Page: At this year’s 8th Plenary, the FASP team showcased the 2020 Connection Demos—what are these, and how do they build upon what you did last year?
Max Barkley: So we have three demos. The first one is just the update of our Golden Demo. We’ve updated this demo in two key ways. First, we’ve enhanced the DRS support to keep up with the 1.1 spec release. And another really exciting addition is that we’ve added multi-cloud support. The previous Golden Demo was entirely running on Google Cloud, both the inputs that it consumed, and the actual compute; in this demo, the compute is still performed on Google Cloud, but now it is consuming inputs from both Google Cloud and AWS S3 buckets. And all of these inputs are controlled access and are protected by an authorization system that is at its core based on the GA4GH passport standard. So GA4GH passports are used for representing researcher identity, but also the actual access policies in the system that decide who is able to access data are written in the language of GA4GH passport visas.
So our second demo is the reproducibility demo. So where our Golden Demo is a vertical demo that shows many standards working together, the reproducibility demo is a horizontal demo; it shows a smaller set of standards, but across multiple organizations, right. So the primary standard that you’ll see in all of the stacks that have participated in this demo is the WES API for running workflows. And in this demo, different institutions have used their WES APIs to do a GWAS analysis on top of the 1000 Genomes dataset.
The third demo, which I think is, in a sense, really the thing that we’re going to springboard off of going into 2021, is based off of work that came out of our hackathon that’s been championed by Ian Fore at NIH. And this is a demo that I’ve sort of called a “criss-cross” demo. So where we have a vertical demo and a horizontal demo, Ian has done amazing work writing Python scripts in the hackathon we had earlier this year that call into GA4GH APIs in deployments across different institutions to run a workflow.
So it’s really exciting because I think this is the future direction that we want to be going in. And this is really just the beginning: we’re at a point where there are enough functional implementations of GA4GH standards out there that you can start to show these criss-crosses and show use cases that span multiple vendors’ infrastructure.
Angela Page: And is this the big goal? To have more standards working together across even more institutions and systems?
Max Barkley: Ultimately, you know, we want to get to the point where there aren’t three demos, where there’s one giant criss-cross demo that is both very tall and wide, right? Capturing a full solution to a real researcher problem using multiple standards, and those standards are cutting across multiple implementations in organizations. So that’s, you know, really the apex, the dream that we’re trying to move towards. And I’m really excited about this last demo from Ian as really a very positive concrete step in that direction.
Angela Page: So one of the other really valuable things that FASP provides is feedback into the Work Streams about the standards that have been approved and how they are working or not working for the institutions that are implementing them. So what lessons has the FASP team learned so far?
Max Barkley: Well, there’s been really a lot of learning over the past year. But I think the biggest area for me, I would consider to be around authorization, which I think was a large part of why this group was formed and has continued to be one of the most challenging parts of building real world demonstrations of GA4GH standards. We know that real world systems, you know, they have genomic data that often require Data Access Committees to obtain controlled access to that data. And so it’s very important that when you’re building a system that it’s protecting the integrity and security of the data. And largely building a single system for a single dataset is a solved problem. We know how to do this and we know how to do it reproducibly. But building a system that can make data securely accessible through multiple systems across multiple organizations in the public and the private cloud, you know, that’s a much more challenging problem and I think that’s where a lot of the interesting work in FASP comes from and what we’re trying to help people figure out.
Angela Page: How can the community take advantage of these demos when they’re trying to implement GA4GH standards?
Max Barkley: There are steps. The first step is implementing the standards, and what you’re gonna get from that is hopefully, that researchers can start to use these standards and at least have portability between these different datasets, in the kinds of tools they use. I think that on its own, you know, that introduces value; it’s not the full value, but it is valuable enough that I would encourage people to do this. But once you start getting into imagining pulling multiple datasets together and more complex authorization scenarios, it becomes a lot more complicated, and it really becomes a matter of what kind of trust models are acceptable for your organization. And for a long time, there’s still going to be a need for institutions to consciously make agreements with each other, to have certain levels of trust. The infrastructure that you build to enable those, if you use GA4GH standards, leaves the door open for expanding levels of lesser trust, and for things like registered access that the DURI Work Stream has been working on for a long time, where you can expose less identifiable data at a lower level of trust to more people, right. So it’s a multi-tiered thing. In the first step, you need to implement the standards to give people portability and to have something else to build off of. The second step really is deciding how broadly you can share your data and starting to work with potential partners and consumers to make sure that you’re implementing standards that can be used together. And that’s where the FASP demos can be, you know, a guide to you, as a way that you could operate in that kind of an environment.
Angela Page: What do you hope the community will take away from these demos?
Max Barkley: So from data custodians, we want them to look at these demos we’re showing and see that the standards are maturing. They’re getting very close to the point where you can use them in real systems. There are still some corners that need to be shaved off, but I think that they’re ready for you to decide to adopt in your organization and to invest the effort in building those systems. We’ve demonstrated they can solve the problems that your organizations have. But from researchers, we want you to see this and, first of all, tell us if you think that there’s something missing, if there’s a problem or a research task that you do where you don’t know how these standards would help you. But more importantly, you know, if you think that these are tools that you could work with, consider becoming a champion in your organization, trying to encourage people to explore systems that have been deployed, to see if they would be useful for your research, and to keep an eye out over 2021 as more of these systems come online, and consider investing some time trying out the systems as alternative tools to what you’re currently using.
Angela Page: To our listeners – to learn more and get involved, you can visit www.ga4gh.org/2020-connection-demos. And lastly Max, what do you have planned in 2021 and beyond?
Max Barkley: You know, today we have three demos. Two of them are more fleshed out, this vertical demo and this horizontal demo, but ultimately, the goal is really to have a cohesive criss-cross demo. So in 2021, we’ll be continuing to work to promote interoperability between different GA4GH implementations, and it’s going to be very exciting as, you know, we know that there will be many implementations that we expect to come online from various GA4GH Driver Projects that year. So we’ll continue to promote that, to work on more of these demos to test out how these systems work together, and to give feedback to these organizations on where things don’t quite work together and how they can improve that portability. And ultimately, we’ll move towards a world where there isn’t a need for separate vertical and horizontal demos in toy environments, where we’re getting to a point where there are just real demonstrations of production systems that exist in the wild that researchers can use.
Angela Page: Thank you so much for speaking with me today, Max. It was really a great conversation.
Max Barkley: It was my pleasure. Thank you.
Thank you for listening to the OmicsXchange—a podcast of the Global Alliance for Genomics and Health. The OmicsXchange podcast is produced by Stephanie Li and Caity Forgey, with music created by Rishi Nag. GA4GH is the international standards org for genomics, aimed at accelerating human health through data sharing. I’m Angela Page and this is the OmicsXchange.