24 April 2020
Angela Page: Large-scale sequencing initiatives around the globe are generating massive volumes of clinical genomic data that have the potential to inform research into human health and disease. But controlling access to these data is currently a cumbersome endeavor. Researchers must be properly authorized and their intended use must match the datasets’ consented use. The GA4GH Data Use and Researcher Identities Work Stream aims to create the standards required for efficient access control. The group has now released two standards that help streamline and automate this process: the Data Use Ontology and the GA4GH Passport specification. For this episode, we’re speaking with the product leads of these independent deliverables, Melanie Courtot and Craig Voisin, about the challenges presented by data access control and the solutions they’re developing within their teams. Melanie is a coordinator for metadata standards and the archival infrastructure and technology team at EMBL-EBI. She got involved with GA4GH in 2016 and now co-leads development of the Data Use Ontology standard. Craig is a software engineer at Google where he leads a team that provides federated access solutions across public clouds. He joined GA4GH in 2018 and co-leads development of the GA4GH Passport standard. Thank you both so much for speaking with us today.
Craig Voisin: Thank you very much.
Melanie Courtot: It’s a pleasure to be here with you.
Angela Page: To get us started, what does the landscape of data access look like right now?
Melanie Courtot: So right now, unfortunately, the process to get access to data is really, really long. It takes between two and six weeks to actually retrieve the data you may be interested in. Once you have found the dataset you are interested in, you need to request access to the data from a Data Access Committee, which is the organization managing the access rights. It means you first need to identify and get approval from somebody in your institution, an institutional signing official, and that person is legally responsible for your usage of the datasets. Then you need to write a description of your project and how you intend to use the data, and agree to terms and conditions, and this is all done manually in plain text at the moment.
Craig Voisin: Just to build on what Melanie was saying, there are multiple stages of data access that can be challenging. So one type of request is being able to discover what datasets are out there that are useful to apply to. Especially when it takes two to six weeks to apply for a dataset, it’s very helpful to know: should I spend the effort it takes to apply to these ten datasets or not? Should I focus down on these other ones? And then there’s another type of access which is, “I would like to have full read access, I’d like to be able to read the bytes, plug it into my analysis pipeline so that I can determine whether or not my hypothesis holds true for a particular variant, or I’m trying to analyze, you know, a particular cohort to see what they have in common.” And so now I need full read access to the data, and this has the tightest data governance controls, especially because it could expose personally identifiable information like genome sequences. And once a researcher has applied for direct read access to the bytes, there are even more hurdles to access it. Where do I need to go to even get a login to be able to access the data? Where’s the data located? What tools and protocols do I use to read it? None of these issues have anything to do with the data format or the data itself. They’re merely prerequisites to be able to access the data. So these are examples of different types of access, but there are even more flavors that are possible based on the institution or even the governing bodies across countries that have certain policies in place.
Angela Page: What about the Data Access Committees, or DACs. What challenges do they face?
Melanie Courtot: The Data Access Committees—they’re really the gatekeepers to data access, so they’re the group of people that have to check that the requests that come in can be matched against the data use limitations on datasets. So they have to evaluate what the request is about versus what the dataset says, and they have to do that manually. So it’s all about interpreting the form, analyzing the form, and making a judgment as to whether access should be granted or not. This is a very important step that helps safeguard the patient’s privacy and the data confidentiality, while ensuring the usage of the data is done in compliance with the original consent forms from the patient. This takes time and expertise—and as we mentioned earlier, once you have found the dataset and you are sending your data access request, it can take two to six weeks just to get the dataset back.
Craig Voisin: And just to add to that as well, these are the kind of timeframes it takes today. But as we know, the number of datasets is growing, and the number of applications to access that data is growing as well. So the scalability of the DAC also becomes a concern: as we onboard twice as many researchers as before, how do we make this scale so that the DAC can keep up with this workload?
Angela Page: Both of you are involved with the GA4GH Data Use and Researcher Identities work stream, or DURI. What are this group’s objectives and goals?
Melanie Courtot: Well, you’ve heard from both Craig and me about some of the issues that researchers are facing when they’re trying to access data today. The Data Use and Researcher Identities work stream, or DURI, aims at streamlining that process. Improving data availability will ultimately increase its reusability. So our role within the Global Alliance for Genomics and Health is really designing specifications that institutions can apply consistently. So the idea is not so much for us to develop tooling to do this—because that’s very much an institution-specific decision—but more describing the desired behavior of those tools. And if we all agree on what we’re trying to do and how to do it, at the end, the way it’s actually physically and technically implemented doesn’t matter so much because you know what the inputs and what the outputs are. And together, we believe that this is actually the only way the wealth and the amount of data the genomics community is generating can be efficiently leveraged to improve human health.
Craig Voisin: I think it’s great to see the size and engagement of the global community that’s trying to make a difference here. Automation can bring standardization to the process and the controls, and that would really help researchers focus on the research, not the data access process.
Angela Page: So what does automation mean in this context of data access, and why is this the approach that the DURI work stream is focusing on?
Melanie Courtot: As Craig mentioned, there’s really more data coming our way and more people wanting to access the data. So if you start multiplying the number of Data Access Committees by the number of requests, by the number of people wanting to make those requests, and you imagine that each of those requests has to follow a different process with a different type of form using a different tool, this is becoming a really big hindrance to actually reusing the data in any sort of efficient way. So one of the things that’s worth highlighting is that when we talk about automation, we are not talking about a hundred percent fully automatic process. We know that privacy and respecting the consent that a patient gave us at the start of the process are really critical in sharing data worldwide. However, when we look at the types of access requests and the types of restrictions on the datasets nowadays, 80% of those could be standardized and help triage access requests for the Data Access Committee. So the goal is not to automate absolutely everything, the goal is really to make the load on those Data Access Committees lighter.
Craig Voisin: Automation can streamline the application process to produce faster results for getting access, it can reduce some of the risks related to incorrect use by reducing how much interpretation is done based on text and making it easier to go through this process. And it can remove some of the access barriers to actually conducting the research. So all those things help, as well as the scalability that’s coming as we continue to grow how many researchers are accessing data.
Angela Page: So how would this all work—what would the ideal data access process look like?
Melanie Courtot: Well, if we are doing our work very well, nobody should know about it. Researchers can do their research. They can search deposits and request data seamlessly from within their own favorite application. They don’t need to know anything about data access: fetching the data from a specific location and logging in should just come together for them behind the scenes. And that happens in less than 24 hours. And this would allow researchers to really focus on their research without having to deal with any of the technical hurdles in getting the data.
Angela Page: Melanie, I know you lead development for the Data Use Ontology, or DUO, standard. Can you explain how DUO fits into this vision?
Melanie Courtot: So we talked a little bit about the need to manually assess textual access requests and how that’s a heavy process, both for the person who’s filling in the forms to request data and for the Data Access Committee who has to interpret them. So the Data Use Ontology, or DUO, provides a set of terms that are stable and unambiguous to represent those data access limitations. And they can be used at different stages in the process. So when you’re trying to find a dataset of interest, ideally you retrieve only those datasets that you can actually have access to. And then you can, for example, restrict your search to datasets that are compatible with a specific research proposal—so for example, if you are a diabetes researcher you can see which datasets were consented for diabetes-specific research, which is a subtype of DUO disease-specific research. And finally, the last step in the process, when a researcher comes and requests that dataset, they can very clearly, very unambiguously say, “I want to do diabetes-related research.” And by using that same Data Use Ontology term to describe the same thing, you suddenly don’t have the need for that interpretation of plain text forms anymore. The Data Access Committee can really be confident that they understand the meaning behind it. So each of the terms in DUO has a human-readable definition as well, and the idea is very much standardizing the understanding and getting that shared understanding of DUO.
Angela Page: So the goal is to arrive at these unambiguous, stable terms, as you mentioned. What is the process to develop these terms?
Melanie Courtot: That’s exactly right. We had a lot of discussion; let me take an example. What does not-for-profit mean? Does that mean using it for not-for-profit? Does that mean you are part of a not-for-profit organization? How does not-for-profit relate to non-commercial? So we’ve made one term that encompasses all those cases. And when researchers or data depositors use that term, the Data Access Committee can really be confident that they really understand the meaning behind it. So each of the terms in DUO has a human-readable definition as well, and the idea is very much standardizing the understanding and getting that shared understanding of DUO. And as you’re saying, even if we don’t fully automate the process, it’s much easier if you’re part of a Data Access Committee, and you say, “Oh, this is a diabetes-restricted dataset. Oh, that person wants to study diabetes,” it’s very clear-cut that you need to let that through, and it’s not going to take you one or two hours reading text and trying to understand the meaning behind the text. Furthermore, we distribute DUO as machine readable. So separate tools can automatically compare the DUO terms on the access request versus the DUO terms on the dataset, make a determination as to whether you should be granted access or not, and propose that determination to Data Access Committees. And I should add that DUO is also being applied right now in real-life production systems. For example, as of today, the European Genome-phenome Archive, EGA, has around 500 datasets with DUO terms, and based on this success, EGA is now requiring that DUO terms be added on all new submissions, whether they’re coming through CRG in Spain or EMBL-EBI in the UK.
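To make the machine-readable comparison described above concrete, here is a minimal Python sketch of a tool that compares the DUO term on an access request with the DUO term on a dataset and proposes a determination for the Data Access Committee. The `DataUse` class, the `propose_decision` function, the disease labels, and the matching rule itself are assumptions made for illustration; they are not part of the DUO standard and are far simpler than what a production system would need. The two identifiers are the DUO terms for general research use and disease specific research.

```python
from dataclasses import dataclass
from typing import Optional

# DUO identifiers used in this sketch.
GENERAL_RESEARCH_USE = "DUO:0000042"   # general research use
DISEASE_SPECIFIC = "DUO:0000007"       # disease specific research


@dataclass(frozen=True)
class DataUse:
    """Hypothetical container for a DUO term plus an optional disease label."""
    term: str
    disease: Optional[str] = None


def propose_decision(dataset: DataUse, request: DataUse) -> str:
    """Propose a determination for the DAC; a human still makes the final call.

    Simplified rule (an assumption, not a GA4GH algorithm):
    - a general-research-use dataset is compatible with any research request;
    - a disease-specific dataset is compatible only with a request for
      research on exactly the same disease;
    - everything else is routed to manual review.
    """
    if dataset.term == GENERAL_RESEARCH_USE:
        return "propose-approve"
    if (
        dataset.term == DISEASE_SPECIFIC
        and request.term == DISEASE_SPECIFIC
        and dataset.disease is not None
        and dataset.disease == request.disease
    ):
        return "propose-approve"
    return "manual-review"


diabetes_dataset = DataUse(DISEASE_SPECIFIC, disease="diabetes")
diabetes_request = DataUse(DISEASE_SPECIFIC, disease="diabetes")
asthma_request = DataUse(DISEASE_SPECIFIC, disease="asthma")

print(propose_decision(diabetes_dataset, diabetes_request))
print(propose_decision(diabetes_dataset, asthma_request))
```

In this sketch a mismatch is never rejected outright; it falls back to manual review, which reflects the point made in the conversation that the aim is to triage requests and lighten the committee's load rather than to automate every decision.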
Angela Page: So Craig, you lead the development of the GA4GH Passport standard. Could you tell us a bit more about this and how it fits into the data access process?
Craig Voisin: Passports are about improving the quality and consistency of controls that are in place for access across the industry, while at the same time streamlining the process to prove that it was fit for use. I think it’s worth talking a little bit about the analogy here. If I’m working within my home organization, there’s probably a way that my organization has me identify myself—a consistent way that I can get access to various pieces of data or my inbox or whatever needs to be done. Where the complication comes in is when you start crossing over environments—where you start going cross-organizations, or cross-platforms, or these kinds of things. And it’s a bit like if I need to go to my corner store around the corner, I don’t need to carry a passport with me in order to do that, because it’s all within my home area. But as soon as I cross borders, there’s going to be a new process that’s in place to determine whether it’s fit for me to be able to access that country, for example, in the physical form. And so we’re just trying to carry that analogy into things for researchers. So a researcher passport—a GA4GH Passport—allows an individual to identify themselves and carry their permissions with them between their locations and environments. And we do this by playing on that analogy of a visa within a passport. As a researcher, I can build up different visas in my passport: one can identify me as a bona fide researcher, which is a general term, and another can identify my home organization, and each is digitally signed so that it can be verified as genuine. And then, as Data Access Committees grant my access requests, I can build up a set of controlled access grants that give me read access to those datasets.
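The visa analogy above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the GA4GH Passport implementation: in the actual specification visas are digitally signed tokens, and signature verification is deliberately left out here. The `Visa` class, the `can_read` function, and all `example.org` URLs are hypothetical; the visa type names `ResearcherStatus`, `AffiliationAndRole`, and `ControlledAccessGrants` follow the standard visa types defined in the Passport specification.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Visa:
    """Hypothetical, unsigned stand-in for one visa carried in a passport."""
    type: str      # e.g. "ResearcherStatus", "ControlledAccessGrants"
    value: str     # what is asserted, e.g. a dataset URL
    source: str    # the organization that made the assertion
    expires: int   # expiry time, in seconds since the Unix epoch


def can_read(visas: Iterable[Visa], dataset: str,
             trusted_sources: Set[str], now: int) -> bool:
    """Return True if some unexpired visa from a trusted source grants
    controlled access to the given dataset.

    Assumption: signature checks have already happened elsewhere.
    """
    return any(
        v.type == "ControlledAccessGrants"
        and v.value == dataset
        and v.source in trusted_sources
        and v.expires > now
        for v in visas
    )


passport = [
    Visa("ResearcherStatus", "https://example.org/bona-fide",
         "https://university.example.org", expires=2_000_000_000),
    Visa("AffiliationAndRole", "faculty@university.example.org",
         "https://university.example.org", expires=2_000_000_000),
    Visa("ControlledAccessGrants", "https://example.org/datasets/710",
         "https://dac.example.org", expires=2_000_000_000),
]

trusted = {"https://dac.example.org"}
print(can_read(passport, "https://example.org/datasets/710", trusted, now=1_600_000_000))
print(can_read(passport, "https://example.org/datasets/999", trusted, now=1_600_000_000))
```

The point of the sketch is that the researcher carries the grants, while each data holder decides which visa sources it trusts, much as a border officer decides which issuing authorities to accept.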
Angela Page: Right. And I can imagine there’s an enormous impact from the researcher side of having a universal digital identification that works everywhere.
Craig Voisin: The more people that are on this global GA4GH Passport network, the more opportunities there are to streamline these more complex use cases. And that means that you can actually combine data responsibly in ways that it hasn’t been before to make new discoveries. The GA4GH Passport standard was just ratified in October 2019, and we’re already seeing a lot of work that’s being done to enable discovery and read access on datasets. In fact, there are over 5000 datasets today that use passports in some capacity to modernize and streamline access verification. I would expect to see the number of datasets and the methods to acquire the appropriate access grow considerably across the global community in 2020. So it’s gonna be very exciting to see.
Angela Page: How do you see these two standards working together to improve the data access process?
Melanie Courtot: One thing we need to do this year is to try to marry those two processes, to really align them. At this point in time, we have the pieces. And it’s really about bringing them and fitting them together. We’re really trying to make sure that on the one hand, you can handle authentication and authorization via passports. And on the other hand, you have a process for the Data Access Committee to automatically review DUO-tagged elements. So it’s going to be really exciting to see how we can really merge the two processes and have one common workflow for researchers to get access to the data.
Craig Voisin: I think it takes a lot of community involvement, because obviously there are different practices in place across the globe. But it’s going to be an interesting space because once we know the DUO codes, what is involved with each individual dataset, we can both streamline the DAC process to be able to get rights to it, but also carry some of this content inside the passport itself as proof and evidence for access. So it can potentially be used directly to be able to get access to certain types, different levels of access. But I think overall, what we’re going to see first probably is more about streamlining the DAC process to get a visa that says the researcher should have read access to the dataset. So I suspect that’s where a lot of this will start and then we’ll evolve that based on what the needs are across the community.
Angela Page: So how will these standards and this whole automation process improve the research endeavor as a whole?
Melanie Courtot: At the end, it’s all about making the data flow faster to improve its reuse and eventually analyses and the applications to make discoveries from those data that we can integrate from multiple institutes and sources. And you know, this is really the only way we think you can have a lasting impact on human health—you put the data together, merge it, integrate it, analyze it seamlessly. You don’t need to be concerned about the technicalities. Where is it coming from? What format is it in? You can all have access to data and really use it for your research.
Craig Voisin: Well, I look forward to accelerating research while putting even more controls and transparency in place for the data donors and participants. Now we start this journey with researchers themselves. But I’d like to see a day when we see the same technologies being used in clinical settings with patients. The system can provide participants and patients direct access to their own data and provide mechanisms to consent for research use, and then see where their own data has been used across research projects. This can encourage even more people to step forward and make a huge impact on the health and well-being of us all.
Thank you for listening to the OmicsXchange—a podcast of the Global Alliance for Genomics and Health. The OmicsXchange podcast is produced by Stephanie Li and Caity Forgey, with music created by Rishi Nag. GA4GH is the international standards org for genomics, aimed at accelerating human health through data sharing. I’m Angela Page and this is the OmicsXchange.