12 May 2020
Angela Page: Welcome to the OmicsXchange. I’m Angela Page. The urgency of scientific data sharing is never more apparent than during a global disease outbreak. This episode marks the first in a series of conversations on the role of data sharing during the COVID-19 pandemic. We’ll spend the next few weeks speaking with members of the international genomics community about new initiatives that leverage collaboration, interoperability, and open science to advance research into the novel coronavirus. Today we hear from Marc Fiume, CEO of DNAstack and co-lead of the GA4GH Discovery Work Stream, about the COVID-19 Beacon, an initiative aimed at making viral genomic datasets discoverable for investigators around the world. Welcome, Marc.
Marc Fiume: Thank you so much for having me. It’s wonderful to speak about open data sharing in a time where the world needs it most.
Angela Page: So tell us a little bit about the COVID-19 Beacon. What is the focus of this project?
Marc Fiume: The COVID-19 Beacon is a search engine across publicly available virus genomes that visualizes, for a given mutation, the geographic and evolutionary origins of that sequence. We’re updating the Beacon nightly to include more genome sequences as they’re published around the world, and we’re at over 5,600 genome sequences. The COVID-19 Beacon builds upon the latest version of the Beacon specification being led by Jordi Rambla at ELIXIR, and a team of international collaborators within GA4GH and specifically the Discovery Work Stream.
Angela Page: And why is this important during a global pandemic, in a time where so many regions of the world are in crisis?
Marc Fiume: I think one of the really interesting observations of COVID-19 and our response to it is that it kind of reminds you of what your priorities are. At home you call your parents and hug your kids. And at work, it has really helped us to recognize the order of our priorities when we approach science collaboratively. I think it’s been really great to see open science have increased priority over some other things that could potentially block data sharing, like attribution, for example. So it’s really, really important that we share data in as close to real time as we can, nationally and internationally, to increase the volume and diversity of information that scientists have to look at the virus and the host that it’s affecting.
Angela Page: How did the project get off the ground?
Marc Fiume: As things were escalating with the COVID-19 outbreak, we thought, you know, GA4GH develops a lot of standards for open data sharing and my group focuses primarily on data discovery, so essentially creating search engines for biomedical data, whether it’s for genotypes or moving into the space of phenotypic and other metadata. And so Beacon is the simplest of these. It’s sort of this genomic “Go Fish,” where you can ask a Beacon, “Do you have information about this variant, yes or no?” And optionally, a data provider can tell you more. So in the context of human genetics, you might say, here are the phenotypes associated with patients who have this variant, breast cancer, for example. But the standard is agnostic to what the payload or what that metadata can include. So the payload is the package of information that flows along with the Beacon response. So the question is now, is there a viral sequence with this mutation? Yes or no? So we repurposed the payload to include information about the viral sequence in which the mutation was found: things like geography, that is, where in the world that virus was found, or what evolutionary strain it maps to. And it gives scientists a better understanding about how the virus is mutating, where it comes from, and how it might be evolving as it transmits itself around the world.
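The "Go Fish" exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Beacon specification or the COVID-19 Beacon's actual implementation: the names `Variant`, `ViralSequence`, `SEQUENCES`, and `query_beacon` are hypothetical, and the accessions, countries, clades, and variant positions are made-up toy values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A single-nucleotide change (hypothetical model, not the Beacon schema)."""
    position: int   # position on the reference genome (toy value)
    reference: str  # base in the reference genome
    alternate: str  # base observed in the sample


@dataclass(frozen=True)
class ViralSequence:
    """One viral genome plus the metadata that could travel in a payload."""
    accession: str        # made-up identifier
    country: str          # where the virus was sampled
    clade: str            # evolutionary grouping (toy label)
    variants: frozenset   # set of Variant objects found in this genome


# Toy dataset; none of these records are real.
SEQUENCES = [
    ViralSequence("TOY-001", "Canada", "clade-A",
                  frozenset({Variant(100, "C", "T")})),
    ViralSequence("TOY-002", "Spain", "clade-B",
                  frozenset({Variant(100, "C", "T"), Variant(250, "A", "G")})),
    ViralSequence("TOY-003", "Australia", "clade-A", frozenset()),
]


def query_beacon(sequences, variant, include_payload=False):
    """Answer 'is there a sequence with this variant?' with yes or no.

    The optional payload carries metadata about the matching sequences.
    """
    matches = [s for s in sequences if variant in s.variants]
    response = {"exists": bool(matches)}
    if include_payload and matches:
        response["payload"] = [
            {"accession": s.accession, "country": s.country, "clade": s.clade}
            for s in matches
        ]
    return response


if __name__ == "__main__":
    # A variant present in one toy genome, asked for with its payload.
    print(query_beacon(SEQUENCES, Variant(250, "A", "G"), include_payload=True))
    # A variant that no toy genome carries.
    print(query_beacon(SEQUENCES, Variant(999, "G", "C")))
```

The point of the design is that the yes/no answer is always present, while the payload is something the data provider opts into, which is why the same question can serve human phenotypes in one setting and viral geography or lineage in another.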
Angela Page: So is this one data set that has lit a Beacon or is it multiple data sets coming together under one hood?
Marc Fiume: So the Beacon Network is actually a search engine across over 100 beacons that are distributed worldwide, and there are good reasons for those beacons being distributed across the world. Sometimes it’s the easiest way to create an implementation. But, you know, quite often there are regulatory restrictions on where human data can reside. The COVID-19 Beacon is slightly different in architecture as it stands today in that we’ve centralized the dataset by ingesting viral genome sequence data from multiple data sources, and hosting a single Beacon that can be searched. Later, we may see other viral genome sequence Beacons be lit, but currently there is only one and it’s aggregating sequence information from multiple sources.
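The two architectures contrasted above, a network that fans one question out to many distributed beacons versus a single beacon over an aggregated dataset, can be sketched as follows. This is a toy model under stated assumptions: `make_beacon`, `query_network`, and `centralize` are hypothetical helper names, the source names are invented, and variants are represented as plain `(position, reference, alternate)` tuples with made-up values.

```python
def make_beacon(name, known_variants):
    """Build a toy beacon: a function answering yes/no for one dataset."""
    known = frozenset(known_variants)

    def beacon(variant):
        return {"beacon": name, "exists": variant in known}

    return beacon


def query_network(beacons, variant):
    """Distributed model: ask every beacon and collect each answer.

    The data stays with each provider; only yes/no answers are gathered.
    """
    return [beacon(variant) for beacon in beacons]


def centralize(sources, name="central"):
    """Centralized model: ingest every source into one dataset, one beacon."""
    merged = set()
    for variants in sources.values():
        merged.update(variants)
    return make_beacon(name, merged)


# Toy sources keyed by an invented name; the variant tuples are not real data.
SOURCES = {
    "source-a": {(100, "C", "T")},
    "source-b": {(100, "C", "T"), (250, "A", "G")},
    "source-c": set(),
}

if __name__ == "__main__":
    network = [make_beacon(name, variants) for name, variants in SOURCES.items()]
    print(query_network(network, (250, "A", "G")))

    central = centralize(SOURCES)
    print(central((250, "A", "G")))
```

In the distributed form each provider answers for its own data, which suits cases where data cannot leave a jurisdiction. In the centralized form one service answers for everything it has ingested, which is simpler to stand up when the underlying data is already public.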
Angela Page: How is this initiative helping to accelerate COVID-19 research?
Marc Fiume: There are a number of ways that analyzing this kind of information can inform research and public policy. The first one is in strain identification, that is, helping to understand which strain of the virus someone has. This can be very helpful with diagnostics and treatment. The second one is in transmission route tracing, that is, helping us to understand which strains of the virus are being transmitted from where to where, and this is helpful in understanding the effectiveness of public policy and where we need to enforce more stringent guidelines. Next is mutation rate: that is, we can understand the genetic makeup of the virus as it’s evolving through local clusters, now that we’re socially isolating and physically restricting how far it can move because of travel bans. The next one is molecular structure: we can learn about the physical conformation of proteins that the virus is creating and whether it’s possible to repurpose drugs for similar-looking proteins produced by viruses like SARS, which have 97% sequence similarity to the novel coronavirus. And the last one is more prospective: it is in future connectivity to host genome information. So we expect that genomic, clinical, and phenotypic data from human subjects who are affected by the virus will become more readily available. And it’s possible to link these to specific strains and understand what the underlying genetic risk factors and prognostic indicators are for a specific strain.
Angela Page: Do you have insights into how the scientific community is using the COVID-19 Beacon?
Marc Fiume: There are multiple applications that we could imagine for the COVID-19 Beacon. You can use it to track evolutionary strains, transmission routes, mutation rates, and potentially co-infections, if we were to add clinical metadata. The Beacon is meant to start the conversation around real-time, global, distributed data sharing, and we plan to increase the power of the COVID-19 Beacon, but also launch a suite of other implementations of GA4GH standards that serve the same or similar kinds of data in different ways. So the Beacon itself right now has been queried a couple thousand times and we’re getting new feature requests on how we could add either more functionality or information to the payload to help scientists do science, like, how do these mutations affect the protein sequence and the 3D conformation of the proteins in which these variants lie? There’s another really interesting use of the Beacon, and it’s always amazing to hear ways people are using software that you don’t really anticipate. We heard from one expert in genome assembly, which is sort of playing Humpty Dumpty with genome sequence data, piecing these virus genomes back together from the raw data in order to come up with a linear sequence. We’ve heard these developers tell us that they’re using Beacons to sort of sanity check or debug their algorithms for doing the assembly, so that we can standardize methods for COVID-19 virus genome assembly and potentially other virus genomes in the future. So we’re finding out, you know, how the community is using the Beacon in addition to the use cases we anticipated. We don’t yet have research results. But, you know, I could say that there are ambitious initiatives being started to do virus genome sequencing at scale. It’s going to be very important, too, for localities and regions and provinces and states and countries to be tracking the local evolution of the virus because, you know, there are travel bans and so whatever strain of the virus is affecting a community might be unique to that community. So I think it’s going to be really interesting to understand how we could use data sharing technologies to track the local evolution of the virus and help us design more tailored treatments and diagnostic tools.
Angela Page: How does this project fit into larger overarching themes within GA4GH more broadly?
Marc Fiume: One of the goals of the GA4GH has been to make data flow as pervasively as it does on the internet, constrained by, you know, recognizing that genomics data is highly sensitive and identifiable. And so we need to develop a suite of additional layers, a suite of standards that deal with that complexity, which we don’t have with information flowing across the open web. And so as the GA4GH matures, it’s now in a position where we can set up integrations of multiple standards from across work streams. So one of the initiatives we’re working on is called FASP, or Federated Analysis Systems Project. And it brings together Discovery, DURI, and Cloud Work Stream standards to realize a working integration for distributed discovery, access, and analysis on the cloud. I think that’s one of the holy grails for our field: how do we collectively learn from the wealth of information that’s being generated around the world, while respecting regulatory and privacy constraints? I think if we can do that, we’ll break through a lot of barriers and make discoveries really quickly. So without question, I think the COVID-19 Beacon will aim to fit into the FASP mold in that we’re looking at the host genome for clues about why some people are asymptomatic while others develop severe illness and die from COVID-19. And when we go in that direction, it’ll be really important for us to develop systems that prospectively consent individuals to share their data with individuals who have the right research purpose and intent. During a time where the world really needs that access to data as quickly as possible, I think that’s going to really help accelerate science.
Angela Page: This is obviously a global health crisis in the truest sense, and many millions of people are being affected. How do you hope the COVID-19 Beacon will offer benefits to human health and medicine over time?
Marc Fiume: COVID-19 requires urgent attention. But we believe we will beat this. And at some point, we’ll have to return to normal. I think there’s an opportunity here in the time of crisis to mount a response, you know, much like an immune system, so that this doesn’t catch us off guard again. And so I think this is definitely raising the importance of population genome sequencing, and I hope that it inspires health systems and countries to kind of follow the lead of other countries, like what we see with All of Us and Genomics England and Australian Genomics Health Alliance, to really look at population sequencing as a way to create a data resource that allows us to tap into scientific discoveries in times of need.
Angela Page: Well said. Thank you so much for speaking with us today, Marc.
Marc Fiume: Thank you guys so much.
Thank you for listening to the OmicsXchange—a podcast of the Global Alliance for Genomics and Health. The OmicsXchange podcast is produced by Stephanie Li and Caity Forgey, with music created by Rishi Nag. GA4GH is the international standards org for genomics, aimed at accelerating human health through data sharing. I’m Angela Page and this is the OmicsXchange.