10 June 2022
Angela Page 0:03
Welcome to the OmicsXchange. Today, I’m speaking with two members of the GA4GH Genomic Knowledge Standards Work Stream, or GKS. Larry Babb is a software engineer at the Broad Institute of MIT and Harvard, and Alex Wagner is an assistant professor at the Nationwide Children’s Hospital and the Ohio State University College of Medicine.
Larry and Alex are co-leads of the Variation Representation Specification, or VRS. This is an open-source, community-driven spec that lays out a computational framework for representing biomolecular variation. It enables computers to describe information about variation in order to support federated data exchange.
Welcome, Larry and Alex.
Alex Wagner 0:45
It’s a pleasure to be here.
Larry Babb 0:46
Yes. It’s a pleasure to be here. Thank you, Angela.
Alex Wagner 0:49
Thanks for having us.
Angela Page 0:50
Yeah, so I just read a whole bunch about the two of you. And I wanted to get to you as quickly as possible. And I left a couple of things out in the intro, which is exactly what is the Genomic Knowledge Standards Work Stream? And why are you guys involved? Why do you care about it? What brings you here?
Larry Babb 1:06
So, my involvement with the Genomic Knowledge Work Stream came about through my participation in the ClinGen project and the ClinGen grant. I work for Heidi Rehm, who is a tireless advocate for data sharing, transparency, data exchange, all that kind of thing — related to genetic variation and genetic test results, to improve the quality of clinical care around genomics. And I’ve worked for her for several years, close to 15 years now, actually. And it’s been her passion to have me do the best I can to help promote standards in data sharing.
Alex Wagner 1:45
Yeah, and I got involved with the Genomic Knowledge Standards Work Stream through my work with one of the Driver Projects at GA4GH, which is the Variant Interpretation for Cancer Consortium. Years ago, I was doing my postdoctoral research at Washington University, and my work focused on how we can harmonise content from different knowledge bases, which are used to interpret variants. So, I helped build this knowledge base called CIViC, which is a Clinical Interpretation of Variants in Cancer knowledge base.
But I was also concerned about the content in these various other knowledge bases, like OncoKB and the Jackson Labs Clinical Knowledge Base, and things like that — bringing it all together and representing it all the same way.
And it turns out, that’s really hard. And it turns out that these different frameworks talk to each other very differently, even in the same domain. And so there was a lot of work that went into trying to harmonise across these resources.
And then from that, I learned that there’s actually quite a bit of work that needs to be done more globally on how we describe genomic knowledge across all such knowledge bases — both in the somatic cancer space where I’m from, as well as over in, you know, ClinGen’s germline space, which is more of what Larry focuses on.
And so through that work, I decided to get engaged in the Genomic Knowledge Standards Work Stream to help drive development of those standards. And eventually, I was invited on with Larry to co-lead the Variant Representation team and build out the VRS specification.
Angela Page 3:10
That’s great. Thank you both so much. And I think that’s a great segue into my first question, which is, what is the challenge? What is the Genomic Knowledge Standards Work Stream remit, and why does it exist? What challenge in the world is it trying to solve?
Larry Babb 3:23
So like I said, I’ve been working with Heidi and the clinical genetics community for about 15 years. And it was way back at the beginning of that time, when we were trying very hard to produce clinical genetic test results, with variation interpreted and classified for a particular case, and return them to a patient and a clinician in an electronic health record.
And when we pursued this endeavour, many years ago, it became very clear, on the lab side producing these classifications — these curated bits of knowledge around a particular kind of variation that may be pathogenic for disease — that there was a need to create a database where variants were abstracted away from the case-level data, so you could generate and collect evidence about them, and then curate it to support your classifications.
And there was really no model, no standard out there. Now, at the beginning, we used standards that were “standards”, like HGVS, which we’ll maybe talk about in a little bit, or ISCN. But there was really nothing that was standardised at that point.
So for years since, every time we have built one of these applications or pieces of software to exchange with these different communities, it’s become very clear that there needs to be some kind of data structure, some kind of format, where we can reliably send this information to each other and trust that it’s being done accurately, precisely.
Otherwise, clinical care will never benefit from it.
Angela Page 5:01
Yeah. So, that’s the question I really want to get out there, that last thing you said. How will it change clinical care to solve that problem that you’re talking about?
Alex Wagner 5:08
So, you know, right now, when we try and evaluate variants that we observe in the clinical setting — so I work at a children’s hospital, the Institute for Genomic Medicine. And when we try and solve a case here, you know, we have variant scientists who sit and evaluate the individual variants observed in a tumour, for example, and try and determine what their clinical significance is.
And this process takes them a very long time, because they have to get into this headspace of: okay, what’s all the content of the databases I’m looking at? And how does that relate to the variants that we observed in our clinical assays? And they have to apply a lot of their own knowledge to match up those concepts.
And a big reason for this is that the computable format for representing that knowledge, and those knowledge bases, doesn’t really sync up with the sort of structured formats that already exist for variants that come out of these assays. And we’re all familiar with those things, like VCF and HGVS.
And so being able to kind of associate computationally what’s coming out of these machines, and these assays, with what’s in these knowledge bases is a big gap.
And what that means is, of course, that these analysts need to spend a lot of time thinking about how to connect those dots.
And I feel like what we’re trying to accomplish here is provide that framework by which a machine, or a clinical decision support system, can collect that evidence and apply it programmatically to the variants that you observe in an assay — and thus save those analysts that time of doing that matching, and they can spend more of their time focusing on kind of the validation of that matching.
Angela Page 6:38
That’s great. Something you said in there I want to pick apart — and maybe this speaks to something Larry said also — something “we’re all familiar with”: HGVS and VCF.
So I’m not familiar with these things! I know them only because you guys have talked to me about them before. And I think that Larry talked about the landscape of standards that are out there.
Could you talk a little bit more about that and explain why this framework for harmonisation is kind of necessary? You know, we already have this standard that you guys have alluded to. What is that? And why do we need something else?
Larry Babb 7:10
Well, I want to make sure it’s clear here. And I think many people in this space know this, so we don’t often say it out loud. But this is a very big problem, right? You know, there are billions and trillions of variants out there in the human genome, and in other organisms and whatnot. There are different reference sequences that people use, different assemblies, different genome builds, and different technologies that are used to call these assays.
Our job isn’t necessarily to get all the way upstream into how the assays do their calling, and eventually realise what variant it is that they’re finding in a patient sample.
Our job is, once that’s identified, taking all those different forms, however they may come down the pipe, and putting them onto a normalised playing field where the structure and the semantics and the content are all done the same way.
There are differences related to how people refer to the position of the bases. For example, in our standard, we use what’s called an interbase coordinate system: instead of counting the positions where the actual residues live, we count the spaces in between. This is something some folks have done out in the industry, but it’s not necessarily widespread, per se. But this is something we’ve taken on.
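As a rough illustrative sketch (plain Python, not code from the VRS project): interbase coordinates behave exactly like Python string slicing, which counts the spaces between characters rather than the characters themselves.

```python
# Interbase (0-based, between-residue) coordinates, as used by VRS,
# count the spaces between bases rather than the bases themselves.
# Python string slicing works the same way, so the arithmetic is direct.

seq = "ACGTACGT"

# Residue-based (1-based) position 3 is the base "G".
# In interbase terms, that single base occupies the span (2, 3).
start, end = 2, 3
assert seq[start:end] == "G"

# An insertion point between residues 4 and 5 is simply the empty
# interval (4, 4): no ambiguity about which side of a base it sits on.
ins_start, ins_end = 4, 4
assert seq[ins_start:ins_end] == ""

# Interval length is always end - start, even for insertions.
assert (end - start) == 1 and (ins_end - ins_start) == 0
```

One practical payoff is that insertions and deletions need no special-case position rules: every variant is an interval plus a replacement state.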
And we worked really hard to sort of build from the ground up some of the basic small concepts of small sequence changes. But we’re also using this platform to expand it out into much more complex things, like gene fusions and structural variants and copy number variants. So it’s all built on the same kind of information model.
And with that information model and the tools to use it, we’re very sure, we’re very confident, I should say, that this will be something that will really help the field, make this data much more interoperable, and really open up the space for data sharing.
Angela Page 9:05
Alright, so now we’ve got a little bit of the landscape, I introduced in the beginning the Variation Representation Specification (VRS). So can you tell our listeners, what is this standard? And how does it work? What does it do?
Alex Wagner 9:19
Yeah, so VRS at its core is a way of representing the fundamental bits and bytes of variation, right? So we start out with things as conceptually grounded as a genomic location: an interval or range on a sequence. And we call those sequence locations.
And then on top of that, we build more complex entities like alleles, which are the state at a given sequence location. And we can represent other forms of variation inside VRS currently: things like alleles, but also things like haplotypes and copy number variation.
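To make the location-plus-state idea concrete, a VRS-style allele can be sketched as plain JSON-like data. The field names below follow the spirit of the published VRS 1.x schema, but this is an illustrative sketch, and the sequence identifier is a made-up placeholder rather than a real digest.

```python
# A sketch of a VRS-style Allele: a sequence state ("T") asserted at a
# SequenceLocation, which is an interbase interval on a reference sequence.
# The sequence_id below is a placeholder, not a real refget digest.

allele = {
    "type": "Allele",
    "location": {
        "type": "SequenceLocation",
        "sequence_id": "ga4gh:SQ.EXAMPLE_PLACEHOLDER",
        "interval": {
            "type": "SequenceInterval",
            "start": {"type": "Number", "value": 32936731},
            "end": {"type": "Number", "value": 32936732},
        },
    },
    "state": {"type": "LiteralSequenceExpression", "sequence": "T"},
}

# The location is itself a reusable value object: the reference allele and
# every alternate allele at this position share the same SequenceLocation.
interval = allele["location"]["interval"]
assert interval["end"]["value"] - interval["start"]["value"] == 1
```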
And we try to distinguish the semantics of variation, by drawing distinctions like the difference between a tandem duplication, where the same sequence is repeated twice in a row, directly adjacent to itself, versus the sense of having two or three or four copies throughout your genome. And making those semantic distinctions is a big, important part of what we do.
And the way that we do it in VRS is through something we call value-object representation. And this is basically a minimal, core computational representation, where we represent only the fundamental characteristics that describe the variant.
And to give you an example of what I mean, you can consider, for example, a street address. This is the way that you might describe to another person where a building is, right? You know: 123 Main Street. And that’s what we think of as a label, or maybe a registered identifier.
But the value object for that building might be something more like the actual latitude and longitude coordinates. It’s the fundamental characteristic of where that location is. It’s kind of immutable, that [location] doesn’t change over time — that place is always that place. And you can always reference it, as long as you’re using the same coordinate system.
And so this is the way that VRS kind of differs from other similar frameworks. The other ones, you know, might, for example, talk about things in more descriptive terms — that are more human intelligible, that might involve registration — whereas VRS is more about those fundamental constituent elements, the salient elements, of what constitutes a variant.
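The “immutable coordinates” idea is also what makes stable, computed identifiers possible: because a value object is fully determined by its contents, hashing a canonical serialization of it yields the same identifier everywhere. The sketch below assumes GA4GH’s truncated-digest construction (base64url of the first 24 bytes of SHA-512); note that real VRS identifiers require a specific canonical JSON serialization that is not reproduced here, so the input string is a toy.

```python
import base64
import hashlib

def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: base64url of the first 24 bytes of SHA-512."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")

# In real VRS the input is a strictly canonical JSON serialization of the
# value object; this toy byte string just illustrates the mechanics.
serialized = b'{"end":32936732,"start":32936731,"state":"T"}'
digest = sha512t24u(serialized)

# 24 bytes encode to exactly 32 base64url characters with no padding,
# and the same value object always yields the same digest.
assert len(digest) == 32 and "=" not in digest
assert digest == sha512t24u(serialized)

# A type-prefixed identifier can then be minted from the digest.
identifier = "ga4gh:VA." + digest
```

Because the identifier is derived rather than assigned, two groups who never coordinate can still mint the same ID for the same variant, which is the computational analogue of two people independently quoting the same latitude and longitude.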
Angela Page 11:21
So, how is the community currently using the VRS standard? Is it, is it pretty widespread yet? Or? And what is the audience for this? Who are the people you’re trying to reach with the standard?
Alex Wagner 11:30
Yeah, I think that the primary audience for VRS, or any of the other Genomic Knowledge Standards, is people who are trying, fundamentally, to connect variation to human knowledge about variation — and, in particular, how you do that across different systems and different exchanges.
And where it’s being applied — you know, as Larry and I mentioned earlier, we each represent Driver Projects to the Global Alliance for Genomics and Health, and these are like real world data initiatives. And we’re bringing these shared concerns to the table and implementing VRS, and associated, you know, nascent GKS specifications, to help us with how we structure the content of resources like that found in ClinVar, or in CIViC, or in any of the other resources I mentioned earlier.
And so that’s primarily where we’re applying it. And then I think that there’s, of course, the broader suite of tools that we’ve developed to help facilitate community adoption. And among those, we have the VRS Python library, which is built off of a set of core utilities called the BioCommons stack, which helps us with things like how we look up reference sequences by their various identifiers, how we translate between genomic assemblies, and things like that. And between all of the tools in that stack and the VRS Python package, we’re able to do a lot of the translation work we need for our own projects.
But more importantly, this makes VRS accessible to other groups as well. And one of the things we’re doing to help facilitate community engagement — we just published on this specification last November — is actually holding a hackathon this summer.
Larry Babb 13:12
I want to make sure it’s clear to the folks listening that you know, we’re not naive in thinking that this is a trivial thing we’re doing. And to some that might think or feel like, you know, we’re trying to boil the ocean in a way.
But the fact of the matter is, this is desperately needed. People are trying to do this regardless if we’re doing it or not.
I’ve worked many years with HL7 clinical genomics work groups that are very passionate, and you know, it’s necessary for them to figure out how to represent genes and variants and sequences as they share data across the wire to the FHIR protocols that are emerging.
And in fact, that was one of the things that we’ve set up here in GA4GH is to collaborate with them to some degree, and hopefully inform them with the standard that we develop here.
Ideally, that would happen as a result of us demonstrating that these tools and the systems that we’re piloting early on are truly effective, and have the promise to solve this problem in a long-term kind of a model.
So, Alex and I are leads on these two Driver Projects, him on VICC and me on ClinGen. And we both have our little development teams here, where we’re working to build up and prove out that these tools and these standards work.
And at the end of the day, we hope that by the end of this year, maybe by Plenary, we can show folks what ClinVar and CIViC would look like, and how you would use them, in this GA4GH-standardised form.
And finally, I’d like to sort of note that it’s not just VRS. VRS is not the whole solution here. We’re working with the other GKS teams, the refget team that does the sequence modelling and the Variant Annotation team that is working on sort of the structure of the knowledge statements themselves.
These three elements — the Variant Annotation, the Variant Representation, and the refget work — all work together to solve this problem of representing knowledge so that it can be computable, scalable, interoperable, and hopefully open up the floodgates to greater discovery.
Angela Page 15:28
That’s great, Larry, thank you. And I was thinking of something that I’ve heard you talk about before, which I hope I phrase this correctly, but sort of the commercial incentive behind this. You’re saying you’re not boiling the ocean; it’s a really big problem, but it’s really needed, and there’s value in doing this. So, how is this going to help, you know, the commercial clinical labs that you talk about all the time? Like, what’s in it for them?
Larry Babb 15:47
So in the past, I have worked for commercial labs. In fact, we wrote a product that enabled us to take clinical genetic test results from the lab, put them into an electronic form, and send them over the wire to a provider.
One of the biggest hurdles that we had, one of the biggest challenges that that whole industry faced, was that connection: how do we get this genetic data into the electronic health record system? And this lab report — you know, typically, today, if you do any kind of research on this, you’ll find that almost 100% of these reports are in some kind of a PDF form, or a fax image, or something like that.
And that’s almost useless to being able to discover and pull out the elements to figure out if there’s a correlation between certain demographic groups and certain variations in certain diseases.
And we really need to get this data in there in electronic form, not only to do discovery and research on this data, but also to be able to inform patients and doctors, when there’s updates made to those variants, in terms of the knowledge. New knowledge is found outside the system, and you want to report it back to people that already have these existing variants — you want to give them the best latest information in real time, so that those doctors can treat them properly.
Alex Wagner 17:13
There’s also a lot to be said about the people that are consuming this information and how it helps them, right? So, the other side of the, you know, commercial perspective: these are healthcare providers that are looking to consume data about genomic variation and use that knowledge to help drive clinical decision making.
And you know, as we move towards an ecosystem of federated knowledge providers, because the space of genomic knowledge is so large, the ability to pull that data in and use it in its native format — without going through all the data munging and data harmonisation exercises that are so challenging — opens the doors to allow community providers to benefit as much as the large academic hospitals that can afford teams to do all of that harmonisation work.
So one of the big advantages to us building the standards by which these knowledge providers can do this is to provide the means by which possible knowledge consumers can collect that data and use it in community practice, and not just at large academic centres.
Angela Page 18:18
That’s really important. That leads me to another question sort of about the community. So you know, the three of us are all based in the US, and we’ve mentioned a few different Driver Projects that are international projects, but based in the US. How is the broader international community of GA4GH participating in GKS and in the VRS development? And you know, the medical system, the healthcare record, is different in every single country. So is this really transferable for international data exchange?
Larry Babb 18:44
I would say — my perspective on this, just listening to and watching some of the groups that we get to work with in the Global Alliance, over in Europe in particular — that they actually have an advantage, in a way. Because most of them are working within national healthcare systems, where they get that sort of government support around the idea of structuring the data. So they’re not as encumbered by all these different vendors pushing their different kinds of data exchange “standards” — Epic versus this versus that.
Now, I would say that those groups in Europe have a much better likelihood of being able to stand up and realise the value of GKS, and VRS in particular. And I think they are. We’re engaged on VRS and VA with some groups at EBI that are looking to take some of the data they produce and put it into a standard form using VRS. And we hope that continues. You know, we’re always talking with different people on the ELIXIR project and some of the other groups over there. But let’s see — maybe Alex has some other things he can add here?
Alex Wagner 19:51
No, I think that’s a great point. And to a broader perspective, you know, what we’re trying to build here with VRS, as we mentioned before — this is unlike things like, you know, VCF, for example, which has its own kind of file format that is, you know, structured and has its own parameters under which you can fill it out.
VRS is designed to be a data object that you can stick into more complex documents and structures. And it fits in a lot of different places and contexts. And what’s useful about that is that you can then put it into larger, more complex standards.
I mean, we’re talking about electronic health records, right? Another big GA4GH standard that recently came out is the Phenopackets standard, version two. And VRS is part of that: we can help represent variation inside of Phenopackets — which is a larger, more complex data structure — as something that you can slot in, or mix into the larger picture.
Angela Page 20:39
I feel like I might finally be starting to understand VRS after having been talking to you guys about this for a few years now. Are you working with the Beacon v2 team as well in the way that they’re trying to turn up clinical information? Is there any overlap there?
Alex Wagner 20:55
Yeah, I’m glad you mentioned that. So Beacon also adopted the VRS standard as one of its variation formats. They still have the old Beacon v1 format as well, which I think they call legacy variation, alongside VRS variation. So they do support both.
And what’s nice about Beacon is that it actually natively implements the same framework that VRS does, which is JSON Schema — a way for computers to talk to one another in a structured way. And so VRS actually plays quite nicely with the Beacon v2 standard, which was recently approved.
And more than that, you had asked about the international community and other groups participating, and I’d forgotten to mention that there are a number of Driver Projects that are very interested in adopting this.
So, like, our colleagues over in the SPHN, the Swiss Personalised Health Network, have a strong interest in adopting these standards as they mature, as do some of our colleagues in other large networks. Larry mentioned these government-supported hospital networks; for example, our colleagues over at ZPM, the Centers for Personalised Medicine in Germany, which is part of the German national healthcare system.
So there are large national initiatives, I think, who have a strong interest in adopting and applying these standards as they develop. And I think that the real challenge as we go forward is just making sure that we are communicating clearly with them about where the standards are in terms of their development, how they can get engaged, and how they can help continue to build this community, and the tooling around it, so that they can have a robust standard that stands the test of time.
Angela Page 22:24
That’s fabulous, exactly where I was about to go next. How can people get involved in this? What is your call to action here for listeners?
Larry Babb 22:31
It’s a great question, the call to action. I would say, first of all, the first thing I would recommend is to go out to the Global Alliance website and find VRS. We have our own specs up there at VRS.ga4gh.org. That would be a great starting point.
We have a nice article we wrote recently in Cell Genomics, published a few months ago, that is referenced there on the front page of the VRS spec; you can link in and read it. That’ll give you some real technical details on what it is and where it’s going.
We’re constantly updating it. We have a draft branch of VRS that we’re always trying to improve and get out. Alex and I, right now, are very much heads down in trying to develop and deliver a demonstrable set of real-world projects through this CIViC data and the ClinVar data for ClinGen — and really get that formalised, both in this nascent, early VA spec that’s come out and with our more mature VRS spec embedded in it. And we really would love to have that in hand and ready to distribute and share with the public by the Plenary in September.
For folks to get involved with us, we have meetings every other Monday, you can sort of reach out to GA4GH and sign up and get on that list. That’s a great way to do it.
But if anyone wants to go out to our VRS Python GitHub repo and download it, you know, I can tell you that Alex’s team is working on some interesting things to make it easier for people to come up to speed on it. Maybe, Alex, you can talk about some of the notebook stuff.
Alex Wagner 23:49
Yeah, so one of the things we’re developing — to help people bridge from where they are currently to using this computable framework and VRS — are tools that allow for the translation of these different variant representation formats into VRS. And we have a variation normalisation service; we have, as I mentioned, the VRS Python library.
But for people who are just getting started and just want to learn about VRS, it can be a bit of a lift to get from absolutely no information about it to “I’ve got this production translation service running.” And it involves spinning up these sequence databases and some translation tools, and a lot of stuff that you only need to do once, but it is still a little bit of effort to get going, especially if you’re trying to bridge from other formats.
And so one of the things we’re doing is building out some native web services, like cloud-based notebooks, where you can go out and put a URL in your browser, and you get a little Jupyter Notebook. And you can play with the cells and start making VRS objects immediately, no installation required, so you can just start playing with it. And this is one of the things that we’re going to be talking about this summer at the Bioinformatics Open Source Conference at ISMB ’22. And we’re going to be playing with it a little bit at our hackathon.
Angela Page 25:17
Alright, so you’ve mentioned a few of the variant classes that you guys have rolled out already. Where are you going next? What is on the radar for the VRS team, and where maybe people could see themselves getting involved in sort of the next steps of the work?
Alex Wagner 25:29
Yeah, so where are we going next? Okay, so I think that one of the big gaps that remains for variant representation is how — you know, I mentioned earlier bridging from where we are today to where you want to be tomorrow.
There’s an entire missing vocabulary around variation. We call this categorical variation. And this vocabulary is really around how you talk about variation as a domain entity: so like a collection of different variants that all kind of correspond to that same piece of knowledge.
You know, if you do a clinical trial, and you get a bunch of patients, and they all have a shared characteristic of their variation, you report on that characteristic. There’s no language for describing that characteristic. All you can do is represent the individual variant that was observed.
And what we do in knowledge bases today is we just pick kind of like one representative context, and we hope that there’s a match. But you might miss a bunch of stuff, because you pick the representative context that doesn’t match the context you observed in your patient.
And so what we want to do is provide that vocabulary, that syntax, for saying: here’s the collection of variants that would match this knowledge. We can then build an engine to match the observed variants in your assay to that knowledge.
And that categorical variation specification is part of what we’re calling VRSATILE, which is the VRS Added Tools for Interoperable Loquacious Exchange. (Previously known as interoperable “Larry” exchange!)
And, you know, the VRSATILE Framework also contains this other specification called Value Object Descriptors. And Value Object Descriptors are a way of being able to put a bunch of different human-readable labels on top of these computable objects.
So if you’re used to talking about 123 Main Street, you can put 123 Main Street on top of the x and y coordinates for your geographical position. So that when people receive that, they’ve got both that computable form that they can use for that exchange, but they can also put the labels on that they’re used to seeing.
And these two things together comprise VRSATILE.
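The descriptor idea can be sketched in a few lines of illustrative data. The field names below are assumptions for the sake of the example, not the exact VRSATILE schema: the point is simply that human-readable labels sit on top of a stable computable identifier.

```python
# Illustrative sketch of a value object descriptor: human-friendly labels
# layered over a computable value object. Field names and the identifier
# below are placeholders, not the exact VRSATILE schema.

variation = {"type": "Allele", "_id": "ga4gh:VA.EXAMPLE_PLACEHOLDER"}

descriptor = {
    "type": "VariationDescriptor",
    "variation_id": variation["_id"],   # the computable "coordinates"
    "label": "BRAF V600E",              # the "123 Main Street" label
    "xrefs": ["clinvar:13961"],         # illustrative cross-reference
}

# Exchange and matching happen on the stable identifier; the labels are
# decoration, so two descriptors with different labels still match if
# they point at the same variation_id.
assert descriptor["variation_id"] == variation["_id"]
```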
Larry Babb 27:22
We’re working with some already well-established resources, like we’ve already talked about: ClinVar and CIViC. We talk to NCBI and ClinVar and keep them up to date on what we’re doing. We’ve actually talked to the folks that put together dbSNP, and we’ve actually gotten an early version of VRS into the dbSNP data files. When you download them, I think they have the ability, in their API service, to look up by VRS IDs, or at least to get the VRS IDs for dbSNP references.
But we’ve also worked with the ClinGen Allele Registry, which has become a pretty well-known resource and service that many folks in the industry are looking towards to solve an even bigger problem than what VRS is dealing with. You know, not only are we dealing with contextualising and normalising the variants, but there’s this notion of when you have a bunch of variants that are all sort of derived from each other: you have this one change down in your DNA, and that ends up creating a change in certain transcripts that it happens to align with. And therefore it changes the protein that those transcripts produce.
A lot of folks like to think of all of those changes, whether it’s the protein, the transcript, or the genomic DNA change, as sort of the same variant. That notion is called canonical variation. And the ClinGen Allele Registry deals with this very well. They often refer to it as de-duplicating variants, so that you can really help discover and find them, and find the information related to all the different forms of it.
This is a huge deal in the curation work that is done by so many industries out there that are trying to collect knowledge around variants and trying to help learn whether those variants are pathogenic for disease or deleterious, or benign, or, you know, have some effect on therapeutic response for a drug, or whatever it might be.
So we are working with all those groups. And we will continue to build this up and try to get the entire community working on all parts.
So if you’re out there, and you’re working on anything related to this, and you want to jump in and test out how you can use VRS to be part of the game: come and join us at one of our meetings, send us an email, get on the GitHub repository. You’ll see Alex and me littered all over there. And you can reach out to us, and we will do our best to help you become engaged.
Angela Page 29:45
Well, I think that is a fabulous note to end on. Thank you both so much for being here. It’s always a pleasure to speak with you.
Alex Wagner 29:51
Thanks for having us. It was a real pleasure to be here today.
Larry Babb 29:53
Yes, very enjoyable. Thank you, Angela.
Angela Page 29:58
Thank you for listening to the OmicsXchange, a podcast of the Global Alliance for Genomics and Health. The OmicsXchange is produced by Connor Graham, Stephanie Li, and Julia Ostmann, edited by Biljana Gaic, with music created by Rishi Nag. GA4GH is the international standards organisation for genomics, aimed at accelerating human health through data sharing. I’m Angela Page, and this is the OmicsXchange.