Episode 15: The first gapless human genome sequence with Karen Miga


Angela Page  00:03

Hello, and welcome to the OmicsXchange. I’m Angela Page. Today, I’m speaking with Dr Karen Miga about her work in sequencing, for the first time in history, a truly complete human genome. Karen is an assistant professor in the Biomolecular Engineering Department at the University of California, Santa Cruz, and Associate Director at the UCSC Genomics Institute. She co-leads the Telomere-to-Telomere Consortium and is the Project Director for the UCSC production centre of the Human Pangenome Reference Consortium, or HPRC. GA4GH recently began working with Karen and her colleagues in the HPRC to stand up a new initiative called the Human Pangenome Project, which seeks to bring together a global community of researchers interested in developing a more representative human genome reference sequence. So, thank you so much for being here with us, Karen.

Karen Miga  00:56

Thanks so much for having me, Angela. This is great.

Angela Page  00:59

So, to get us started, could you just tell us a bit about this recent announcement that we’ve been seeing, the first gapless human genome sequence? What does that mean, and what technologies were necessary to achieve this?

Karen Miga  01:09

Yeah, the promise of the first human genome project was to get to know our genetic code base by base, completely and comprehensively. And we didn’t meet that challenge back in 2003; there was roughly 8% of our genome that was missing. These missing parts of our genome are really marked by large, persistent gaps. So, we’re talking about 200 million bases, which we’ve just turned a blind eye to in terms of research for the past two decades. So, our team really took advantage of [the] moment; this is a technological advantage that you’re getting with long reads. In the past, researchers have had to clone or try to stitch together or do genome reconstruction or assembly using short-read platforms. But now we really took advantage of two key platforms. This is Oxford Nanopore, which issued read lengths that are routinely 100,000 base pairs plus, and the PacBio HiFi data, which isn’t as long, but was wonderfully accurate. So, we’re talking about reads that are like Sanger quality. And it’s really based on both of those platforms, the highly accurate reads and the extremely long reads, that we were able to stitch together a final version of the complete genome.

And so the gaps that you see, the things that you miss doing short-read sequencing: what is it about long-read sequencing that allows you to uncover that? So, these regions that were missing were missing because they were complex, and they were hard. In other words, they were repeats. So, you had stretches of sequence that had an exact copy, if not a near-exact copy, many, many times in the genome. And so this made it very difficult to get the ordering, the localization and the mapping correct. People could make predictions in the past, but even just confirming that those predictions were correct was just beyond what we could do at the time, given the scope of technology. So, [repeat assembly] was the name of the game here.

Angela Page  03:04

And you also had an announcement back in June. So, what is the difference this time around versus what you guys released in June? 

Karen Miga  03:11

Right, so our consortium, we take pride in trying to do [an] open and early release of all of our datasets. As soon as they’re quality assessed, we release them to the public, and we also try to embrace the preprint system. And that has been incredibly great for our team; it has allowed [the] recruitment of new team members and feedback on our paper that we’ve been able to incorporate. And so, that’s why we made an earlier kind of news [in] June. But our most recent release has a couple of key advancements that would be of interest to the community. One is that we were able to include all of the ribosomal DNA arrays, which are on the acrocentric [chromosomes], which were presented as a gap in our first assembly release. And also the cell line that we were using was XX, so it didn’t contain a Y chromosome. And so our team, since doing that preprint release, put our heads down and really aimed to release a fully T2T-complete and accurate Y chromosome. So, that was part of our most recent release. We’ve been working with a lot of amazing colleagues in the field to ensure that these datasets can be accessible on a genome browser with all the RefSeq annotations and gene annotations. And that also took some maturity. The release of the first reference assembly was one moment; the next moment was building all of these really impressive biological stories or findings using that sequence. And that also kind of launched since the June release as well.

Angela Page  04:37

Interesting. That’s very intriguing. You’ve mentioned T2T now, and the consortium, so you foreshadowed my next question: what is the T2T Consortium?

Karen Miga  04:44

Right. So it started off as kind of a conversation between Adam Phillippy, who’s at NIH, and myself, at UC Santa Cruz, where we both saw the power of long reads and we just wanted to know if we could use this moment, this technology, to complete a human genome. So we didn’t have a consortium, it was just two people, you know, trying to put things together. And then we announced the complete T2T X chromosome as kind of a milestone at AGBT [in] 2019. We set up a website, calling it the Telomere-to-Telomere Consortium, or T2T Consortium, and just created an open grassroots effort saying, hey, we want to complete a genome. Who’s with us? And we just had a tremendous group effort of, I hate to say volunteers, but this was not a call to arms from any funding source. This was really a group of scientists who came together, saw the importance of reaching a complete genome, saw the moment of being able to do this, and really just made it happen.

Angela Page  05:39

That’s amazing. So yeah, I mean, I was gonna ask about collaboration, and you already got there. I mean, your first conversation with Adam was sometime before 2019, which doesn’t feel like quite a long time in scientific years to have achieved as much as you have over the last couple of years. And then I guess you might point to the technology again, there. But could you talk about that, you know, how you were able to do this so quickly?

Karen Miga  05:59

Yes, in fact, I want to really highlight how much has changed since that moment, when we first started generating what we call ultra-long sequences on the nanopore platform. These are the 100 kb plus [reads], [which] wasn’t even possible on the PromethION at that time. So, what Adam’s team did at NIH is they generated close to 100 flow cells on a MinION. For those of us who do nanopore sequencing, it seems crazy in terms of the amount of sequencing work that had to be done, which we can do routinely now.

Angela Page  06:29

So, kind of maybe along those lines, I would like to understand because my first connection to you — [my] introduction to you and your work was through the Human Pangenome Reference Consortium. So, what is that? And how does that relate or intersect with — relate to the T2T Consortium? How are these two streams of work related to each other?

Karen Miga  06:47

They are related in the sense that when we think about a complete reference genome, we need to start thinking in two dimensions. One is the one we’ve been focusing on at the beginning of this podcast, in the sense of complete, meaning all the bases are there in front of you for a genome. But that only gives you one instance, one genome. And we know that the genome that we were releasing represents a European haplotype. And where we want to go is a better genome that represents all of the genetic information for our species, or something that can be better representative of a large number of geographical, genomic haplotypes. And so that is the goal of the Human Pangenome Reference Consortium, [to] move in two directions. One is to ensure that we have the technology in place to reach these finished T2T diploid assemblies. And that’s where our team kind of steps in as one of the working groups as part of the pangenome. And in addition to that, there’s this whole other layer of how do we create something that’s a reference genome that best represents humanity? And to do that there’s a lot of thought that needs to go into whose genome, how many genomes? How do we best position this genome as a resource to benefit [most] individuals around the world? And that involves training and education and resource and tool building and all of the things that are strengthened by being partners with GA4GH, for example, which is when we first started talking. So I feel like there’s a huge amount of work to reach that second dimension, and [the] first dimension [that’s] really where the pangenome and the T2T come together.

Angela Page  08:22

That’s great, that really is helpful to have it fleshed out that way. And so, I guess you’re alluding to when you talk about this multi-dimensional reference, you’re alluding to a pangenome and the consortium is the Human Pangenome Reference Consortium. So for those who are not familiar with that concept, can you explain what a pangenome actually is? 

Karen Miga  08:40

In its most basic way of thinking about it, the pangenome is a collection of reference resources. It’s kind of a collection that can best represent common alleles or common variants in a population. People used to do pangenomics with yeast genomes and with other less, I guess, complex genomes, in the sense of trying to build this type of structure. Now, the nice thing about it is, outside of just being a collection of references, you can begin to organize it into a more beautiful data structure. And we’re playing around with a lot of different models of how to structure that now in order to identify shared regions or shared haplotypes, and regions where individuals differ, in kind of a natural way. And the way that we’re kind of landing on that challenge is using a graph genome at the moment. And we’ve seen a tremendous amount of progress not only with building the reference genome using a graph, but also with all of the necessary tooling that needs to be employed to create that ecosystem, so that people could use short reads and map, you know, almost in the same seamless way that folks have done with linear genomes, and gain just a richness of information they didn’t have before.

Angela Page  09:51

So, that is definitely, I mean, we could talk about the science, the technology behind that for probably an hour, and I have a lot of questions, but I’m gonna save them. Because what I want to get into is this next group that is emerging, which is the Human Pangenome Project, which as I understand is kind of named in honour of the Human Genome Project. It’s sort of like the next era. And you know, GA4GH and HPRC, this is, [again], where our conversation started. And this has just been, I think, launched just in the last year. So, what is the Human Pangenome Project? And how does it differ from HPRC? Then we can get into some of those questions about that.

Karen Miga  10:31

Right, so I guess the idea is that we, being the researchers around the world, need to create a resource that best serves humanity. It’s a global effort. This is something that requires a lot of individuals and researchers and expertise from around the world to kind of come together, sit at a table and try to create the next reference genome that everyone will use and build upon. This was definitely in tune with the first human genome project; it was an international project, it was by design. For that same reason, we’re hoping to create a new coordinate system that everyone can build on and everyone can benefit from. The Human Pangenome Reference Consortium was launched and is largely funded by the United States, by NIH. And we have of course contributors and collaborators from around the world. But it is still in that kind of scope, where we’re working with, as I would say, a relatively impressive number of genomes: we’re aiming over five years for about 700 phased haplotypes. But we are largely at this moment focused on resources that were previously established in [the] 1000 Genomes consortium. And we’re working to establish new recruitment centres, to invite participants into our study, and to [work with] domestic biobanks, such as BioMe in New York City, which really benefits from having a tremendous number of individuals who are in that area who offer geographical and genomic diversity just by moving [to] New York, right. And so what we would love to do is to partner with researchers outside of the United States, in a more equal way, where we can work together with GA4GH to establish something, and the training and the education and the outreach, to make sure the benefits of this pangenome extend past the grantsmanship that’s been housed here in the United States. And our grant started about when the T2T started. And so we’re in year three right now; we’ve been making tremendous progress. But it’s largely been keeping our heads down with the technical challenges of trying to do this, and it’s our hope that over the next two years of our project, we can really begin to lean in and start talking to other researchers and start working toward this new goal, this broader goal of this global initiative.

Angela Page  12:46

Would you say that this is a sort of longer-term thing, a longer-term, more sustainable sort of strategy for ensuring representation and equity in the human genome reference going forward?

Karen Miga  12:58

I would, and in fact, that’s almost exactly the point that we’re trying to hit: our genome belongs to humanity. There’s no country border. And we just need to make sure that with the tooling, the use, the ethics, the thought processes that go into this, we’re working together as scientists, as a global organization of individuals, to ensure that it benefits the most people around the world.

Angela Page  13:24

So, I don’t know, it might be too early, but who do you have on board at this point? Who are the partners in HPP right now, and who do you see coming in in the future?

Karen Miga  13:33

Right. Well, it is early, but what we’re trying to do is establish a policy paper where we can clearly, as a consortium, sit down and think about how this could happen. And then we want to put it, in the same way as the T2T, kind of on a preprint server, open access, and really use it as an opportunity to invite feedback, to invite teams to join us. Now there are some really natural teams who we’ve been working with really since we first opened our consortium. This is like the National Centre for Indigenous Genomics, [which] has always been a wonderful collaborator with us, thinking deeply about how to do engagement and think about indigenous genomics in the future. Also, the Human Heredity & Health in Africa, or H3Africa: we’ve had tremendous conversations and have been working with researchers to ensure that pangenomic development that’s happening through separate channels can [be] compatible with the tooling and standards that we’re trying to do as well. But there [are] numerous little projects like that, where we’re trying to explore how we can increase sequencing projects with collaborators, you know, that have their own funding through different mechanisms. But we’d just like this to be more open, more equitable. And in just the same way, it’s lovely that the T2T [worked] this way: kind of create that platform and open the doors and say, join us. And if you want to join us, and you have problems joining us, let us know, because we would love to do everything in our power to create some playing field of equity, if we can, to ensure there’s representation outside.

Angela Page  15:03

Yes, I think that everybody listening should hear this as an invitation, right? And there’s a website, I think. Is that it?

Karen Miga  15:13

Yes, if you google Human Pangenome Reference Consortium, it’s located there as well.

Angela Page  15:17

So, you mentioned standards. And this is obviously the question I always have to ask, being at GA4GH. So what are the sort of standards and policy opportunities in this space that GA4GH could help tackle or, you know, is there a role for GA4GH and, you know, standards in graph genomes, pangenomes, T2T, all of this?

Karen Miga  15:37

It’s harder for me to point out where there wouldn’t be a role for GA4GH at this point. I mean, there [are] just so many touchpoints where I feel like our teams could just benefit from having a closer alliance. And not only the technical aspects of this, the genomic sequencing and putting the assemblies together and trying to create this reference resource, the human pangenome. That’s kind of its own technical challenge. But all of the ethical considerations, the data-sharing platform, you know, how do you standardize the tools to map and determine [the] clinically important variants? How do you do the [educational] outreach programs to ensure equity? I mean, I don’t have time on the podcast to go into how beneficial I think it is to work closely with GA4GH to ensure that this resource is global. I think that with that type of expertise, and with that type of infrastructure, that gives so much more, I guess, confidence and hope, for me at least, that this is really going to be something that’s going to have a strong foundation as we start to move it more into a global platform.

Angela Page  16:42

So I’m gonna ask one more question. And I think maybe we’ll, we’ll stop there. But I guess just bringing it all back home. And you know, a lot of times, it can be easy to get so excited about the science that we forget the reason we’re doing this. And so could you just talk a little bit about why it is so beneficial for, you know, for patients for human biology, for clinical outcomes, to have this kind of a reference genome that is more representative and brings in geographic diversity and, and is not a single human. Just explain to me the downstream impacts that doing this is going to have?

Karen Miga  17:17

Right, I think our team has been largely focused on thinking about how this could influence new biotechnology, new clinical genomic variants that are being discovered, and new cures and hope in that direction. And a lot of the variation that our species has is large, and it’s hard to track using short reads. And it’s hard to track with GWAS, even, where we have kind of an overabundance of European ancestry banks in these particular experiments. And we know that there are opportunities to help benefit communities that have an enrichment of certain genetic traits and genetic medical conditions, that can benefit from the promise of precision medicine if we just provided the information to help guide those experiments in biotechnology. And if we can do and position that all early enough, then our hope is that, as you mentioned before, [there’d] be more of an equitable landscape here, so that folks could begin to think deeply about how to do new drug targets, how to, you know, think about how to carefully improve health care for individuals who may not be of European ancestry. We have a tremendous amount of information already in terms of thinking about this, and biotech companies are already kind of focused on this. So, it’s kind of to shift the model a bit, [to] broaden our scope, and ensure that genomic representation of healthy individuals is there, so we can start to build on top of that in a more logical way. And really it is these larger events, I think, that our team is going to bring in, that we couldn’t really detect before confidently with the shorter-read platforms. And so, we’re hoping that a large number of these structural variants will start to be part of the new narrative, too.

Angela Page  19:01

Okay, so I actually do have one more question. When we’re talking about equity, and you’re talking about these really new technologies, how well dispersed is the technology around the globe to scientists? Is it really possible to get broad representation in the HPP, you know, if the technology is not accessible everywhere, or is it?

Karen Miga  19:21

Right, there [are] a lot of challenges. I think that we would be naive to not fully understand the depth of challenge we’re up against. And one thing that we’re hoping to take advantage of is some of the progress that’s been made on cloud-based resources like the AnVIL platform, where a lot of these resources that are expensive to download and [analyze] locally can be housed in the cloud. So, that’s one mechanism. But I think a lot of researchers, in the GA4GH community in particular but also others, are trying to take advantage of [that] to make it more economical and more, I guess, accessible and equitable.

Angela Page  19:54

Well, this has been fascinating. And I think what you have just alluded to, again, is like this is all happening right now. And you’re kind of figuring these questions out as you go. And I think it’s a great time for people to get involved, clearly. So thank you. Thank you so much for talking with us. Is there anything else that you want to add or anything we didn’t get to?

Karen Miga  20:13

No, this has been great, thank you. This is fantastic. Thanks for highlighting the work. And of course, anyone who’s interested, please go check out the Human Pangenome Reference Consortium, just like the T2T, you know, we’re always looking for folks to lean in and be part of the progress here, too.

Angela Page  20:28

Awesome. Thank you so much, Karen.

Karen Miga  20:29

Thanks. Appreciate it. 

Angela Page  20:34

Thank you for listening to the OmicsXchange, a podcast of the Global Alliance for Genomics and Health. The OmicsXchange podcast is produced by Connor Graham, Stephanie Li and Julia Ostmann, and edited by Biljana Gajic, with music created by Rishi Nag. GA4GH is the international standards organization for genomics aimed at accelerating human health through data-sharing. I’m Angela Page and this is the OmicsXchange.