OmicsXchange Episode 12: The Need for Further Inclusion of Diversity in Studies: An Interview with Nicola Mulder

10 December 2020

driver projects, OmicsXchange, podcast

Angela Page: Welcome to the OmicsXchange, I’m Angela Page. In today’s episode, I’m speaking with Professor Nicola Mulder from the University of Cape Town about a large-scale, collaborative effort spearheaded by the Human Heredity and Health in Africa Consortium, or H3Africa, to sequence genomes from regions and countries in Africa that have historically been missed or overlooked. Their key findings were recently published in the journal Nature. Nicky has been a part of the GA4GH community since its inception, and is now a Driver Project Champion of H3Africa. Welcome, Nicky.

Nicky Mulder: It’s a pleasure to be here. Thanks again for the invitation to speak.

Angela Page: To start us off, can you give us a brief overview of H3Africa and its mission?

Nicky Mulder: H3Africa is the Human Heredity and Health in Africa Consortium. It’s a Consortium of a bunch of funded projects that are looking at the environmental and genetic basis for diseases of relevance to Africa. Each project is focusing on different diseases, and generating different genotype data, and some of them are generating microbiome data. And we found early on in the Consortium that many wanted to use a genotyping chip for genotyping their samples. And the current genotyping chips were just not appropriate because they were designed based on European populations. So we decided to design a new array. And in doing so, we then identified samples that were not represented in current whole genome sequence data, because we needed a more complete reference panel for informing that design. So about 360 samples were sequenced as part of that project, and the genome analysis working group within the Consortium, which is responsible for promoting cross-Consortium projects, worked with data providers and said, “Can we use this data to do a full in-depth analysis of African genomes?” And so we had this reference panel, and we then added additional data and used that for this particular study to further exploit that sequence data that kind of was a Consortium asset.

Angela Page: So in the Nature paper, H3Africa sequenced 426 individuals from 13 African countries, whose ancestries represented 50 different ethnolinguistic groups. What were some of the key findings?

Nicky Mulder: The paper has provided a lot of insights into both medical genetics and migration of African populations. So I think the key finding is looking at the individual populations that haven’t been sampled before and through those, we’ve found more than 3 million previously undescribed variants. So this means that they’re not present in any of the current databases or known datasets that people have access to. There are also several regions that have been found to be under strong selection. Of these, about 62 are previously unreported. And if we look at the genes that are in these, they’re mostly involved in viral immunity, DNA repair, and metabolism. We also looked at the migration patterns and admixture and splits of populations around the continent, and found that there might be a new route, one that people hadn’t really thought of or hadn’t discovered before, for the migration of the Bantu-speaking populations across the continent: it looks like they may have gone through Zambia, which was not previously known. We’ve also looked at the pathogenic variants, medically relevant variants, and found several variants that have been previously characterized as pathogenic or non-pathogenic, which are present at reasonable frequencies in these populations. So each one of the individuals had at least one variant that was classified as pathogenic in ClinVar, with a median of about seven per person. So it has basically given new insights into the characterization of some pathogenic and medically relevant variants, but also provided information about the selection pressure in the African genomes and how the migrations might have worked in the past.

Angela Page: What factors contributed to this study being the first of its kind?

Nicky Mulder: The previous studies have mostly looked at low coverage sequence data, or even prior to that, microsatellite data. But there were still gaps in terms of countries and ethnicities not represented in those studies. And what this study does is it looks at high coverage sequence, which means you can now look at rare variants, and you can look in greater depth at the data. And it filled in gaps that were previously identified. So when sampling the populations, we specifically looked at where there was no public data already available for those countries or regions. The other thing that’s novel is that this was done by African scientists on African data. Normally there are studies on African populations, but the publications are from non-African scientists, so here the African scientists were able to provide context and local information that was relevant. It was a really good demonstration of African-led science and collaborative science really involving the sample collectors and data providers. I mean the overall approach was a cross-Consortium collaboration. But we also tried to build in capacity development in that sort of approach so that if there was a team that had strong expertise in a particular area, then we had more junior scientists in there that could learn along in the process.

Angela Page: Earlier, you mentioned there are gaps in representation. How did you go about selecting the populations to fill in those gaps that you had seen from previous studies?

Nicky Mulder: We did a literature search and a database search for what datasets existed for African genomes. And so we mapped these out physically on an African map, and then looked for where there were gaps. But then we also had to think about where we had people on the ground. So we then looked at people within the Consortium, and sent out a general call to say, this is where we’re looking for gaps and who has samples that you’d like to have sequenced. And then some slightly outside the Consortium, like from our IEC member, one of them had a study in Africa and they contributed samples. So it was really looking at where geographically there are gaps, and then where we had people on the ground to collect samples from those. And they had to have the right consent in place to be able to ship those samples and have the sequencing done.

Angela Page: What work has already started or still needs to be done to get an even more complete picture?

Nicky Mulder: The working group has now started on looking at where the next set of gaps are. So we are doing long-read sequencing in just a few, because it’s so expensive. So the paper basically highlighted about five primary ethnolinguistic, sort of ancestral groups. And so we’re trying to make sure we have representation of each of those for long-read sequencing to build a reference graph and this pangenome. And we’ve also had discussions with Illumina about potentially filling some of these gaps by sequencing with short-read sequencing. So these gaps include North Africa, because North Africa is very much underrepresented, and there was a reason for that, because the North African populations are quite admixed with Europeans. But they should be included, to give the broad picture. And for example, we are also looking at the islands, so Mauritius, Madagascar, and Réunion, for example, to include the African island populations. There are also gaps in terms of the analysis. So you know, the amount of results you can put in a Nature paper is extremely small, because of the size. There was a lot more work done on it in each of those teams that we can expand on. And so there are a number of sort of smaller groups that are now taking the data and diving deeper, and so with the capacity that was developed, we’re trying to build capacity in these now more specific areas of analysis of these genomes.

Angela Page: So why is this paper important for the broader human genomics field?

Nicky Mulder: So what the paper has shown is that there is so much more diversity, and so many novel SNPs that are still to be discovered, just by filling the gaps. I mean we looked at whether adding new samples might reach some sort of saturation, and it never reached a plateau. So the more samples you add, the more novel SNPs you’re going to get. So the main importance, I think, for the broader human genomics field here is that we haven’t even come close to reaching, you know, some plateau of our knowledge of diversity, particularly in African populations. And we will always have novel SNPs to discover.
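
The saturation check described in this answer can be illustrated with a minimal Python sketch. It is not the Consortium's analysis pipeline: the function name `novel_variants_per_sample` and the toy variant labels are hypothetical, and each sample is assumed to be reducible to a set of variant identifiers. The idea is simply to count how many previously unseen variants each added sample contributes; a plateau would show up as counts that fall to zero and stay there.

```python
from typing import Iterable, List, Set


def novel_variants_per_sample(samples: Iterable[Set[str]]) -> List[int]:
    """Count the variants each sample adds that no earlier sample carried.

    `samples` is an ordered collection of variant-ID sets, one per individual.
    The IDs are arbitrary strings; this sketch does not parse real VCF data.
    """
    seen: Set[str] = set()
    novel_counts: List[int] = []
    for variants in samples:
        new_variants = set(variants) - seen  # variants not observed so far
        novel_counts.append(len(new_variants))
        seen |= new_variants
    return novel_counts


# Toy, made-up data: three individuals with overlapping variant sets.
TOY_SAMPLES = [
    {"v1", "v2", "v3"},
    {"v2", "v3", "v4"},
    {"v4", "v5"},
]

TOY_COUNTS = novel_variants_per_sample(TOY_SAMPLES)
```

In this toy input every added individual still contributes at least one new variant, which is the qualitative pattern described in the interview; with real data the counts depend on sample order, so such curves are usually averaged over many random orderings.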

Angela Page: What are the clinical implications of this study?

Nicky Mulder: Every individual had at least one ClinVar pathogenic variant, as I said, with a median of seven per person. And there were 262 unique variants that were characterized in ClinVar as either pathogenic or likely pathogenic, and of those, 21% had a minor allele frequency of more than 0.05 in at least one population. So that means that these are reasonably common in African populations. But then of these, 5% had a frequency of less than 0.05 when you looked at the combined frequency of populations in gnomAD. So what this is showing is that they are locally common in Africa, but globally rare. So the pathogenicity score and the assignment of pathogenicity to these variants were based very much on frequency and other data from non-African populations. So what it means is that when you bring the African populations in, these are actually found to be common, and therefore can’t be pathogenic as previously described. So I think the main clinical implication is that there are a number of possible misclassifications of variants as being pathogenic, when in fact, they are reasonably common in other populations.
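
The "locally common, globally rare" filter described in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 0.05 threshold is the one quoted in the interview, but the `Variant` class, the `flag_for_review` function, the population labels, and all frequencies below are hypothetical and do not come from the study, ClinVar, or gnomAD.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Variant:
    variant_id: str                   # hypothetical identifier
    population_maf: Dict[str, float]  # minor allele frequency per sampled population
    global_maf: float                 # combined frequency in a global reference


def is_locally_common(variant: Variant, threshold: float = 0.05) -> bool:
    """True if the variant exceeds the threshold in at least one population."""
    return any(maf > threshold for maf in variant.population_maf.values())


def flag_for_review(variants: List[Variant], threshold: float = 0.05) -> List[str]:
    """Return IDs of variants that are locally common but globally rare.

    Such variants are candidates for re-examining a pathogenic label;
    the flag itself is not a reclassification.
    """
    return [
        v.variant_id
        for v in variants
        if is_locally_common(v, threshold) and v.global_maf < threshold
    ]


# Made-up example records, for illustration only.
EXAMPLE_VARIANTS = [
    Variant("var_A", {"pop_1": 0.01, "pop_2": 0.02}, global_maf=0.001),  # rare everywhere
    Variant("var_B", {"pop_1": 0.12, "pop_2": 0.03}, global_maf=0.004),  # locally common, globally rare
    Variant("var_C", {"pop_1": 0.20, "pop_2": 0.15}, global_maf=0.10),   # common everywhere
]

FLAGGED = flag_for_review(EXAMPLE_VARIANTS)
```

Only `var_B` is flagged here: it passes the 0.05 threshold in one population while staying below it in the combined frequency, which mirrors the pattern the interview describes.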

Angela Page: And how will these findings impact future research and clinical care in Africa and beyond?

Nicky Mulder: The data itself has provided a huge new resource because when we’ve been trying to build reference panels (that’s why we sequenced these in the first place) there were those gaps. So I think it will impact research by having just a good and improved reference dataset for both research and for clinical data when you’re trying to filter variants. And then also for the reference panel, for example, if you’re doing genotyping arrays, it provides a good reference panel for imputation. I think it’s also to just demonstrate that there’s still so much to be explored. And within this particular dataset itself, there’s still a lot to be explored. So there’ll be a lot more research, both in terms of genetic research and clinically relevant research. So as soon as we build cohorts of disease patients, we can then compare them within the same populations. And I think it also just demonstrated the need for further inclusion of diversity in studies and in different classification algorithms.

Angela Page: Thank you so much for speaking with me today, Nicky. It was really a great conversation.

Thank you for listening to the OmicsXchange, a podcast of the Global Alliance for Genomics and Health. The OmicsXchange podcast is produced by Stephanie Li and Caity Forgey, with music created by Rishi Nag. GA4GH is the international standards organization for genomics, aimed at accelerating human health through data sharing. I’m Angela Page and this is the OmicsXchange.