Scaling VCF for a genomic revolution

8 Sep 2023

This guest blog post provides updates from the Future of VCF Working Group.

The lower cost of producing, and the increased capacity for analysing, human DNA sequencing data have led to an explosion of genetic variation data available for research.

Additionally, national healthcare initiatives are increasingly carrying out genomic sequencing. Projects like Europe’s 1+ Million Genomes Initiative, the All of Us Research Program in the United States, and Australian Genomics are especially important for research aimed at understanding common and chronic diseases at the population level with data from millions of individuals. These large genomic data initiatives have the potential to improve quality of life through better understanding of how to monitor, diagnose, and treat diseases.

This ballooning of data in human cohort studies means that ways of representing genetic variation also need to scale. 

The standard file format for representing genetic variation — Variant Call Format (VCF) — is maintained by the Global Alliance for Genomics and Health (GA4GH) Large-Scale Genomics (LSG) Work Stream. Researchers rely on VCF to produce, process, and analyse data.

“VCF is ubiquitous. If you generate genetic variant information somewhere in your workflow, there’s going to be VCF. It’s the de facto interchange format for variant data,” said Oliver Hofmann, Co-Lead of the LSG Work Stream.

You probably use VCF to efficiently store and transfer variation information. VCF also specifies how to represent genetic variation data so that any person or computer can understand them. These features are crucial for data sharing: without them, researchers would waste valuable time and resources interpreting and processing variation information.

VCF, while crucial to nearly all genetic variant bioinformatics workflows today, is at a crossroads: it must now handle rapidly growing and increasingly complex data.

“In the current VCF specification, file sizes grow superlinearly due to increasing numbers of largely rare variants discovered with increasing sample size,” said Working Group Co-Lead Albert Vernon Smith, computational geneticist and research faculty at the University of Michigan.

To tackle these kinds of challenges, interested members of the genomics and bioinformatics community come together in the GA4GH Future of VCF Working Group to brainstorm solutions, shape the future of variant representation, and explore successful approaches to put into practice. We welcome anyone interested to join.

In this post, you will gain insights into the challenges and solutions associated with scaling VCF for genetic variation data, and see how these solutions are shaping the landscape of genomics research. Read on to learn about the importance of maintaining interoperability with other standards, the impact of VCF on large-scale genomics projects, and how you can actively contribute to, and benefit from, advancements in scalable VCF.

Meet the Working Group

The Future of VCF Working Group formed in 2019 and meets monthly to discuss VCF scaling challenges and review different approaches that have been developed to solve them. The group’s main goal is to assess scaling solutions, to make standards recommendations for GA4GH, and ultimately ensure that VCF keeps up with increasing data size and complexity.

Scaling of VCF is important to maintain interoperability with other GA4GH standards, including htsget for downloading variation data from specific genomic regions; Beacon for querying for genomic variants of interest; and the CRAM and SAM/BAM file formats for representing sequencing read data.

What we’ve learned about scaling VCF

A major VCF scaling challenge results from the need to represent genetic variation across millions of individuals simultaneously. 

In VCF, each row represents a genetic variant, and each sample column records the alleles (variant or reference) observed in that sample. When you add new samples, you naturally need to add new rows to represent newly observed variants. But you also need to update all the existing rows to indicate that the old samples did not contain the new variants, or that they contained a different form of a new variant (such as a longer or shorter indel). All of these updates expand the existing data held by all other samples. This can make the file size grow faster than linearly.

Thus, the consequence of storing more complex information for more samples is that as the number of samples grows, the size of the file scales in a superlinear fashion. 
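To make the arithmetic concrete, here is a minimal Python sketch of a dense sample-by-variant matrix. It is a toy model, not a VCF parser: the function name `dense_cell_count` and the site keys are hypothetical, and it assumes every sample shares one common site and contributes one private site.

```python
def dense_cell_count(sample_variants):
    """Return the number of genotype cells after each sample is merged.

    sample_variants: list of sets, one per sample, holding the variant
    sites (hypothetical string keys) observed in that sample.
    A dense matrix needs one cell per (variant row, sample column) pair.
    """
    seen = set()
    counts = []
    for n_samples, variants in enumerate(sample_variants, start=1):
        seen |= variants                      # new rows for newly observed sites
        counts.append(len(seen) * n_samples)  # every row gets a cell per sample
    return counts


# Toy cohort (assumption): one shared site plus one private site per sample.
cohort = [{"chr1:100", f"chr1:{200 + i}"} for i in range(4)]
print(dense_cell_count(cohort))  # [2, 6, 12, 20]
```

In this toy cohort the number of samples grows 1, 2, 3, 4 while the number of cells grows 2, 6, 12, 20, because each new sample adds both a column and a row.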

Significant increases in the size of VCF files cause workflows to be slower and make sharing and analysing files more time-consuming and resource-intensive.

For example, export of data from gnomAD — the Genome Aggregation Database containing harmonised exome and genome sequencing data from large-scale sequencing projects — to VCF is in the petabyte range. Doubling the size of gnomAD, for instance, would make export and interchange in VCF format completely infeasible.

Strategies to localise these effects and remove this file size growth are evaluated by members of the Future of VCF Working Group at our monthly meetings.

One example is the Sparse Allele Vectors (SAV) file format for storing very large sets of genotypes and haplotype dosages that produces small file sizes and is optimised for fast association analysis.
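The general idea behind a sparse representation can be sketched in a few lines of Python. This is only an illustration of storing non-reference entries, not the actual SAV on-disk layout; the helper names `to_sparse` and `to_dense` are hypothetical, and alleles are assumed to be integer codes with 0 meaning reference.

```python
def to_sparse(genotypes):
    """Keep only (sample_index, allele) pairs where the allele is non-reference."""
    return [(i, g) for i, g in enumerate(genotypes) if g != 0]


def to_dense(sparse, n_samples):
    """Rebuild the full row, filling every unlisted sample with reference (0)."""
    dense = [0] * n_samples
    for i, g in sparse:
        dense[i] = g
    return dense


# One variant row across eight samples; most samples carry the reference allele.
row = [0, 0, 1, 0, 0, 0, 2, 0]
sparse = to_sparse(row)
print(sparse)  # [(2, 1), (6, 2)]
assert to_dense(sparse, len(row)) == row
```

Because most variants in a large cohort are rare, most entries in a row are reference, so the sparse form stays small even as the number of samples grows.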

Another is the Scalable Variant Call Representation (SVCR), a format which takes advantage of reference block compression to guarantee linear scaling with the number of samples.

Finally, the Sparse Project VCF (spVCF), another evolution of VCF, works by selectively reducing high-entropy quality control information, as well as run-length encoding reference coverage information.
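Run-length encoding itself is simple to illustrate. The Python sketch below is a generic encoder applied to a row of genotype strings; it is not spVCF's actual encoding, and the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(cells):
    """Collapse consecutive identical cells into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(cells)]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original list of cells."""
    return [value for value, count in runs for _ in range(count)]


# A row where most samples are homozygous reference ("0/0").
row = ["0/0", "0/0", "0/0", "0/1", "0/0", "0/0"]
runs = run_length_encode(row)
print(runs)  # [('0/0', 3), ('0/1', 1), ('0/0', 2)]
assert run_length_decode(runs) == row
```

The longer the runs of identical reference cells, the more the encoded row shrinks relative to the dense one.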

“One positive outcome from the Working Group is a much better understanding and characterisation of the scaling problems, along with a set of potential solutions for review,” says James Bonfield, maintainer of the related GA4GH CRAM standard and principal software developer at the Wellcome Sanger Institute.

What does the future of VCF look like?

The current scaling solutions being discussed can be broadly considered as three related approaches.

The first approach is to modify the VCF specification so that it scales linearly (or ideally sublinearly), as SAV, SVCR, and spVCF do. This solution is relatively easy to accomplish and could gain adoption fast. It is necessary as an interchange format between tools and for compatibility, but adoption depends on the community coalescing around a way forward.

Second is a new binary file format engineered to support large dataset sizes and different operations, for example querying all data from one sample, or querying one region across all samples. Several potential solutions using this approach already exist: for example, SAV is an extension to the binary form of VCF (BCF). This approach could be challenging for the community to adopt, as it requires multiple community libraries to implement support.

Third is an API-based solution that defines both a protocol for retrieving the data and the format in which the data are returned, which could be any VCF format. This approach is a hybrid of the first two and provides opportunities for implementations adapted to specific applications.

The quickly evolving nature of genomics research requires that VCF remain flexible to emerging uses and requirements. Therefore, we believe that this third hybrid approach, which exploits VCF specification improvements in parallel with development of protocols that interact with VCF, is the most promising moving forward.

“Successful adoption of VCF scaling solutions — like modifying the VCF specification, engineering a new binary file format, or developing an API-based solution — is vital for the scientific community. These solutions will enable the reuse of large population-based genomics and health datasets,” says Working Group Co-Lead Mallory Freeberg, coordinator for the European Genome-phenome Archive (EGA).

The popular method of virtual cohorts will also benefit from VCF scaling solutions. By using virtual cohorts, you can increase the power and reach of computational studies without needing to design, fund, and execute a single large-scale, expensive research study. This is especially important for studying rare diseases, where research often relies on smaller populations from multiple studies conducted over time. Adopting scalable and standardised VCF solutions enhances the ability to efficiently perform joint analysis of data from multiple sources as a virtual cohort. 

The Future of VCF Working Group offers an ideal collaborative setup to tackle the scaling challenges facing researchers and data service providers.

“The sensitivities associated with large-scale genetic data naturally require cautious approaches to collaboration among major projects. Given this context, the GA4GH Future of VCF Working Group is a unique forum for exchanging ideas and experience that might otherwise remain siloed,” says spVCF maintainer Mike Lin, a GA4GH contributor.

Make VCF work for you

Do you care about or have ideas for the future of VCF? Join the Working Group! It’s a great way to learn about cutting-edge solutions to scaling challenges, network with similarly invested scientists working on those solutions, and contribute to a meaningful file format that is foundational for genomics research. Here are some ways to get involved:

  • If you have a general interest in the future of VCF, join the monthly calls or our Slack channel to hear updates. You can present a use case or solution you are developing to address VCF scaling challenges (for example, if you develop or maintain a bioinformatics tool that uses VCF files) and receive feedback and ideas.
  • If you want to get involved in a more technical capacity, we invite you to engage by raising or commenting on issues and pull requests on GitHub to improve the VCF technical specification. We also welcome contributions of benchmarking data and benchmarking approaches to compare the outputs of VCF scaling solutions.
  • Finally, we are actively seeking individuals to co-maintain the VCF specification and welcome anyone interested in this role to contact GA4GH staff member Reggan Thomas.

Conclusion

The explosive growth of genetic variation data has underscored the pressing need for scalable representation methods. By investigating and proposing solutions for scaling VCF, the Future of VCF Working Group holds the key to unlocking the potential of large-scale genomics projects to advance our understanding of disease.

Joining this Working Group is not just an opportunity to contribute and network; it is an invitation to actively shape the landscape of genomics research and ensure its progress towards meaningful outcomes.

Are you planning on attending the GA4GH 11th Plenary or Connect meetings in San Francisco? Does your research benefit from, or could it be affected by, changes in the VCF standard? Are you interested in building a stronger connection to the GA4GH and genetic variation communities?

If so, we would love you to join the Future of VCF Working Group! Complete the sign-up form (choose “Large-Scale Genomics” and then “Scalable VCF”). We look forward to seeing you on a call soon.

Authors

Mallory Freeberg, James Bonfield, Albert Vernon Smith, Mike Lin, Oliver Hofmann
