How we work

Large Scale Genomics vision statement

Read the 5-year vision statement of the work stream or read the full GA4GH Connect Strategic Plan.

Motivation and Mandate

High-throughput sequencing projects continue to produce data at a massive and still accelerating scale. The generated information is having an impact on everything from basic science to everyday healthcare. The challenges for large scale genomics are clear. The vast quantities of raw sequencing data and derived results mean that we must continue to develop highly efficient and standardised formats and APIs to store, access, and analyse sequencing reads, genetic variation, and gene expression information. Perhaps the most significant challenge is to move from the traditional, purely file-based approach to accessing and analysing genetic data to one built on standardised APIs.

Existing Standards

Incumbent file formats include SAM/BAM/CRAM for sequencing reads and VCF/BCF for genetic variation. Large cohorts of genetic variation data are now available from multiple resources (e.g., ExAC/gnomAD, dbSNP, 1000 Genomes, EVA, EGA).

Proposed Solution

As genome sequencing becomes integrated into national and regional healthcare initiatives, it is not realistic to assume that all human genetic and phenotypic data will be stored in a small number of large repositories. Carrying out queries remotely across these repositories opens up the possibility of making new disease associations without the need to download all of the data to a single location. Reliably processing and managing information at this scale requires robust software architecture and widely supported standards. The Large Scale Genomics Work Stream engages sequencing vendors and key sequencing and bioinformatics tool developers to ensure that the primary data formats and libraries evolve and adapt to meet this need. It also coordinates closely with a variety of Driver Projects to support the adoption and implementation of APIs for access to large scale projects or databases. The guiding principles of this Work Stream are:

  • Engage with driver projects and the wider genomics community to identify requirements and use-cases.
  • Build on existing standards to ensure a gradual transition to new standards.
  • Engage with key community software tool maintainers to drive adoption of standards.
  • Engage with key large data repositories to drive community adoption.
  • Measure work stream success by the adoption of standards.