High-throughput sequencing projects continue to produce data at a massive and still accelerating scale. The generated information is having an impact on everything from basic science to healthcare provision. The challenges for large scale genomics are clear. The vast quantities of raw sequencing data and derived results mean that we must continue to develop highly efficient and standardised formats and APIs to store, access, and analyse sequencing reads, genetic variation, and gene expression information. To ensure the data remains interoperable, we must continue to adapt and evolve genomics formats as new sequencing assays (e.g. single-cell, methylation/base modifications) and platforms (e.g. long reads) are created and as scale increases (e.g. population scale variation – UK Biobank). Perhaps the most significant challenge is to move from the traditional purely file-based approach to storing, accessing, and analysing genetic data, to one where we are presented with standardised APIs.
There are a variety of incumbent file formats: SAM/BAM/CRAM for sequencing read data, VCF/BCF for genetic variation, and tabular formats for expression and genomic ranges (e.g. BED). Large cohorts of genetic variation data at increasing scale are now available from multiple different resources (e.g., ExAC/gnomAD, dbSNP, 1000 Genomes, UK Biobank, EVA, EGA).
As genome sequencing becomes integrated into national and regional healthcare initiatives, it is not realistic to assume that all human genetic and phenotypic data will be stored in a small number of large repositories. Carrying out queries remotely across these repositories opens up the possibility of making new disease associations without the need to physically download all of the data to a single location. Reliably processing and managing information at this scale requires robust software architecture and widely supported standards. The Large Scale Genomics Work Stream engages sequencing vendors, key sequencing and bioinformatics tool developers, and population scale driver projects to ensure that the primary data formats and libraries are evolved and adapted to meet this need. It also coordinates closely with a variety of Driver Projects to support adoption and implementation of APIs for access to large scale projects or databases. The guiding principles of this work stream will be:
VCF is a text file format (usually stored in compressed form). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome. The format can also carry genotype information on samples for each position.
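The three kinds of line described above can be told apart by their prefix. The following is a minimal sketch, not a complete VCF parser: the function name `parse_vcf` and the two-sample example record are made up for illustration, and INFO and FORMAT fields are left as raw strings.

```python
import io

def parse_vcf(handle):
    """Split a VCF text stream into meta lines, header columns, and data records."""
    meta, columns, records = [], [], []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information lines, e.g. ##fileformat=VCFv4.3
            meta.append(line)
        elif line.startswith("#"):
            # The single header line naming the columns (and any samples)
            columns = line[1:].split("\t")
        else:
            # Data lines: one position in the genome per line
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records

# Illustrative, made-up content (not taken from any real dataset).
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.3",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "sampleA", "sampleB"]),
    "\t".join(["1", "12345", ".", "A", "G", "50", "PASS",
               ".", "GT", "0/1", "1/1"]),
]) + "\n"

meta, columns, records = parse_vcf(io.StringIO(EXAMPLE_VCF))
samples = columns[9:]  # sample columns follow the nine fixed columns
genotypes = {name: records[0][name] for name in samples}
print(genotypes)
```

For a compressed file, the same function could be given a handle from `gzip.open(path, "rt")`, since bgzip output is gzip-compatible. Production code would normally rely on an established library rather than a hand-written parser.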
This project focuses on improvements to the VCF format (e.g. representation of structural variants, scaling to population size collections of genotypes) as well as looking into potential future formats to handle the increasing scale of large sequencing projects more efficiently.
SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. The BAM file format is the binary equivalent of the SAM format. CRAM is a related compressed columnar file format that optionally uses differences to a genomic reference to reduce storage cost. The vast majority of sequencing data produced worldwide is stored using the SAM representation, underscoring its importance as a key format for both research and healthcare genomics.
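To make the two sections concrete, here is a small sketch that separates header lines (which start with `@`) from alignment lines and reads the eleven mandatory TAB-delimited fields. The names `Alignment` and `parse_sam` and the example read are invented for illustration; real tools would use an existing SAM/BAM/CRAM library.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict

def parse_sam(text):
    """Return (header_lines, alignments) from SAM-formatted text."""
    headers, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            # Optional header section, e.g. @HD, @SQ, @RG, @PG
            headers.append(line)
            continue
        f = line.split("\t")
        # Optional fields after column 11 have the form TAG:TYPE:VALUE
        tags = {}
        for field in f[11:]:
            tag, typ, value = field.split(":", 2)
            tags[tag] = (typ, value)
        alignments.append(Alignment(
            f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
            f[6], int(f[7]), int(f[8]), f[9], f[10], tags))
    return headers, alignments

# Illustrative, made-up content.
EXAMPLE_SAM = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0",
])

headers, alignments = parse_sam(EXAMPLE_SAM)
first = alignments[0]
is_unmapped = bool(first.flag & 0x4)  # bit 0x4 marks an unmapped read
print(first.rname, first.pos, first.cigar, is_unmapped)
```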
This project describes ongoing additions to SAM/BAM to improve efficiency and support new sequencing platforms. For CRAM, the primary focus is on improved efficiency, random access for long reads, and CRAM file sizes for short-read data (with an expected reduction of 15-20%). A further goal is improved indexing, and especially index documentation.
htsget is a protocol for secure, efficient and reliable access to sequencing read and variation data. This project describes ongoing improvements to the protocol and implementations to support ‘two-dimensional’ slicing of very large variant (VCF) datasets, e.g., N=1,000,000 WGS. Such datasets need to be accessed efficiently not only by genomic range, but also by subsets of the cohort, potentially arbitrary client-determined subsets. To support these more complex queries, the basic protocol will need to evolve to support “POST” HTTPS requests in addition to the “query string” request format used currently.
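The difference between the two request styles can be sketched as follows. This is an illustration under stated assumptions, not the protocol definition: the server `htsget.example.org`, the dataset id, and the helper names are hypothetical, and the JSON body layout (in particular the `samples` field used for cohort subsetting) is an assumption made for this example and should be checked against the htsget specification.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://htsget.example.org"  # hypothetical server

def get_url(dataset_id, reference_name, start, end, fmt="VCF"):
    """Query-string style request: a single genomic range fits in the URL."""
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })
    return f"{BASE_URL}/variants/{dataset_id}?{query}"

def post_request(dataset_id, regions, samples, fmt="VCF"):
    """POST style request: a JSON body can carry several regions and a long
    sample list that would not fit comfortably in a URL.
    The body field names here are assumptions for illustration."""
    body = {
        "format": fmt,
        "regions": [
            {"referenceName": name, "start": start, "end": end}
            for name, start, end in regions
        ],
        "samples": list(samples),  # hypothetical field for cohort subsetting
    }
    return f"{BASE_URL}/variants/{dataset_id}", json.dumps(body)

url = get_url("cohort1", "chr1", 1000, 2000)
post_url, post_body = post_request(
    "cohort1",
    regions=[("chr1", 1000, 2000), ("chr2", 500, 900)],
    samples=["S0001", "S0423", "S9999"],
)
print(url)
print(post_url, post_body)
```

The sketch only builds the requests. In either style the client would send them over HTTPS with appropriate authorisation and then fetch the data blocks the server points to.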
Currently, the refget API focuses on single reference sequences uniquely identified by their checksums. The refget API service offering will be extended to reference sequence collections, e.g. genome assemblies, and to reverse lookup of reference sequences. Like individual reference sequences, reference sequence collections would be associated with unique identifiers computationally derived from the set of sequences themselves. The reference sequence collections are envisioned to make it easier to share and exchange commonly used and semantically meaningful sets of sequences. The reverse lookup of reference sequences would provide an alternative to checksum-based services, allowing commonly used sequence names such as ‘Chromosome 1’ from human genome assembly ‘GRCh37.p13’ to be used. As the reverse lookup of individual sequences typically requires the reference sequence collection to be specified as well, we envisage that the reverse lookup service may also be extended to reference sequence collections.
Both systems require development of a transmission format to communicate these data/concepts between clients. An additional API will be developed to support ambiguous reverse lookup systems developed from said formats.
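The ideas of checksum-derived identifiers, collection identifiers, and reverse lookup can be sketched together. Everything below is illustrative: MD5 and a truncated SHA-512 digest are shown as examples of sequence checksums, while `collection_id` and the `reverse_lookup` table are hypothetical constructions for this example and are not the algorithm or transmission format that the work stream will define. The sequences are toy strings, not real chromosomes.

```python
import base64
import hashlib

def md5_checksum(sequence):
    """MD5 hex digest of the upper-cased sequence."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()

def sha512t24u(data):
    """SHA-512 digest truncated to 24 bytes, base64url-encoded (32 chars)."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")

def sequence_id(sequence):
    """Identifier derived only from the sequence content."""
    return sha512t24u(sequence.upper().encode("ascii"))

def collection_id(sequences):
    """Hypothetical collection identifier: a digest over the sorted member
    identifiers, so it depends only on the set of sequences themselves."""
    member_ids = sorted(sequence_id(s) for s in sequences.values())
    return sha512t24u(",".join(member_ids).encode("ascii"))

# Toy "assembly" with made-up sequences.
toy_assembly = {"Chromosome 1": "ACGTACGTAC", "Chromosome 2": "ttgacca"}

# Hypothetical reverse lookup: (collection name, sequence name) -> identifier.
reverse_lookup = {
    ("toy-assembly", name): sequence_id(seq)
    for name, seq in toy_assembly.items()
}

print(md5_checksum("ACGT"))
print(collection_id(toy_assembly))
print(reverse_lookup[("toy-assembly", "Chromosome 1")])
```

Keying the lookup on both the collection and the sequence name reflects the point made above: a name such as ‘Chromosome 1’ is only unambiguous once the assembly is also specified.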
Crypt4GH is a file format that can be used to store data in an encrypted and authenticated state. Existing applications can, with minimal modification, read and write data in the encrypted format. The choice of encryption also allows the encrypted data to be read starting from any location, facilitating indexed access to files.
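Reading from any location works because the data is encrypted in independent fixed-size segments. The sketch below shows the offset arithmetic a reader might use, assuming 64 KiB plaintext segments that are each stored with a 12-byte nonce and a 16-byte authentication tag. The header length is variable, so it is passed in as a parameter; the function name `encrypted_range` and the header length of 124 bytes in the usage line are purely illustrative. No decryption is performed here.

```python
SEGMENT_PLAINTEXT = 65536   # plaintext bytes per segment (64 KiB)
NONCE_LEN = 12              # nonce stored before each encrypted segment
MAC_LEN = 16                # authentication tag stored after each segment
SEGMENT_CIPHERTEXT = NONCE_LEN + SEGMENT_PLAINTEXT + MAC_LEN  # 65564 bytes

def encrypted_range(header_len, start, length):
    """Map a plaintext byte range to the encrypted bytes that cover it.

    Returns (file_offset, byte_count, skip): read `byte_count` bytes at
    `file_offset`, decrypt segment by segment, then discard the first `skip`
    plaintext bytes. `byte_count` is an upper bound, because the final
    segment of a file may be shorter than a full segment.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    first = start // SEGMENT_PLAINTEXT
    last = (start + length - 1) // SEGMENT_PLAINTEXT
    file_offset = header_len + first * SEGMENT_CIPHERTEXT
    byte_count = (last - first + 1) * SEGMENT_CIPHERTEXT
    skip = start - first * SEGMENT_PLAINTEXT
    return file_offset, byte_count, skip

# A 10-byte read that straddles the first segment boundary needs two segments.
print(encrypted_range(header_len=124, start=65530, length=10))
```

An index that stores plaintext offsets, such as one built for the unencrypted file, can therefore still be used: each offset is translated to the enclosing encrypted segments before reading.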
The RNAget API describes a common set of endpoints for search and retrieval of processed RNA data. This currently includes feature-level expression data from RNA-Seq type assays and signal data over a range of bases from ChIP-seq, methylation or similar epigenetic experiments.