25 March 2019
For most of its life, genomics has been primarily a research field. But as the cost of sequencing plummets and its utility for diagnosing, treating, and preventing disease becomes more apparent, genomics is moving into healthcare. This shift brings new funding mechanisms and larger participant cohorts, and with them a rapidly growing volume of genomic data that must be stored.
The Global Alliance for Genomics and Health (GA4GH) has estimated that by 2025, 60 million human genomes will have been sequenced around the globe, with the majority of those coming from healthcare. Each sequenced genome generates roughly 30-fold more data than the genome itself, because of the redundancy needed to account for errors in sequencing, base calling, and genome alignment. So while a single genome of 3 billion base pairs could theoretically be stored in 0.75 GB (4 nucleotide bases per byte), actual storage requirements are much greater. Coupled with an assortment of metadata types (such as sequence names and quality values), genomic data could demand anywhere from 2 to 40 exabytes of storage per year.
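The arithmetic behind the theoretical minimum quoted above can be checked in a few lines. This is only a back-of-the-envelope sketch: the constants are the figures given in this article, the function names are illustrative, and the result counts bases alone, summed over all 60 million genomes. It therefore does not reproduce the 2 to 40 exabytes per year estimate, which also includes names, quality values, and other metadata.

```python
# Back-of-the-envelope storage estimate using only figures quoted in the article.
GENOME_BASES = 3_000_000_000     # ~3 billion base pairs in one human genome
BASES_PER_BYTE = 4               # 2 bits per base (A, C, G, T) -> 4 bases per byte
FOLD_REDUNDANCY = 30             # "30-fold more data" per sequenced genome
GENOMES_BY_2025 = 60_000_000     # GA4GH estimate


def packed_genome_bytes(n_bases=GENOME_BASES, bases_per_byte=BASES_PER_BYTE):
    """Bytes needed to store one genome at the theoretical minimum."""
    return n_bases // bases_per_byte


def sequenced_genome_bytes(fold=FOLD_REDUNDANCY):
    """Bytes for the bases of one sequenced genome, redundancy included."""
    return packed_genome_bytes() * fold


single = packed_genome_bytes()            # one genome, no redundancy
per_genome = sequenced_genome_bytes()     # one genome, 30-fold redundancy
total = per_genome * GENOMES_BY_2025      # all genomes, bases only

print(f"theoretical minimum: {single / 1e9:.2f} GB")
print(f"with 30-fold data:   {per_genome / 1e9:.1f} GB")
print(f"60 million genomes:  {total / 1e18:.2f} EB (bases only)")
```

Even before any metadata is added, the bases alone land in exabyte territory, which is why the choice of file format matters.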
That would put genomics on par with or ahead of astronomy (1 exabyte/year), particle physics (75+ petabytes/year), and YouTube (1-2 exabytes/year), three of history’s most prolific data generators to date, and make it one of the world’s most storage-heavy fields.
Traditionally, sequence reads are stored in one of two file types: a human-readable text file of sequence data called SAM, or its binary counterpart, BAM. Together with formats for storing variant call data (VCF/BCF), SAM and BAM are maintained by the GA4GH Large Scale Genomics Work Stream and therefore represent the community standards for genomic data storage. Collectively they have been cited more than 20,000 times. But as datasets soar in size and number, institutions need a more efficient data compression mechanism that is interoperable with the field’s existing tools and best practices.
Enter CRAM, a more compact alternative to BAM. Also maintained by GA4GH, CRAM is quickly becoming the field’s preferred file format for storing sequence reads.
While BAM files are considerably smaller than SAM files, they rely on general-purpose compression methods such as ZIP, which only get us so far. “There’s a good reason we don’t use ZIP for images; we use custom algorithms written for that task, such as JPEG or PNG,” said James Bonfield, principal software developer at the Wellcome Sanger Institute. “This is what CRAM is doing. CRAM is a custom algorithm written to compress the BAM data to a much smaller size.”
CRAM relies on two primary techniques. First, it works from sequence data aligned to a reference and stores only the differences from that reference rather than every base. Second, it breaks the dataset that comes off a sequencer into its component parts (e.g., sequence names, genetic sequences, and quality values) and compresses each independently using an algorithm suited to that data type.
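The two ideas can be illustrated with a toy sketch. This is not the CRAM format or its actual codecs: the reference string, the record layout, the function names, and the pairing of zlib and bz2 with particular streams are all assumptions made for illustration. It only shows what "store the differences" and "compress each data type separately" mean in practice.

```python
import bz2
import json
import zlib

# Hypothetical toy reference; a real reference genome is billions of bases long.
REFERENCE = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"


def encode_read(pos, seq, reference=REFERENCE):
    """Describe an aligned read as (position, length, mismatches vs. reference)."""
    window = reference[pos:pos + len(seq)]
    diffs = [(i, base) for i, (base, ref) in enumerate(zip(seq, window)) if base != ref]
    return pos, len(seq), diffs


def decode_read(pos, length, diffs, reference=REFERENCE):
    """Rebuild the read by copying the reference and re-applying the mismatches."""
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


def compress_streams(records):
    """Split (name, pos, seq, qual) records by data type and compress each stream.

    The codec assigned to each stream here is an arbitrary illustrative choice.
    """
    names = "\n".join(r[0] for r in records).encode()
    quals = "\n".join(r[3] for r in records).encode()
    diffs = json.dumps([encode_read(r[1], r[2]) for r in records]).encode()
    return {
        "names": zlib.compress(names),
        "diffs": zlib.compress(diffs),
        "qualities": bz2.compress(quals),
    }


RECORDS = [
    ("read1", 0, "ACGTACGTTA", "IIIIIIIIII"),   # matches the reference exactly
    ("read2", 4, "ACGATAGCCG", "IIIHIIIIII"),   # one mismatch at offset 3
]

streams = compress_streams(RECORDS)
for name, pos, seq, _ in RECORDS:
    encoded = encode_read(pos, seq)
    assert decode_read(*encoded) == seq          # the encoding is lossless
    print(name, encoded)
```

A read that matches the reference reduces to a position and a length, so the more closely a sample resembles the reference, the less there is to store.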
CRAM has been implemented in the world’s most widely used software libraries for genomics (htslib and htsjdk) and is being adopted by leading genomics institutions around the globe, including Genomics England, the Broad Institute of MIT and Harvard, H3Africa, Illumina, Inc., Sweden’s National Genomics Infrastructure, EMBL’s European Bioinformatics Institute (EMBL-EBI), and many more. These institutions report storage and cost savings of up to 50 percent as a result.
“Even if an organization is not producing large volumes of data, it can be very beneficial to use the same file formats that other organizations are using,” said Bonfield. “CRAM can help in this regard and we don’t have to have lots of different formats and lots of different standards.”
“It would be really impossible to do the science that we do at the scale that we do it, if we did not agree on the basic standards of our data,” said Ewan Birney, director of EMBL-EBI and chair of GA4GH.
Since the early days of the Human Genome Project, the genomics community has made a deliberate commitment to keeping its data open in order to afford the greatest benefit to humanity. To honour this commitment, open standards and open software for analysing that data must also be available.
“Standards are nearly always best open. Open means that many people can implement them and use them. Open also means many people can provide information with them,” said Birney. “In something like genomics, where there’s a lot of innovation — for example, single cell data, or long read data — we also need to be able to adapt to the way the science is changing and that means being responsive to the community.”
GA4GH represents the next generation of open genomics resources; all of its standards and deliverables, including CRAM, are developed by the community that uses them and are freely available to all.
“CRAM is not tied to any specific sequencing instrument, so it can be used for both long and short read experiments,” said Bonfield. “And it’s also not tied to any specific type of experiment, so it can be used for both whole genome shotgun, or exome sequencing, or any of the other targeted sequencing experiments.”
While agnostic to instruments and experiments, CRAM is intimately connected to the other GA4GH standards such as htsget, for streaming sequence data, and refget, for retrieving the reference genome. Each of these was built upon existing standards such as BAM and therefore works seamlessly with CRAM. Similarly, CRAM files can be easily integrated into institutional pipelines and workflows because they are interchangeable with the files already being used.
GA4GH brings together the international community currently engaged in research and healthcare-based genomics, from major genomics initiatives embedded within national health systems, such as Genomics England, to large-scale research programs such as the NIH All of Us program. By convening this diverse group to co-develop standards such as CRAM, the community can ensure its tools are suitable for everybody, not just one individual or institution.