25 Mar 2019
CRAM, the data compression standard for genomics, is quickly becoming the field’s preferred file format for storing sequence reads.
For most of its history, genomics has been primarily a research field. But as the cost of sequencing plummets and its utility for diagnosing, treating, and preventing disease becomes more apparent, genomics is moving into healthcare. This shift brings new funding mechanisms and larger participant cohorts, and with them a rapid increase in the amount of genomic data that must be stored.
The Global Alliance for Genomics and Health (GA4GH) has estimated that by 2025, 60 million human genomes will have been sequenced around the globe, with the majority of those coming from healthcare. Each sequenced genome carries about 30-fold more data than its length suggests, because every region is read many times over to compensate for errors in sequencing, base calling, and genome alignment. So while a single genome of 3 billion base pairs could theoretically be stored in 0.75 GB (4 nucleotide bases per byte), actual storage requirements are much greater. Coupled with an assortment of metadata types (such as sequence names and quality values), the field’s storage needs could reach anywhere from 2 to 40 exabytes per year.
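The arithmetic behind those per-genome figures can be checked with a few lines of Python. This is only a rough sketch: it assumes decimal gigabytes, applies the 30-fold figure as a simple multiplier, and ignores sequence names, quality values, and other metadata. The function and constant names are illustrative, not taken from any GA4GH tool.

```python
GENOME_BASES = 3_000_000_000  # approximate length of one human genome
BASES_PER_BYTE = 4            # 2 bits per base (A, C, G, T)
REDUNDANCY = 30               # the 30-fold figure quoted above, used as a plain multiplier


def packed_size_gb(bases, redundancy=1):
    """Size in decimal gigabytes of bases packed at 2 bits each."""
    return bases * redundancy / BASES_PER_BYTE / 1e9


single_copy_gb = packed_size_gb(GENOME_BASES)
redundant_gb = packed_size_gb(GENOME_BASES, REDUNDANCY)
print(single_copy_gb)  # 0.75
print(redundant_gb)    # 22.5
```

Even before any metadata is added, the redundant reads alone take a single genome from under 1 GB to more than 20 GB.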
Astronomy (1 exabyte/year), particle physics (75+ petabytes/year), and YouTube (1-2 exabytes/year) are three of history’s most prolific data generators. Set against them, genomics is quickly becoming one of the world’s most storage-heavy industries.
Traditionally, sequence reads are stored in one of two file types: a human-readable text file of sequence data called SAM, or its binary counterpart, BAM. Together with formats for storing variant call data (VCF/BCF), SAM and BAM are maintained by the GA4GH Large Scale Genomics Work Stream and therefore represent the community standards for genomic data storage. Collectively they have been cited more than 20,000 times. But as datasets soar in size and number, institutions need a more efficient data compression mechanism that is interoperable with the field’s existing tools and best practices.
Enter CRAM: the compressed version of BAM. Also maintained by GA4GH, CRAM is quickly becoming the field’s preferred file format for storing sequence reads.
While BAM files are considerably smaller than SAM files, they use general purpose compression methods such as ZIP, which only get us so far. “There’s a good reason we don’t use ZIP for images — we use custom algorithms written for that task, such as JPEG or PNG,” said James Bonfield, principal software developer at the Wellcome Sanger Institute. “This is what CRAM is doing. CRAM is a custom algorithm written to compress the BAM data to a much smaller size.”
CRAM uses two primary compression methods. First, it aligns the sequence data to a reference and stores only the differences from that reference, rather than the whole genome. Second, it breaks the dataset that comes off a sequencer into its component parts (e.g., sequence names, genetic sequences, and quality values) and compresses each part independently, using an algorithm specific to that data type.
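A toy sketch in Python can make both ideas concrete. It is not the CRAM codec: the record layout and function names are invented for illustration, reads are assumed to align without insertions or deletions, and `zlib` stands in for the specialised per-type algorithms that CRAM actually uses.

```python
import zlib


def encode_read(reference, pos, read):
    """Keep only the position, length, and mismatches against the reference."""
    diffs = [(i, base) for i, base in enumerate(read) if reference[pos + i] != base]
    return {"pos": pos, "length": len(read), "diffs": diffs}


def decode_read(reference, record):
    """Rebuild the read by copying the reference and re-applying the mismatches."""
    start = record["pos"]
    bases = list(reference[start:start + record["length"]])
    for offset, base in record["diffs"]:
        bases[offset] = base
    return "".join(bases)


def compress_streams(reads):
    """Group each field into its own stream and compress the streams separately."""
    streams = {
        "names": "\n".join(r["name"] for r in reads),
        "quals": "\n".join(r["qual"] for r in reads),
    }
    return {field: zlib.compress(text.encode("ascii")) for field, text in streams.items()}


reference = "ACGTACGTACGTACGT"
record = encode_read(reference, 4, "ACGAACGT")
print(record["diffs"])                 # [(3, 'A')]
print(decode_read(reference, record))  # ACGAACGT

reads = [
    {"name": "read1", "qual": "IIIIHHHH"},
    {"name": "read2", "qual": "IIIIIIHH"},
]
compressed = compress_streams(reads)
print(sorted(compressed))              # ['names', 'quals']
```

In the example, an eight-base read collapses to one position and a single mismatch, while names and quality values sit in separate streams so that each can be handled by a method suited to its own patterns.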
CRAM has been implemented in the world’s most widely used software libraries for genomics (htslib and htsjdk) and is being adopted by leading genomics institutions around the globe. These include Genomics England, the Broad Institute of MIT and Harvard, H3Africa, Illumina, Inc., Sweden’s National Genomics Infrastructure, EMBL’s European Bioinformatics Institute (EMBL-EBI), and many more, which report storage and cost savings of up to 50 percent as a result.
“Even if an organization is not producing large volumes of data, it can be very beneficial to use the same file formats that other organizations are using,” said Bonfield. “CRAM can help in this regard and we don’t have to have lots of different formats and lots of different standards.”
“It would be really impossible to do the science that we do at the scale that we do it, if we did not agree on the basic standards of our data,” said Ewan Birney, director of EMBL-EBI and chair of the Global Alliance for Genomics and Health (GA4GH).
Since the early days of the Human Genome Project, the genomics community has made a deliberate commitment to keeping its data open in order to afford the most benefit to humanity. To honour this commitment, open standards and open software for analysing that data must also be available.
“Standards are nearly always best open. Open means that many people can implement them and use them. Open also means many people can provide information with them,” said Birney. “In something like genomics, where there’s a lot of innovation — for example, single cell data, or long read data — we also need to be able to adapt to the way the science is changing and that means being responsive to the community.”
GA4GH represents the next generation of open genomics resources; all of its standards and deliverables, including CRAM, are developed by the community that uses them and are freely available to all.
“CRAM is not tied to any specific sequencing instrument, so it can be used for both long and short read experiments,” said Bonfield. “And it’s also not tied to any specific type of experiment, so it can be used for both whole genome shotgun, or exome sequencing, or any of the other targeted sequencing experiments.”
While agnostic to instruments and experiments, CRAM is intimately connected to the other GA4GH standards such as htsget, for streaming sequence data, and refget, for retrieving the reference genome. Each of these was built upon existing standards such as BAM and therefore works seamlessly with CRAM. Similarly, CRAM files can be easily integrated into institutional pipelines and workflows because they are interchangeable with the files already being used.
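As a rough illustration of how these standards fit together, the sketch below builds the kind of request URL that the htsget specification defines for fetching a region of reads in CRAM format. The server address and read ID are hypothetical, and the helper function is not part of any GA4GH library. In htsget, `start` is 0-based and inclusive, and `end` is exclusive.

```python
from urllib.parse import urlencode


def htsget_reads_url(base_url, read_id, reference_name, start, end, fmt="CRAM"):
    """Build an htsget reads request for one genomic region (hypothetical helper)."""
    query = urlencode({
        "format": fmt,                    # BAM or CRAM
        "referenceName": reference_name,  # reference sequence name, e.g. chr1
        "start": start,                   # 0-based, inclusive
        "end": end,                       # 0-based, exclusive
    })
    return f"{base_url.rstrip('/')}/reads/{read_id}?{query}"


# htsget.example.org and sample1 are placeholders, not a real service or dataset.
url = htsget_reads_url("https://htsget.example.org", "sample1", "chr1", 0, 1000)
print(url)
# https://htsget.example.org/reads/sample1?format=CRAM&referenceName=chr1&start=0&end=1000
```

A server that implements htsget answers such a request with a JSON ticket listing the URLs from which the client downloads the requested slice of data.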
GA4GH brings together the international community currently engaged in research- and healthcare-based genomics, from major genomics initiatives embedded within national health systems, such as Genomics England, to large-scale research programs such as the NIH All of Us program. By convening this diverse group to co-develop standards such as CRAM, the community can ensure its tools are suitable for everybody, not just one individual or institution.