GIF Spotlight: GA4GH standards in action at Ensembl

15 Apr 2026

GA4GH Implementation Forum (GIF) Spotlights showcase real-world implementations of GA4GH standards. This Spotlight looks at how Ensembl implemented the GA4GH refget standard to advance data access and availability.

By Andy Yates (EMBL’s European Bioinformatics Institute)

Ensembl is a reference genome resource supporting clinical and basic research across the tree of life, including humans, crops, animals, and bacteria. This Spotlight highlights the Ensembl project as an implementer of GA4GH standards. By demonstrating the wide applicability of GA4GH’s products beyond clinical settings, we show how they can support human health in other contexts, such as food security and antimicrobial resistance.

Approach to standards implementation

The goal of the project is to standardise data access and availability within the Ensembl resource.

1. Standardising access to reference sequences

Reference sequences are the bedrock of resources such as Ensembl. Annotations are based upon the genomes of organisms, and those same annotations in turn produce entities such as cDNAs and protein sequences. As part of an infrastructure refresh, we co-developed and adopted refget as the primary mechanism to retrieve sequences. All Ensembl-hosted sequences are available from the refget server, with each genome and mature product annotated with MD5 and sha512t24u checksums. As part of this effort, we developed two tools: Ensembl refget Proxy, which allows multiple refget servers to be presented as a single coherent implementation, and Ensembl refget, a fast and scalable implementation of the refget protocol.
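To make the two checksum types concrete, here is a minimal Python sketch of how an MD5 and a sha512t24u checksum can be derived from a sequence. sha512t24u is the GA4GH digest formed by taking the first 24 bytes of a SHA-512 digest and base64url-encoding them. The function names and the normalisation step (stripping whitespace and uppercasing) are assumptions made for illustration; this is not Ensembl's own code.

```python
import base64
import hashlib


def normalise(sequence: str) -> bytes:
    """Strip whitespace and uppercase the sequence (assumed normalisation)."""
    return "".join(sequence.split()).upper().encode("ascii")


def md5_checksum(sequence: str) -> str:
    """Hex-encoded MD5 digest of the normalised sequence."""
    return hashlib.md5(normalise(sequence)).hexdigest()


def sha512t24u(sequence: str) -> str:
    """SHA-512 digest truncated to 24 bytes, then base64url-encoded."""
    digest = hashlib.sha512(normalise(sequence)).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


if __name__ == "__main__":
    example = "ACGT"  # illustrative sequence only
    print(md5_checksum(example), sha512t24u(example))
```

Because both checksums depend only on the sequence content, any server holding the same sequence will derive the same identifiers.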

View more information about Ensembl’s refget offering.

2. Standardising variant representation and annotation results

In addition to providing refget offerings, Ensembl also:

View Ensembl’s Beacon service.

Successes, challenges, and lessons learned

Refget continues to provide sequences for our new infrastructure. Originally, we adopted the refget reference implementation and the ENA refget implementation, both fronted by Ensembl refget Proxy. However, we encountered limitations with both in this setup. The reference implementation did not scale, owing to its use of a traditional relational database management system for sequence storage and the high cardinality of sequence lengths: loading times were too high and retrieval was similarly expensive. We also found that the ENA refget implementation, whilst comprehensive, did not work well with our proxy. These limitations led us to reimplement the standard using compressed, indexable files to serve the data, which mitigates the issues around sequence lengths and enables fast access to sequences.

The GA4GH reference implementation provided a useful proof of concept and demonstrated refget was a good fit for our needs. However, it was not capable of achieving the quality of service we needed.

In addition, the standard is very amenable to reimplementation. Because the generated URLs are permanent, a new implementation can be swapped in without changing anything downstream. This is also useful when providing backups or failovers.
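The failover idea can be sketched in a few lines of Python. Since a refget sequence is addressed by its checksum under a `/sequence/{id}` path, a client or proxy can try the same path against several servers in turn. The helper names, the example hostnames, and the plain `text/plain` Accept header below are assumptions for illustration only; this is not how Ensembl refget Proxy is implemented.

```python
from typing import Callable, Iterable, Optional
from urllib.request import Request, urlopen


def sequence_url(base_url: str, seq_id: str,
                 start: Optional[int] = None, end: Optional[int] = None) -> str:
    """Build a refget sequence URL, optionally for a sub-sequence."""
    url = f"{base_url.rstrip('/')}/sequence/{seq_id}"
    params = [f"{name}={value}"
              for name, value in (("start", start), ("end", end))
              if value is not None]
    return f"{url}?{'&'.join(params)}" if params else url


def http_get_text(url: str) -> str:
    """Fetch a URL as text; HTTP and network errors surface as OSError."""
    request = Request(url, headers={"Accept": "text/plain"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("ascii")


def fetch_with_failover(seq_id: str, backends: Iterable[str],
                        fetch: Callable[[str], str] = http_get_text,
                        start: Optional[int] = None,
                        end: Optional[int] = None) -> str:
    """Try each backend in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for base_url in backends:
        try:
            return fetch(sequence_url(base_url, seq_id, start, end))
        except OSError as exc:  # includes urllib's URLError and HTTPError
            last_error = exc
    raise LookupError(f"no backend could serve {seq_id}") from last_error


if __name__ == "__main__":
    # Hypothetical hosts; replace with real refget servers before running.
    servers = ["https://refget-primary.example", "https://refget-backup.example"]
    print(sequence_url(servers[0], "some-checksum", start=0, end=10))
```

The `fetch` parameter is injectable so that the failover logic can be exercised without network access, for example by passing a stub that raises `OSError` for one host.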

If you are interested in learning more about the project, you may contact helpdesk@ensembl.org.
