GA4GH releases refget API for accessing genomic reference sequence data

1 Nov 2018

Refget, a new API from the Large Scale Genomics Work Stream, retrieves genomic reference sequences using “checksums”: small algorithms that tag a bit of data with an identifier that can be used to verify its integrity.

The first step of any genomic analysis is mapping the new sequence data to a reference sequence: a list of 3 billion base pairs that have been generally accepted as “normal” for a given population or subgroup. “Variation is defined as a difference from the reference,” said Andy Yates, Team Leader of the Genomics Technology Infrastructure at EMBL-EBI.

But in order to do that mapping, scientists and clinicians need something to compare a new sequence against. “A reference point, somewhere that we can ground ourselves,” Yates said. “The reference genome and reference sequences provide that grounding.”

So what happens if the reference sequence you think you’re using turns out not to be? In short, the entire genomics analysis pipeline falls apart. If you can’t trust your reference sequence, variants will be missed and normal regions will be misclassified; researchers will draw conclusions based on bad data, and clinicians will make diagnoses and treatment plans based on false variant classifications and interpretations.

It seems like an easy enough problem to avoid: just agree on the same name for every reference sequence and then everyone will know which one they’re using. There are just two problems: getting scientists to agree on nomenclature is easier said than done, and ensuring everyone uses the right name is next to impossible.

“Just because something says it is doesn’t mean it truly is,” said Yates. “So actually what we need is a way of verifying their integrity.”

Refget, a new API from the Large Scale Genomics Work Stream, does just that using an old standby trick in computer science called “checksums.” These are small algorithms that digest a stream of data and derive an identifier, which can be used to verify that the underlying reference sequence is the one the user expects.

“Here’s a little example of the idea: we have a reference sequence, we normalize that sequence, we calculate a checksum, and out pops a unique string identifier for that given sequence,” said Yates. “Refget is a method to retrieve a sequence by its derived checksum.”
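The steps Yates describes (normalize, checksum, identifier) can be sketched in a few lines of Python. This is an illustration only: the article does not name the checksum algorithms, so the use of MD5 and a truncated SHA-512 here is an assumption, as are the normalization rules (strip whitespace, uppercase) and the function names, which are hypothetical.

```python
import hashlib


def normalize(sequence: str) -> str:
    # Assumed normalization: drop all whitespace and uppercase the bases.
    return "".join(sequence.split()).upper()


def md5_identifier(sequence: str) -> str:
    # Hex-encoded MD5 digest of the normalized sequence (32 hex characters).
    return hashlib.md5(normalize(sequence).encode("ascii")).hexdigest()


def truncated_sha512_identifier(sequence: str, nbytes: int = 24) -> str:
    # Hex-encoded prefix of the SHA-512 digest; the 24-byte length is an assumption.
    digest = hashlib.sha512(normalize(sequence).encode("ascii")).digest()
    return digest[:nbytes].hex()


if __name__ == "__main__":
    raw = "acgt\nACGT\n"
    print(normalize(raw))
    print(md5_identifier(raw))
    print(truncated_sha512_identifier(raw))
```

Because the identifier is derived from the sequence content rather than from a name, two differently formatted copies of the same sequence yield the same identifier, while any change to the bases yields a different one.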

The API is not limited to genomic sequence data: it can also be used to reference chromosomes, proteins, and transcripts. Each time the API is deployed, the user chooses their data type. Currently, two live implementations have been deployed and both have been tested to demonstrate accuracy: “they return the exact same sequence,” said Yates. The resulting interoperability compliance report is also available so that users can check that their own implementation is working properly.
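As a rough sketch of what “return the exact same sequence” means in practice, the Python below fetches a sequence by checksum from several servers and confirms that each response hashes back to the identifier that was requested. The `/sequence/{checksum}` path, the plain-text response, the MD5 identifier, and all function names and server URLs are assumptions made for illustration; this is not the official compliance suite.

```python
import hashlib
import urllib.request
from typing import Callable, Sequence


def sequence_url(base_url: str, checksum: str) -> str:
    # Assumed URL layout: <server>/sequence/<checksum>.
    return f"{base_url.rstrip('/')}/sequence/{checksum}"


def fetch_sequence(base_url: str, checksum: str) -> str:
    # Request the sequence as plain text from one (hypothetical) server.
    request = urllib.request.Request(
        sequence_url(base_url, checksum), headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("ascii")


def matches_checksum(sequence: str, checksum: str) -> bool:
    # Recompute the (assumed) MD5 identifier and compare it with the request.
    return hashlib.md5(sequence.encode("ascii")).hexdigest() == checksum


def servers_agree(
    base_urls: Sequence[str],
    checksum: str,
    fetch: Callable[[str, str], str] = fetch_sequence,
) -> bool:
    # True only if at least one server is given and every server returns a
    # sequence whose checksum equals the identifier that was asked for.
    sequences = [fetch(url, checksum) for url in base_urls]
    return bool(sequences) and all(matches_checksum(s, checksum) for s in sequences)
```

The `fetch` parameter is injectable so the check can be exercised without network access, for example `servers_agree(urls, checksum, fetch=lambda url, chk: "ACGT")`.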

In future work, LSG plans to develop a method for verifying bundles of references, whether genome, transcriptome, or proteome, and to distribute the API in a cloud environment so that users can access references around the globe without having to connect to an on-site server at any given institution.

Importantly, refget is a foundational GA4GH API as it underpins several other tools in the genomic data toolkit, including htsget and the CRAM file format.

To learn more about the refget API and how you can implement it in your own pipelines, please join Yates for a live webinar on December 4 at 4pm GMT. Register here. To get involved with future developments of the API, contact LSG Program Manager Rishi Nag.

Related Work Streams

Large-Scale Genomics (LSG) Work Stream
