1 November 2018
The first step of any genomic analysis is mapping the new sequence data to a reference sequence: a list of 3 billion base pairs that have been generally accepted as “normal” for a given population or subgroup. “Variation is defined as a difference from the reference,” said Andy Yates, Team Leader of the Genomics Technology Infrastructure at EMBL-EBI.
But in order to do that mapping, scientists and clinicians need something to compare a new sequence against. “A reference point, somewhere that we can ground ourselves,” Yates said. “The reference genome and reference sequences provide that grounding.”
So what happens if the reference sequence you think you’re using turns out not to be? In short, the entire genomics analysis pipeline falls apart. If you can’t trust your reference sequence, variants will be missed and normal regions will be misclassified; researchers will draw conclusions based on bad data, and clinicians will make diagnoses and treatment plans based on false variant classifications and interpretations.
It seems like an easy enough problem to avoid: just agree on the same name for every reference sequence and then everyone will know which one they’re using. There are just two problems: getting scientists to agree on nomenclature is easier said than done, and ensuring everyone uses the right name is next to impossible.
“Just because something says it is doesn’t mean it truly is,” said Yates. “So actually what we need is a way of verifying their integrity.”
Refget, a new API from the Large Scale Genomics (LSG) Work Stream, does just that using an old standby trick in computer science called “checksums.” These are small algorithms that digest a stream of data and derive an identifier, which can be used to verify that the underlying reference sequence is the one the user expects.
“Here’s a little example of the idea: we have a reference sequence, we normalize that sequence, we calculate a checksum, and out pops a unique string identifier for that given sequence,” said Yates. “Refget is a method to retrieve a sequence by its derived checksum.”
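The steps Yates describes (normalize, checksum, retrieve by identifier) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the refget implementation: the article does not say which digest algorithm or normalization rules refget uses, so MD5 and "strip whitespace, uppercase" are assumptions here, and `normalize`, `checksum`, `verify`, and `SequenceStore` are hypothetical names.

```python
import hashlib


def normalize(sequence: str) -> str:
    """Strip all whitespace and uppercase the residues (assumed normalization)."""
    return "".join(sequence.split()).upper()


def checksum(sequence: str) -> str:
    """Return a hex digest of the normalized sequence.

    MD5 is an assumption chosen for illustration only.
    """
    return hashlib.md5(normalize(sequence).encode("ascii")).hexdigest()


def verify(sequence: str, expected_identifier: str) -> bool:
    """Check that a sequence is the one its identifier claims it is."""
    return checksum(sequence) == expected_identifier


class SequenceStore:
    """Hypothetical in-memory stand-in for a service that serves sequences by checksum."""

    def __init__(self):
        self._by_checksum = {}

    def add(self, sequence: str) -> str:
        """Store the normalized sequence and return its derived identifier."""
        identifier = checksum(sequence)
        self._by_checksum[identifier] = normalize(sequence)
        return identifier

    def get(self, identifier: str) -> str:
        """Retrieve a sequence by its derived checksum."""
        return self._by_checksum[identifier]


store = SequenceStore()
identifier = store.add("acgt\nACGT\n")      # formatting differences are normalized away
print(store.get(identifier))                # the normalized sequence
print(verify("ACGTACGT", identifier))       # same content, same identifier
print(verify("ACGTACGA", identifier))       # one changed base, different identifier
```

Because the identifier is derived from the sequence content rather than from a name, two parties who hold the same identifier can be confident they are working with the same sequence, whatever either of them calls it.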
The API is not limited to genomic sequence data: it can also be used to reference chromosomes, proteins, and transcripts. Each time the API is deployed, the user chooses their data type. Currently, two live implementations have been deployed and both have been tested to demonstrate accuracy: “they return the exact same sequence,” said Yates. An interoperability compliance report is also available so that users can check that their own implementation is working properly.
In future work, LSG plans to develop a method for verifying bundles of references, whether genome, transcriptome, or proteome, and to distribute the API in a cloud environment so that users can access references around the globe without having to connect to an on-site server at any given institution.
Importantly, refget is a foundational GA4GH API as it underpins several other tools in the genomic data toolkit, including htsget and the CRAM file format.
To learn more about the refget API and how you can implement it in your own pipelines, please join Yates for a live webinar on December 4 at 4pm GMT. Register here. To get involved with future developments of the API, contact LSG Program Manager Rishi Nag.