
Variation Representation: a standard way of exchanging genetic variation data with precision and consistency


Image Credit: Stephanie Li, GA4GH

Between any two individuals, the human genome differs by around 0.1 percent, meaning each individual has millions of sites in their genome that make them unique. The impact of genetic variation on human health and disease is complex and immense. To accelerate research and advance patient care, it is vital to maximize the value of the world’s genetic variation data and exchange it reliably across diagnostic labs, electronic health records (EHRs), research institutions, and knowledgebases.

The GA4GH Variation Representation (VR) specification, produced by the Genomic Knowledge Standards Work Stream, provides a flexible framework of computational models, schemas, and algorithms to precisely and consistently exchange genetic variation data across communities. The specification, which was developed with input from national information resource providers, major public initiatives, and diagnostic testing laboratories, significantly reduces ambiguity in exchanging variation data. In this way, VR aims to improve the reliability and utility of the clinical annotations that are central to personalized medicine.

“The genomics community lacks a robust standard that enables the unambiguous exchange of genomic data,” said Robert Freimuth, co-lead of the Genomic Knowledge Standards Work Stream and Assistant Professor of Biomedical Informatics in the Division of Digital Health Sciences at the Mayo Clinic. “The VR-Spec is a step toward filling the gap between the exchange mechanisms used by the research, translational, and clinical communities, which is necessary for the implementation of genomic and precision medicine.” 

“VR reduces the ambiguity in the way we exchange sequence variation data,” said Reece Hart, a software engineering consultant and principal author of the specification. “By reducing the ambiguity, we improve the precision with which we use genetic variation in research and clinical settings.”

Variant data repositories share common needs, such as designating two or more variants as functionally equivalent, so there is a strong need to encode community-driven best practices in variant representation.

“In practice, each different repository has made its own decisions fairly independently, with no best practices emerging,” said UCSC bioinformatician Melissa Cline, a Driver Project Champion of the BRCA Exchange and an implementer of the VR specification. “The VR specification provides us with a common vocabulary that we can now build upon as a community to define these best practices.”

Alex Wagner, an instructor at Washington University School of Medicine who co-leads the Variant Interpretation for Cancer Consortium, also co-led VR’s development. “We have designed the specification with intent to capture difficult variation concepts, such as structural variation, in a precise, computable way to improve cross-platform search,” said Wagner. “A shared, unified model for these variation forms will be a tremendous asset for normalizing knowledge used in precision oncology.”

The VR specification consists of several key components that together produce a reliable way of describing and transferring genetic variation data:

  1. An Extensible Terminology and Information Model provides researchers, clinicians, and testing laboratories with a shared foundation for bridging patient findings with existing knowledgebases. It offers standard computational data structures for precise biological concepts, such as “allele,” “sequence,” “variation,” and “genotype.” The model is readily extensible: work on more complex descriptions of biological variation, such as haplotypes, genotypes, categorical variation, and structural variation models, is already underway.
  2. A Machine-Readable Schema to structure genetic variation data for electronic data exchange allows investigators to perform analyses with more clarity and ease. While the current schema is JSON-based, it is intended to be neutral with respect to languages and databases.
  3. Conventions for data normalization allow users to compare and interpret data sets collected at different institutions. For alleles, the VR specification recommends a normalization process inspired by the NCBI’s SPDI project, which adjusts the sequence position to account for any ambiguity resulting from insertion or deletion alleles.
  4. Globally unique computed identifiers enable distributed groups to refer to a specific genetic variation without any prior coordination. Through a series of algorithms that serialize and condense the data into a digestible form, the unique identifier facilitates data exchange and can be used in secure environments.
  5. A Python implementation demonstrates the above models and algorithms in action. To enable interoperability, the Python package supports translation of existing variation representation schemas into the VR specification (and vice versa) for use in genomic data sharing.
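
To make the normalization and computed-identifier components above more concrete, here is a minimal Python sketch. It is not the VR specification's own algorithm, nor the API of its Python package. The `Allele` class, the `normalize` and `computed_id` functions, the `example:` prefix, the sorted-key JSON serialization, and the truncated SHA-512 digest are all assumptions made for illustration. The normalization step is a simplified take on the SPDI-style idea of expanding an insertion or deletion across the whole region where its position is ambiguous.

```python
import base64
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Allele:
    """Hypothetical, simplified allele: an interbase interval plus its replacement sequence."""
    sequence_id: str
    start: int   # interbase coordinate (0 = before the first base)
    end: int
    state: str   # sequence that replaces reference[start:end]


def normalize(ref_seq: str, allele: Allele) -> Allele:
    """Trim shared flanking bases, then expand indels over their ambiguous region."""
    start, end, alt = allele.start, allele.end, allele.state
    ref = ref_seq[start:end]

    # Trim bases shared by the reference and alternate sequences (suffix, then prefix).
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt, end = ref[:-1], alt[:-1], end - 1
    while ref and alt and ref[0] == alt[0]:
        ref, alt, start = ref[1:], alt[1:], start + 1

    # Substitutions (both non-empty) and no-ops (both empty) are not ambiguous.
    if bool(ref) == bool(alt):
        return Allele(allele.sequence_id, start, end, alt)

    # Pure insertion or deletion: roll the inserted/deleted unit left and right.
    unit = ref or alt
    n = len(unit)
    left, i = start, 0
    while left > 0 and ref_seq[left - 1] == unit[(n - 1 - i) % n]:
        left, i = left - 1, i + 1
    right, j = end, 0
    while right < len(ref_seq) and ref_seq[right] == unit[j % n]:
        right, j = right + 1, j + 1

    # Describe the change over the whole ambiguous region.
    new_state = ref_seq[left:start] + alt + ref_seq[end:right]
    return Allele(allele.sequence_id, left, right, new_state)


def computed_id(allele: Allele) -> str:
    """Derive an identifier from content alone, so no central registry is needed."""
    blob = json.dumps(asdict(allele), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha512(blob.encode("utf-8")).digest()[:24]  # truncation is an assumption
    return "example:" + base64.urlsafe_b64encode(digest).decode("ascii")


# Two labs describe the same deletion of one "CAG" repeat at different positions.
reference = "ACAGCAGT"
lab_a = normalize(reference, Allele("seq1", 1, 4, ""))
lab_b = normalize(reference, Allele("seq1", 4, 7, ""))
print(lab_a == lab_b)                            # True
print(computed_id(lab_a) == computed_id(lab_b))  # True
```

Because both descriptions normalize to the same interval and state, the two labs arrive at the same identifier without coordinating with each other, which is the property the list above attributes to globally unique computed identifiers.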

“Genetic variation is complex. Currently, no single model exists to represent and exchange this data in a consistent, reliable manner,” said product lead Larry Babb, a senior principal software engineer at the Broad Institute of MIT and Harvard. “By addressing this issue, the specification will allow different communities to ‘speak the same language’— whether it’s diagnostic labs and EHR vendors who are collecting samples or investigators who are accessing them.”

The ClinGen Allele Registry, which provides unique variant identifiers for the community, has developed an initial implementation for the VR specification. “Wide adoption of the VR specification will provide a means of collating information about variants [at scale],” according to a statement by Allele Registry authors Aleks Milosavljevic and Ronak Y. Patel of the Baylor College of Medicine. “This capability has so far been lacking and is synergistic with the approach of the Allele Registry and other web services, which support the linking of information on almost one billion distinct variants.”  

The VR specification aims to maximize the value of the world’s shared genetic variation data. In addition to building computational models for more complex types of variation, the team has future plans to describe new types of intervals and locations, as well as develop additional concepts for representing state and variation.