Genomic Data Toolkit


Access and adopt the ready-to-use standards for genomic data sharing below, or download the full 5-year GA4GH Connect Strategic Plan.

Data Use Ontology v1

A GA4GH-approved Standard

The GA4GH Data Use Ontology (DUO) allows users to semantically tag genomic datasets with usage restrictions, so that the datasets become automatically discoverable based on a health, clinical, or biomedical researcher’s authorization level or intended use. DUO is based on the OBO Foundry principles and developed using the W3C Web Ontology Language. It is used in production by the European Genome-phenome Archive (EGA) at EMBL-EBI/CRG and by the Broad Institute for the Data Use Oversight System (DUOS).
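
As a rough illustration of how DUO tags could drive automated discovery, the sketch below filters a small in-memory catalogue by a researcher's intended use. The dataset names and the PERMITS table are hypothetical, and the DUO term IDs are quoted from memory and should be checked against the published ontology. Real implementations reason over the ontology hierarchy rather than a hand-written lookup.

```python
# DUO permission terms (IDs as recalled; verify against the ontology).
GRU = "DUO:0000042"  # general research use
HMB = "DUO:0000006"  # health or medical or biomedical research
DS = "DUO:0000007"   # disease specific research

# Hypothetical catalogue: dataset name -> DUO tag.
DATASETS = {
    "cohort-a": GRU,
    "cohort-b": HMB,
    "cohort-c": DS,
}

# Simplified assumption: intended uses that each dataset tag permits.
# Disease matching for DS datasets is deliberately ignored here.
PERMITS = {
    GRU: {GRU, HMB, DS},
    HMB: {HMB, DS},
    DS: {DS},
}


def discoverable(intended_use):
    """Return the names of datasets whose tag permits the intended use."""
    return sorted(
        name for name, tag in DATASETS.items()
        if intended_use in PERMITS[tag]
    )


print(discoverable(HMB))
```

A researcher declaring health or biomedical research would see cohort-a and cohort-b under these assumptions, but not the disease-specific cohort-c.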

GA4GH Passports v1

A GA4GH-approved Standard

The GA4GH Passport specification aims to support data access policies within current and evolving data access governance systems. This specification defines Passports and Passport Visas as the standard way of communicating the data access authorizations that a user has based on their role (e.g. researcher), affiliation, or access status.
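
To make the idea of a Visa concrete, here is a minimal sketch that reads the visa claim out of a JWT payload. It assumes, as recalled from the specification, that each Visa is a JWT carrying a `ga4gh_visa_v1` claim with fields such as `type`, `value`, `source`, and `by`. The token and all values below are invented for illustration, and the sketch does not verify signatures, which any real service must do.

```python
import base64
import json


def b64url_decode(segment):
    """Decode a base64url string, restoring any stripped '=' padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def read_visa(visa_jwt):
    """Return the `ga4gh_visa_v1` claim from a Visa JWT.

    WARNING: the signature is NOT verified here; validate it before
    trusting any claim in a real deployment.
    """
    _header, payload_b64, _signature = visa_jwt.split(".")
    payload = json.loads(b64url_decode(payload_b64))
    return payload["ga4gh_visa_v1"]


def make_unsigned_token(payload):
    """Build a fake, unsigned token for demonstration purposes only."""
    def encode(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{encode({'alg': 'none'})}.{encode(payload)}."


# Hypothetical visa asserting a controlled-access grant.
token = make_unsigned_token({
    "ga4gh_visa_v1": {
        "type": "ControlledAccessGrants",
        "value": "https://example.org/datasets/demo",
        "source": "https://example.org/dac",
        "by": "dac",
    }
})
print(read_visa(token)["type"])
```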

Data Repository Service v1

A GA4GH-approved Standard

The Data Repository Service (DRS) API is a standard for building data repositories and adapting access tools to work with those repositories. It works with other approved APIs from the GA4GH Cloud Work Stream to allow researchers to discover algorithms across different cloud environments and send them to the datasets they wish to analyze. The API allows data consumers to access datasets regardless of the repository in which they are stored or managed.
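
The repository-independent access described above rests on a uniform object URL. The sketch below assumes the hostname-based `drs://` URI form and the `/ga4gh/drs/v1/objects/{object_id}` path as recalled from the DRS specification. The hostname and object ID are made up, and no network request is issued.

```python
from urllib.parse import quote, urlparse


def drs_uri_to_url(drs_uri):
    """Map a hostname-based DRS URI to the HTTPS URL of the object record."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    # Percent-encode the ID so it stays a single path segment.
    encoded_id = quote(object_id, safe="")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{encoded_id}"


# Hypothetical server and object ID.
print(drs_uri_to_url("drs://drs.example.org/314159"))
```

A client would then fetch that URL and read the access information from the returned JSON record, regardless of which repository hosts the bytes.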

Tool Registry Service API v2

A GA4GH-approved Standard

The Tool Registry Service (TRS) is a standard API for exchanging tools and workflows to analyze, read, and manipulate genomic data. The TRS API is one of a series of technical standards from the Cloud Work Stream that together allow genomics researchers to bring algorithms to datasets in disparate cloud environments, rather than moving data around. TRS gives researchers access to far more tools than they can presently use, and allows developers to register their products so that they are visible on a multitude of sites, expanding their audience reach. The API also provides a set of requirements for tool and workflow registries to implement TRS.
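
As a small sketch of what a shared registry API buys a client, the function below builds the URL for fetching a tool descriptor. It assumes the `/ga4gh/trs/v2` prefix and the `/tools/{id}/versions/{version_id}/{type}/descriptor` path as recalled from the TRS v2 specification. The registry host and tool ID are hypothetical.

```python
from urllib.parse import quote

TRS_PREFIX = "/ga4gh/trs/v2"


def descriptor_url(base_url, tool_id, version_id, descriptor_type):
    """Build the URL of a tool version's descriptor (e.g. CWL or WDL)."""
    # Tool IDs often contain slashes, so encode them as one path segment.
    tool = quote(tool_id, safe="")
    version = quote(version_id, safe="")
    return (
        f"{base_url.rstrip('/')}{TRS_PREFIX}/tools/{tool}"
        f"/versions/{version}/{descriptor_type}/descriptor"
    )


# Hypothetical registry and tool.
print(descriptor_url("https://registry.example.org", "quay.io/org/tool", "1.0.0", "CWL"))
```

Because every conforming registry exposes the same paths, the same client code can point at a different `base_url` without other changes.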

Variation Representation v1

A GA4GH-approved Standard

The Variation Representation (VR) specification provides a flexible framework of computational models, schemas, and algorithms to precisely and consistently exchange genetic variation data across communities. The specification, which was developed with input from national information resource providers, major public initiatives, and diagnostic testing laboratories, significantly reduces ambiguity in exchanging variation data. In this way, VR aims to improve the reliability and utility of the clinical annotations that are central to personalized medicine. The VR specification consists of five key components that together produce a reliable way of describing and transferring genetic variation data: an extensible terminology and information model, a machine-readable schema, conventions for data normalization, globally unique computed identifiers, and a Python implementation.
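
The computed identifiers mentioned above are derived from a content digest. The sketch below assumes the truncated digest described in the specification (SHA-512, truncated to the first 24 bytes, then base64url encoded, often written `sha512t24u`). It covers only the digest step: the canonical serialization of a variation object and the identifier prefix are omitted.

```python
import base64
import hashlib


def sha512t24u(blob):
    """Return the base64url encoding of the first 24 bytes of SHA-512(blob)."""
    digest = hashlib.sha512(blob).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


print(sha512t24u(b""))  # z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc
```

Because the digest depends only on content, two parties that serialize the same variation identically will compute the same identifier without consulting a central registry.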

Crypt4GH v1

A GA4GH-approved Standard

Genomic data can include confidential information about the health of individuals, and it is important that such information is not accidentally disclosed. One part of the defense against such disclosure is to keep the data in an encrypted format as much as possible. Crypt4GH is a file format that can be used to store data in an encrypted and authenticated state. Existing applications can, with minimal modification, read and write data in the encrypted format. The choice of encryption also allows the encrypted data to be read starting from any location, facilitating indexed access to files.
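
Reading from any location works because the data is encrypted in fixed-size segments that can be decrypted independently. The sketch below shows only the offset arithmetic, assuming the layout as recalled from the specification: 65,536-byte plaintext segments, each stored with a 12-byte nonce and a 16-byte authentication tag. The header length is file-specific, so the value used here is hypothetical, and edit lists are ignored.

```python
SEGMENT_SIZE = 65536   # plaintext bytes per segment (assumed from the spec)
NONCE_SIZE = 12        # bytes of nonce stored before each ciphertext
MAC_SIZE = 16          # bytes of authentication tag after each ciphertext
CIPHER_SEGMENT_SIZE = NONCE_SIZE + SEGMENT_SIZE + MAC_SIZE  # 65564


def locate(plain_offset, header_length):
    """Find the encrypted segment that holds a given plaintext offset.

    Returns (segment_index, file_offset_of_segment, offset_inside_segment).
    """
    segment_index, inner_offset = divmod(plain_offset, SEGMENT_SIZE)
    file_offset = header_length + segment_index * CIPHER_SEGMENT_SIZE
    return segment_index, file_offset, inner_offset


# Hypothetical 124-byte header; real readers take this from the file itself.
print(locate(200000, header_length=124))
```

A reader seeking to plaintext byte 200,000 would therefore decrypt only the fourth segment rather than the whole file.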

Beacon API v1

A GA4GH-approved Standard

The Beacon API can be implemented as a web-accessible service that users may query for information about a specific allele. A user of a Beacon can pose the query “Have you observed this nucleotide (e.g. C) at this genomic location (e.g. position 32,936,732 on chromosome 13)?” to which the Beacon responds with either “yes” or “no”. The new release of the Beacon API extends its functionality through support for additional types of genomic variants and improved metadata support. Additionally, the accompanying ELIXIR Beacon reference implementation demonstrates ELIXIR Authorization and Authentication Infrastructure (AAI), enabling data owners to light Beacons at different tiers of data access: public, registered, or controlled.
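
The yes/no question above can be modelled in a few lines. The class below is a toy, in-memory stand-in written only to illustrate the query semantics. It is not the Beacon API, its names are invented, and it ignores assemblies, variant types, metadata, and access tiers.

```python
class ToyBeacon:
    """Answer allele-presence questions from an in-memory set."""

    def __init__(self, observed):
        # observed: iterable of (chromosome, 1-based position, allele)
        self._observed = {
            (chrom, pos, allele.upper()) for chrom, pos, allele in observed
        }

    def query(self, chromosome, position, allele):
        """Return True if the allele was observed at that location."""
        return (chromosome, position, allele.upper()) in self._observed


# The allele from the example question, used purely as illustrative data.
beacon = ToyBeacon([("13", 32936732, "C")])
print(beacon.query("13", 32936732, "C"))  # True
print(beacon.query("13", 32936732, "T"))  # False
```

Note that the response reveals only presence or absence, never which individual carries the allele.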

CRAM File Format v3

The CRAM file format is an efficient storage format for read data, achieving significantly better lossless compression than BAM while maintaining full compatibility with BAM. To learn more about the benefits of the CRAM file format for genomic data compression, visit here.

Family History Tools Inventory

The Family History Tools Inventory is a catalogue of tools currently available for documenting family health history information. The Statement of Best Practice highlights current approaches and challenges in enabling family history to guide clinical care. It is aimed at developers of clinically oriented family history collection systems, including standalone and EHR-integrated systems. The inventory will be updated periodically, and we encourage recommendations of other tools to include. To recommend a tool, please email info@ga4gh.org.

htsget API v1

A GA4GH-approved Standard

htsget is a genomic data retrieval specification that allows users to download read data for only the subsections of the genome in which they are interested. Without htsget, users must download the whole set of files in which that data resides, a slow, resource-intensive process.
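
A sketch of a region request may help here. It assumes the `/reads/{id}` path, the `format`, `referenceName`, `start`, and `end` query parameters (0-based, end-exclusive), and the JSON ticket shape with an `htsget.urls` list, all as recalled from the htsget specification. The server, sample ID, and ticket contents are hypothetical, and nothing is fetched.

```python
from urllib.parse import quote, urlencode


def reads_request_url(base_url, read_id, reference_name, start, end, fmt="BAM"):
    """Build the URL that asks for reads overlapping one genomic region."""
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })
    return f"{base_url.rstrip('/')}/reads/{quote(read_id, safe='')}?{query}"


def block_urls(ticket):
    """List, in order, the URLs whose payloads concatenate into the result."""
    return [block["url"] for block in ticket["htsget"]["urls"]]


# Hypothetical server, sample, and ticket.
print(reads_request_url("https://htsget.example.org", "sample1", "chr1", 100000, 200000))
ticket = {
    "htsget": {
        "format": "BAM",
        "urls": [
            {"url": "https://data.example.org/sample1/header"},
            {"url": "https://data.example.org/sample1/block-7"},
        ],
    }
}
print(block_urls(ticket))
```

The client downloads each listed block and concatenates them, so only the bytes covering the requested region cross the network.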

refget API v1

A GA4GH-approved Standard

All sequencing-based genomic analysis uses a genomic “reference sequence”: a baseline of knowledge against which variations are observed. There are multiple human reference sequences of increasing accuracy, and different organizations refer to the same sequence using different names or reuse names to refer to different reference releases. Reliable, reproducible genomic analysis depends on clear provenance back to reference data. The GA4GH refget API enables unambiguous access to reference genomic sequences from different databases and servers, using a checksum identifier based on the sequence content itself.
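
To show what a content-based identifier means in practice, the sketch below computes two checksums for a sequence. It assumes the schemes recalled from refget v1: MD5, and a truncated SHA-512 (the first 24 bytes, hex encoded), both taken over the uppercase sequence with no whitespace. The server URL and the `/sequence/{id}` path in the helper are likewise assumptions, and no request is made.

```python
import hashlib


def refget_checksums(sequence):
    """Compute content-derived identifiers for a reference sequence."""
    data = sequence.upper().encode("ascii")
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "trunc512": hashlib.sha512(data).digest()[:24].hex(),
    }


def sequence_url(base_url, checksum):
    """Build the URL for retrieving a sequence by its checksum."""
    return f"{base_url.rstrip('/')}/sequence/{checksum}"


ids = refget_checksums("acgt")
# Hypothetical server.
print(sequence_url("https://refget.example.org", ids["md5"]))
```

Two servers holding the same bases return the same identifier whatever name they give the sequence, which is what removes the naming ambiguity.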

SAM/BAM File Formats v1

Specifications for storing next-generation sequencing read data.

Variant Benchmarking Tools

Standardized benchmarking methods and tools are essential to robust accuracy assessment of next-generation sequencing variant calling. Benchmarking variant calls requires careful attention to definitions of performance metrics, sophisticated comparison approaches, and stratification by variant type and genome context. The germline small variant benchmarking tools address challenges in (1) matching variant calls with different representations, (2) defining standard performance metrics, (3) enabling stratification of performance by variant type and genome context, and (4) developing and describing limitations of high-confidence calls and regions that can be used as “truth”. They have been piloted in the precisionFDA variant calling challenges to identify the best-in-class variant calling methods within high-confidence regions.
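
The performance metrics in point (2) are typically derived from counts of true positives, false positives, and false negatives. The sketch below uses the conventional definitions of precision, recall, and F1. It is not the benchmarking tools' own code, and the counts in the example are invented.

```python
def benchmark_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from variant-call match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented counts for one stratum, e.g. SNVs inside high-confidence regions.
print(benchmark_metrics(tp=90, fp=10, fn=30))
```

Running the same function per variant type or genome context gives the stratified view described in point (3).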

VCF v4 / BCF v2 File Formats

Specifications for the Variant Call Format (VCF) and its binary counterpart, BCF.

Workflow Execution Service (WES) API v1

A GA4GH-approved Standard

Portable tools, which can execute a single analysis in a variety of environments, allow researchers to work with more data from more sources, and tool builders to support more researchers and more use cases. The Workflow Execution Service (WES) API provides a standard for exactly that. This API lets users run a single workflow (defined using CWL or WDL) on multiple different platforms, clouds, and environments, and be confident that it will work the same way. The API provides methods to request that a workflow be run, pass parameters to that workflow, get information about running workflows, and cancel a running workflow.
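
The four operations listed above map onto a handful of HTTP calls. The sketch below assembles the request pieces without sending anything. It assumes the `/ga4gh/wes/v1` prefix, the `/runs`, `/runs/{run_id}/status`, and `/runs/{run_id}/cancel` paths, and the `workflow_*` form fields as recalled from the WES specification. The server, workflow URL, parameters, and run ID are hypothetical.

```python
import json

WES_PREFIX = "/ga4gh/wes/v1"


def run_request_fields(workflow_url, workflow_type, workflow_type_version, params):
    """Assemble the form fields for a request to run a workflow."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,  # e.g. "CWL" or "WDL"
        "workflow_type_version": workflow_type_version,
        "workflow_params": json.dumps(params),  # parameters travel as JSON text
    }


def endpoints(base_url, run_id):
    """Return (HTTP method, URL) pairs for the core run operations."""
    root = base_url.rstrip("/") + WES_PREFIX
    return {
        "submit": ("POST", f"{root}/runs"),
        "status": ("GET", f"{root}/runs/{run_id}/status"),
        "cancel": ("POST", f"{root}/runs/{run_id}/cancel"),
    }


# Hypothetical workflow and server.
fields = run_request_fields(
    "https://example.org/workflows/align.wdl", "WDL", "1.0", {"sample": "s1"}
)
print(fields["workflow_params"])
print(endpoints("https://wes.example.org", "run-123")["status"])
```

Pointing `base_url` at a different conforming service is the only change needed to run the same workflow elsewhere.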
