Now open for comment: GA4GH Data Use Ontology


The GA4GH Data Use Ontology (DUO) lets users semantically tag genomic datasets with usage restrictions, so that the datasets become automatically discoverable based on a health, clinical, or biomedical researcher’s authorization level or intended use. DUO follows the OBO Foundry principles and is developed in the W3C Web Ontology Language (OWL). It is used in production by the European Genome-phenome Archive (EGA) at EMBL-EBI/CRG and by the Broad Institute for its Data Use Oversight System (DUOS).

Human subjects datasets often carry restrictions such as “only available for cancer use” or “only available for the study of pediatric diseases.” These restrictions are derived from the informed consent form used in the original biospecimen collection, and they must be respected when the datasets are shared and studied. Each institution uses its own language in its informed consent forms to describe secondary use restrictions and conditions. DUO provides a standard, universal system for categorizing these conditions, with the aim of allowing data access committees and researchers to interpret them in a consistent, structured way.
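
As a rough illustration of the idea, the Python sketch below tags a few datasets with a data use category and filters them by a researcher's stated purpose. It is a minimal sketch under stated assumptions, not how EGA or DUOS implement matching: the category labels, dataset names, and function names are all hypothetical stand-ins, not official DUO term identifiers.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical category labels for illustration only.
# They are stand-ins, not official DUO term identifiers.
GENERAL_RESEARCH = "general-research-use"
DISEASE_SPECIFIC = "disease-specific"


@dataclass(frozen=True)
class DataUseTag:
    """A data use category, plus a disease when the category is disease-specific."""
    category: str
    disease: Optional[str] = None


@dataclass(frozen=True)
class Dataset:
    name: str
    tag: DataUseTag


def is_permitted(tag: DataUseTag, purpose_disease: Optional[str]) -> bool:
    """Return True if a research purpose is compatible with a dataset's tag."""
    if tag.category == GENERAL_RESEARCH:
        return True
    if tag.category == DISEASE_SPECIFIC:
        # Simplification: exact string match. A real ontology-based matcher
        # would also accept more specific diseases under the tagged one.
        return purpose_disease is not None and purpose_disease == tag.disease
    # Unknown categories fail closed: no access is assumed.
    return False


def discoverable(datasets: List[Dataset], purpose_disease: Optional[str]) -> List[str]:
    """Names of the datasets whose restrictions allow the stated purpose."""
    return [d.name for d in datasets if is_permitted(d.tag, purpose_disease)]


# Hypothetical example datasets.
DATASETS = [
    Dataset("cohort-a", DataUseTag(GENERAL_RESEARCH)),
    Dataset("cohort-b", DataUseTag(DISEASE_SPECIFIC, "cancer")),
    Dataset("cohort-c", DataUseTag(DISEASE_SPECIFIC, "pediatric disease")),
]

if __name__ == "__main__":
    print(discoverable(DATASETS, "cancer"))
    print(discoverable(DATASETS, None))
```

Because every dataset carries a structured tag rather than free-text consent language, the same check can be applied uniformly across institutions. Failing closed on unknown categories is a deliberate choice in this sketch: a restriction that cannot be interpreted is treated as not permitting use.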

DUO represents data use terms from three evolving efforts to standardize data use restrictions in the biomedical and genomics research domain:

  • NIH database of Genotypes and Phenotypes (dbGaP) data use categories. dbGaP is one of the largest public repositories of genomics data in the world.
  • Consent Codes – a global effort led by Stephanie OM Dyke (McGill University) and the GA4GH Regulatory and Ethics Work Stream to define ‘codes’ for specific categories of data use restrictions based on the datasets of the main public genome archives (NCBI dbGaP and EMBL-EBI/CRG EGA).
  • The Automated Data Access Matrix (ADA-M) – work led by Anthony Brookes and other GA4GH members of the ADA-M task team to define a matrix of data use categories that can be used to define data use restrictions and research purpose.

DUO is an evolving effort to provide a digital ontological representation of all the data use categories defined by the efforts mentioned above. Its evolution is being led by GA4GH Driver Projects, including the EMBL-EBI/CRG EGA (where it is currently used in production), the All of Us Research Program, and the NIH Data Commons Pilot.

As of January 15, 2019, DUO has been submitted for product approval by the GA4GH Steering Committee and is open for public comment until February 15, 2019. Technical comments are invited via the GitHub issue tracker; general comments should be sent to the GA4GH Data Use mailing list.