Phenopackets: Standardizing and Exchanging Patient Phenotypic Data

Image Credit: Stephanie Li, GA4GH

More than 60 million genomes are expected to be sequenced for healthcare purposes over the next five years. This mass of data has the potential to inform human health and medicine in unprecedented ways, but that promise will only be realized if the data can be shared across disciplines and effectively linked to clinical outcomes. 

Most existing formats for describing genotype information do not include a means to share the corresponding phenotypic information (e.g., observable characteristics, signs and symptoms of disease). While some genomic databases have defined their own formats for representing phenotypic information, the lack of uniformity across these databases hinders communication and limits the ability to perform analyses that span them.

The GA4GH Steering Committee recently approved Phenopackets, a standard file format for sharing phenotypic information. The Phenopackets standard aims to facilitate communication between the research and clinical genomics communities by creating an ecosystem of interoperable tools and resources that can use phenotypic data with fewer barriers. 

A phenopacket file contains a set of mandatory and optional fields that describe a patient’s or participant’s phenotype, such as clinical diagnosis, age of onset, lab test results, and disease severity. It can also link to a separate file containing the patient’s genetic sequence, if available. Phenopackets are expected to standardize phenotypic data exchange across medical and scientific settings, allowing phenotypic data to flow between clinics, databases, clinical labs, journals, and patient registries in ways that are currently feasible only for more quantifiable data, such as sequence data.
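As a rough illustration of the structure described above, the following Python sketch assembles a phenopacket-like record and prints it as JSON. The field names (`id`, `subject`, `phenotypicFeatures`, `files`) are loosely modeled on the published schema but should be treated as assumptions, as should the helper `make_phenopacket`, the example identifiers, and the file path; consult the official Phenopackets schema before relying on any of them.

```python
import json


def make_phenopacket(packet_id, subject_id, features, sequence_file_uri=None):
    """Assemble a phenopacket-like dict (hypothetical helper, not an official API).

    `features` is a list of (ontology_id, label) pairs. The link to a
    sequence file is optional and is only added when a URI is supplied.
    """
    packet = {
        "id": packet_id,
        "subject": {"id": subject_id},
        "phenotypicFeatures": [
            {"type": {"id": term_id, "label": label}}
            for term_id, label in features
        ],
    }
    if sequence_file_uri is not None:
        # Assumed field name for the pointer to a separate sequence file.
        packet["files"] = [{"uri": sequence_file_uri}]
    return packet


# Illustrative values only; verify ontology identifiers against the
# Human Phenotype Ontology before use.
example = make_phenopacket(
    "example-packet-1",
    "patient-1",
    [("HP:0001882", "Leukopenia")],
    sequence_file_uri="file:///data/patient-1.vcf.gz",
)
print(json.dumps(example, indent=2))
```

Keeping the sequence file as a link rather than embedding it mirrors the description above: the phenotypic record stays small, and the genetic data can live wherever it is already stored.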

“Phenotype data is, by its nature, complex due to the wide array of modalities used to capture trait information,” said EMBL-EBI Bioinformatician Terry Meehan, who has implemented Phenopackets within the International Mouse Phenotyping Consortium (IMPC). “This complexity leads to challenges in data interoperability as differing languages are used between biomedical databases to describe similar results—a serious bottleneck in translating research for clinicians.”

The standard is of significant relevance to the rare disease and cancer communities, in which clinical data—such as lab test results, physical attributes, or disease progression and severity—are often used to differentiate between conditions that share similar phenotypes.

“Phenopackets will greatly simplify representation and exchange of phenotypic information, opening the door for matching rare disease patients in federated query systems supported by GA4GH,” said Metadata Standards Coordinator at EMBL-EBI, Melanie Courtot, who is leading implementation of the Phenopackets standard within the BioSamples database.

Using Phenopackets, clinicians can search through genetic variants that produce similar phenotypes and determine which one best matches their patient. Such matching can support faster, more accurate diagnosis and treatment, and may improve the chances of remission. Phenopackets also benefit researchers by opening up opportunities to analyze more data and strengthen our understanding of human health and disease.

“Clinicians and researchers with varying degrees of genomics expertise will find the file format useful,” said Melissa Haendel, Principal Investigator for the Monarch Initiative and Lead of the GA4GH Clinical & Phenotypic Data Capture Work Stream. “Phenopackets provide different levels of complexity so that we can exchange both high-level clinical phenotype information as well as in-depth data.” For instance, the standard can be used to describe anything from abnormal fetal movement or decreased white blood cell count to eye color or height.

Most of the fields within the file are optional, giving clinicians and researchers freedom to report only the phenotypic information they choose. If specific lab tests are not administered or a patient’s whole genome is not sequenced, those data do not need to be included in a phenopacket that stores other related information. This flexibility will also allow for the omission of identifiable information, such as date of birth or name, to preserve patient privacy.
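The flexibility described above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the set of identifying fields, the field names `name` and `dateOfBirth`, and both helper functions are hypothetical and are not taken from the Phenopackets specification.

```python
import copy

# Assumed names of subject fields that could identify a patient.
IDENTIFYING_SUBJECT_FIELDS = {"name", "dateOfBirth"}


def strip_identifiers(packet):
    """Return a copy of the packet with identifying subject fields removed."""
    cleaned = copy.deepcopy(packet)
    subject = cleaned.get("subject", {})
    for field in IDENTIFYING_SUBJECT_FIELDS:
        subject.pop(field, None)
    return cleaned


def drop_empty_fields(packet):
    """Omit top-level optional fields that carry no data (None or empty)."""
    return {k: v for k, v in packet.items() if v not in (None, [], {}, "")}


record = {
    "id": "example-packet-2",
    "subject": {"id": "patient-2", "name": "Jane Doe", "dateOfBirth": "1990-01-01"},
    "phenotypicFeatures": [{"type": {"label": "Leukopenia"}}],
    "measurements": [],  # no lab tests were administered
    "files": None,       # no genome was sequenced
}

shareable = drop_empty_fields(strip_identifiers(record))
print(shareable)
```

Working on a deep copy leaves the original record untouched, so a clinic could keep the full version internally and share only the reduced one.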

To work with phenopacket files, researchers and clinicians can use existing software, such as Phenotools (for validating Phenopackets) and Exomiser (for annotating variants).

Peter Robinson, a computational biologist and pediatric physician at the Jackson Laboratory, leads Phenopackets development. Robinson notes that the team hopes to soon release a guide for implementing Phenopackets within electronic health records built on the HL7 FHIR framework (a widely adopted standard for exchanging electronic health data) in order to drive uptake among the clinical community. The development team is also working with journals to require that phenotype data be submitted in the Phenopackets format, which will encourage research scientists to adopt the standard.

“Phenopackets enable a massive network of genomic data sharing, not only within the research or clinical communities, but also between the two groups,” said Robinson. “Now researchers can use patient phenotype information to further their understanding of human biology, and clinicians can reap the benefits of research findings in healthcare.”