GDPR Brief: when are synthetic health data personal data?

11 Apr 2024

There is growing enthusiasm for synthetic data as a means of enabling meaningful data analysis while preserving privacy. However, the technical literature identifies threats to privacy for some forms of synthetic data when generated using personal data. In this brief, we discuss the key question of whether/when synthetic data may constitute personal data, set out the approach that data controllers should adopt in answering such questions, and explain why general legal uncertainty is likely to remain for some time.


What is synthetic data?

Synthetic data can be thought of as ‘artificial data that closely mimic the properties and relationships of real data’. While there is considerable interest in synthetic data as a form of privacy-enhancing technology (PET), it has other important potential uses, such as filling gaps in real-world data, correcting bias in datasets, and enabling cohort planning, code development and other technical applications. For example, synthetic genomic data have been generated by the Common Infrastructure for National Cohorts in Europe, Canada, and Africa Project (CINECA) to demonstrate federated research and clinical applications.

Synthetic data is also an umbrella term. It can be generated in multiple ways and for varying purposes and, consequently, can take different forms with different privacy implications (see the table below).

Generation Method: Synthetic data generation methods can be grouped into three main categories: statistical, noise-based and machine learning methods. Some require manual manipulation of the data, others add noise to disguise personal data, and others learn the complex relationships in the real data so as to map those correlations onto entirely ‘made up’ but biologically realistic values. The output synthetic data can therefore be fully or only partially synthetic. All approaches may require manual review and oversight to assess accidental and coincidental matches (a simple illustration of the first two categories follows the table).
Privacy Preservation Techniques: As with other data, privacy preservation techniques may be applied to synthetic data. The characteristics (e.g., distribution, size) and intended use of the data will determine the most appropriate technique. For example, differential privacy (which adds noise to distort the true data values) is better suited to large datasets, but it reduces accuracy (i.e., ‘fidelity’) and therefore the data’s utility in some contexts.
Context and Use Limitations: Technical and organisational restrictions on access to synthetic data, and on the tools available when processing them, may be applied to further limit privacy risks. As with other forms of data, care should be taken when repurposing synthetic data or changing their context, to ensure the data remain fit for purpose and sufficiently safeguarded.
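To make the first two categories in the table concrete, here is a minimal sketch in Python (using NumPy) of a purely statistical approach with an optional noise-based step: simple summary statistics are fitted to a small, entirely hypothetical ‘real’ dataset, new values are sampled from the fitted distribution, and Laplace noise is added to the fitted mean in the spirit of differential privacy. The dataset, the epsilon and sensitivity values, and all variable names are illustrative assumptions only; this is not a calibrated differential-privacy mechanism or a production pipeline.

```python
# Illustrative sketch only: fit simple summary statistics to a small,
# hypothetical "real" dataset and sample entirely new ("fully synthetic")
# values from the fitted distribution.
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical "real" measurements (e.g. a clinical variable for six patients).
real = np.array([61.2, 73.5, 58.9, 80.1, 69.4, 75.0])

# 1. Statistical method: fit a normal distribution and sample new values.
mu, sigma = real.mean(), real.std(ddof=1)
synthetic = rng.normal(mu, sigma, size=100)  # 100 made-up but plausible values

# 2. Noise-based flavour: perturb the fitted mean with Laplace noise before
#    sampling. The sensitivity and epsilon below are placeholders, not a
#    properly calibrated differential-privacy mechanism.
epsilon = 1.0
sensitivity = (real.max() - real.min()) / len(real)
noisy_mu = mu + rng.laplace(scale=sensitivity / epsilon)
noisy_fit_synthetic = rng.normal(noisy_mu, sigma, size=100)

print(f"real mean={real.mean():.1f}, "
      f"synthetic mean={synthetic.mean():.1f}, "
      f"noisy-fit synthetic mean={noisy_fit_synthetic.mean():.1f}")
```

The point of the sketch is simply that the output values are newly sampled rather than copied from any individual’s record, which is what distinguishes fully synthetic data from masked or pseudonymised data; it also shows how a privacy preservation technique can be layered on top of the generation step.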

When will synthetic data amount to personal data?

The key question is whether synthetic data fall within the material scope of data protection law as ‘personal data’ (Article 4(1) GDPR). Given the wide range of synthetic data generation methods, outputs and uses, there can be no one-size-fits-all answer, but there are some indications of the approach and relevant factors that courts and regulatory bodies will adopt in reaching one.

In our recent review of emerging statements and guidance from authorities in Europe, we found that data protection authorities are currently approaching synthetic data in a similar, “orthodox” way. This approach treats synthetic data as a novel privacy-enhancing technology and starts from the presumption that, if the source data are ‘personal data’, the output synthetic data remain personal data unless it can be demonstrated with a high degree of confidence that the risks of re-identification are minimal and well safeguarded against.

In the health and genomic context, this means that the use of patients’ personal data to generate synthetic data should trigger an assessment by the data controller of whether the output synthetic data could (in combination with any other available sources) be used to identify a living individual. At this stage, it is not clear whether coincidental matching (i.e., where a living individual’s profile was not part of the training data but a generated synthetic profile nonetheless matches them, or matches a real person who comes to exist in the future) should also be considered. Such challenges are currently underexplored.
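To illustrate the kind of check a data controller might run as one small part of such an assessment, the sketch below (our own illustration, not drawn from regulatory guidance) compares each synthetic record against the nearest record in an available ‘real’ dataset and flags suspiciously close matches for manual review. The data, the distance metric and the threshold are hypothetical; real identifiability assessments are contextual and go far beyond a numeric distance check.

```python
# Simplified, illustrative check only: flag synthetic records that fall very
# close to a record in an available "real" dataset, whether the match is a
# genuine leak from the training data or purely coincidental.
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical records: rows are individuals, columns are quasi-identifying
# attributes (e.g. age, BMI) rescaled to comparable ranges.
real = rng.normal(size=(50, 2))
synthetic = rng.normal(size=(200, 2))

threshold = 0.05  # hypothetical "too close for comfort" distance

# Distance from every synthetic record to its nearest real record.
dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
nearest = dists.min(axis=1)

flagged = np.flatnonzero(nearest < threshold)
print(f"{flagged.size} of {synthetic.shape[0]} synthetic records lie within "
      f"{threshold} of a real record and would warrant manual review")
```

Note that a check like this can only be run against data the controller actually holds; it cannot, by itself, rule out coincidental matches with individuals outside the available datasets.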

As regular readers of these briefs will be aware, determining whether data are identifying or anonymous requires a multifaceted contextual risk assessment, placing risk on a spectrum of identifiability that asks data controllers to assess ‘the means reasonably likely to be used to identify an individual’ (Recital 26, GDPR). Further detail on such assessments has been provided in a previous brief.

In addition to understanding how the data were generated, the privacy preservation techniques applied, the current and (possible) future processing environment, and the purposes for processing, data controllers may also need to consider the technical and organisational safeguards in place to determine whether effective anonymisation has been achieved (see the table above). A reassessment will likely be triggered whenever any of these factors change throughout the dataset’s lifecycle.
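One practical way to operationalise the reassessment point is to record the contextual factors on which the original assessment relied and to flag any change to them. The sketch below is a minimal illustration under our own assumptions; the field names are not a legal checklist, and a real process would involve documented review rather than a string comparison.

```python
# Illustrative sketch only: capture the contextual factors behind an
# identifiability assessment and detect when any of them change, which would
# prompt a reassessment. Field names and values are hypothetical.
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class AssessmentContext:
    generation_method: str        # e.g. "statistical", "machine_learning"
    privacy_techniques: tuple     # e.g. ("differential_privacy",)
    processing_environment: str   # e.g. "secure research environment"
    purposes: tuple               # e.g. ("cohort_planning",)
    safeguards: tuple             # e.g. ("access_agreement", "audit_logging")

    def fingerprint(self) -> str:
        # Stable hash of the recorded context for cheap change detection.
        return hashlib.sha256(
            json.dumps(asdict(self), sort_keys=True).encode()
        ).hexdigest()


baseline = AssessmentContext(
    "statistical", ("differential_privacy",),
    "secure research environment", ("cohort_planning",), ("access_agreement",)
)
current = AssessmentContext(
    "statistical", ("differential_privacy",),
    "public release",  # the processing environment has changed
    ("cohort_planning",), ("access_agreement",)
)

if current.fingerprint() != baseline.fingerprint():
    print("Context has changed: re-run the identifiability assessment.")
```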

However, while this regulatory approach ensures that data protection and privacy are upheld as far as possible, it is important to note that it is not without cost or challenge: 

  • It may encourage a risk-averse approach, which could lead synthetic data developers to reduce the utility of the data in a trade-off with privacy.
  • It is resource intensive, given the time and expertise required to appropriately audit identification risks and to make subsequent adjustments to the data environment (although there are also considerable resource and financial benefits in terms of streamlining subsequent access that could outweigh the initial development costs).
  • It could result in stricter access controls and higher fees, which could impede data accessibility.
  • Ultimately, if synthetic data are considered to be personal data, it may still be very difficult to determine how data rights and obligations apply where it is not clear that a relevant individual is the focus of the data.

Remaining Uncertainty

Unfortunately, the potential for confusion about the status of synthetic data in data protection law has been exacerbated by the drafting of the EU Artificial Intelligence Act (AIA), which seemingly groups synthetic data with anonymous or ‘other non-personal data’ (Article 59(1)(b)), despite data protection authorities across Europe adopting a more nuanced, case-by-case approach. However, the EU AI Act applies without prejudice to existing EU law, including the GDPR (recital 9), which means that the regulatory status and definition of synthetic data within data protection law remain unresolved. Moreover, questions have been raised about how well innovations such as synthetic data, its generation methods and its outputs fit the existing data protection framework. For example, the GDPR and data protection guidance do not currently address the issue of coincidental matching with existing or future real persons (which could arise with synthetic data). Determining whether this risk is relevant to data protection will involve consideration of the appropriate role of data protection law (as opposed to other parts of the legal framework) and the forms of ‘harm’ it seeks to guard against. To apply data protection law to all forms of synthetic data without such reflection threatens to overstretch the concept of ‘personal data’ and warp the function of data protection law.

However, until regulators, courts and policymakers address these issues head on, synthetic data developers and users should continue to follow data protection impact assessment and anonymisation best practices when assessing the identifiability of synthetic data and other data protection risks.

Further Reading

Relevant GDPR Provisions

Elizabeth Redrup Hill is a Senior Policy Analyst (Law and Regulation) in the Humanities Team at the PHG Foundation, a think tank with a special focus on genomics and personalised medicine that is part of the University of Cambridge.

Colin Mitchell is Head of Humanities at the PHG Foundation.

See all previous briefs.

Please note that GDPR Briefs neither constitute nor should be relied upon as legal advice. Briefs represent a consensus position among Forum Members regarding the current understanding of the GDPR and its implications for genomic and health-related research. As such, they are no substitute for legal advice from a licensed practitioner in your jurisdiction.
