Is the future of privacy synthetic?

Thomas Zerdick, Head of Technology and Privacy


IPEN workshops bring together privacy experts and engineers from public authorities, industry, academia and civil society to discuss relevant challenges and developments for the technological implementation of data protection and privacy in real life.

A key element for the identification of the material scope of EU data protection rules, such as the General Data Protection Regulation, is the definition of personal data: the principles of data protection apply to any information concerning an identified or identifiable natural person.

In contrast, the principles of data protection do not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the individual is not or no longer identifiable.

The availability of large amounts of personal data, the increasing level of interconnection among digital systems, the growing and cheaper computing power and the capabilities to draw inferences using Artificial Intelligence (AI) all challenge the very concept of anonymous data.

At the same time, one technology gaining momentum in the AI context promises to provide both robust privacy protection and the ability to generate useful data when real data are not available: data synthesis.

The Organisation for Economic Co-operation and Development (OECD) defines synthetic data as:

“An approach to confidentiality where instead of disseminating real data, synthetic data that have been generated from one or more population models are released.”

The idea behind synthetic data generation is to take an original data source (dataset) and create new, artificial data with similar statistical properties. Preserving the statistical properties means that anyone analysing the synthetic data, a data analyst for example, should be able to draw the same statistical conclusions as they would from the real (original) data.
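To make the idea concrete, here is a minimal sketch in Python. It is an illustration only, not a method presented at the webinar. It assumes a deliberately simple population model (a bivariate normal distribution) and a hypothetical two-column dataset of age and blood pressure, which is itself simulated. The function names `fit_bivariate_normal` and `synthesize` are invented for this example. Real synthesis tools use far richer models.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def fit_bivariate_normal(rows):
    """Estimate the means, standard deviations and correlation of two columns."""
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    n = len(rows)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def synthesize(rows, n_samples, seed=0):
    """Sample new records from the fitted model instead of copying original ones."""
    mx, my, sx, sy, rho = fit_bivariate_normal(rows)
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 and z2 reproduces the estimated correlation between columns.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        out.append((x, y))
    return out


# Hypothetical "original" dataset: (age, blood pressure), simulated for this sketch.
rng = random.Random(42)
original = []
for _ in range(5000):
    age = rng.gauss(50, 12)
    original.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

synthetic = synthesize(original, n_samples=5000)

# Compare the summary statistics an analyst would typically look at.
for name, data in (("original", original), ("synthetic", synthetic)):
    mx, my, sx, sy, rho = fit_bivariate_normal(data)
    print(f"{name:9s} mean=({mx:.1f}, {my:.1f}) sd=({sx:.1f}, {sy:.1f}) corr={rho:.2f}")
```

The two printed lines should show closely matching means, standard deviations and correlation, even though no synthetic record is taken from the original dataset. Matching statistics alone does not show how much privacy risk remains. That has to be assessed separately.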

A popular example is a website that generates synthetic photos of people.

So how useful could synthetic data be in the context of data protection?

To explore this question in greater detail, the EDPS IPEN webinar “Synthetic data: what use cases as a privacy enhancing technology?”, held on 17 June 2021, brought together approximately 170 expert practitioners and privacy professionals from both sides of the Atlantic, mostly from industry but also from academia.

The focus of the discussion was on the use of synthetic data instead of real data as an applied privacy measure in certain domains (mostly healthcare) and in use cases, such as AI and data science projects or software testing and simulations for technology assessments.

The challenge is to have data that are still useful for the set purposes, e.g. medical research, because they maintain the same statistical properties as the original data, but are no longer those originally collected from individuals.

Every speaker shared their view on how data synthesis compares to the “classical” methods of anonymising (original) data, and whether it provides a relevant benefit. Methodologies have been proposed to calculate how much the use of synthetic data could reduce privacy risks, and there are even commonly used thresholds for acceptable privacy risk.

Independently of the discussion of whether synthetic data are personal data or not, it seems reasonable to consider that, from a data protection by design perspective, this technology provides added value for the privacy of individuals when compared to the disclosure of the original data.

At the same time, the webinar clearly showed that the debate is still open on whether the privacy benefits of using data synthesis are meaningful. The debate raises an important question: which threat scenarios should be considered to prevent harm to individuals, and what further measures and conditions are necessary to avoid re-identification and other privacy risks stemming from the use of synthetic data?

More insights are needed. The EDPS and the IPEN project will therefore continue to follow the development of privacy engineering and the state of the art of synthetic data. 

The video recordings and speakers' presentations for each session are available on the IPEN event webpage.