Tech Champion: Robert Riemann
Synthetic data is artificial data that is generated from original data and a model that is trained to reproduce the characteristics and structure of the original data. This means that synthetic data and original data should deliver very similar results when undergoing the same statistical analysis. The degree to which synthetic data is an accurate proxy for the original data is a measure of the utility of the method and the model.
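The idea of utility can be illustrated with a minimal sketch. It assumes a deliberately simple parametric model (a normal distribution for one column and a linear relation for the second) standing in for whatever synthesizer is actually used. The column meanings, function names, and sample sizes are hypothetical and chosen only for illustration. The same statistics are computed on both datasets, and the gaps between them serve as a rough utility indicator.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def fit_model(xs, ys):
    """Fit x ~ Normal and y ~ a + b*x + Normal noise (the 'model')."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx = statistics.stdev(xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    resid_sd = statistics.stdev([y - (a + b * x) for x, y in zip(xs, ys)])
    return mx, sx, a, b, resid_sd

def synthesize(model, n, rng):
    """Draw n synthetic (x, y) records from the fitted model."""
    mx, sx, a, b, resid_sd = model
    xs = [rng.gauss(mx, sx) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0.0, resid_sd) for x in xs]
    return xs, ys

rng = random.Random(0)
# Hypothetical "original" data: an age-like column and a correlated score.
orig_x = [rng.gauss(40.0, 10.0) for _ in range(5000)]
orig_y = [0.5 * x + rng.gauss(0.0, 5.0) for x in orig_x]

syn_x, syn_y = synthesize(fit_model(orig_x, orig_y), 5000, rng)

# Same statistical analysis on both datasets; small gaps suggest high utility.
mean_gap = abs(statistics.fmean(orig_y) - statistics.fmean(syn_y))
corr_gap = abs(pearson(orig_x, orig_y) - pearson(syn_x, syn_y))
print(f"mean gap: {mean_gap:.3f}, correlation gap: {corr_gap:.3f}")
```

Which statistics matter depends on the intended analysis; means and correlations are only one possible choice.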
The generation process, also called synthesis, can be performed using different techniques, such as decision trees or deep learning algorithms. Synthetic data can be classified with respect to the type of the original data: the first type employs real datasets, the second employs knowledge gathered by the analysts instead, and the third type is a combination of these two. Generative Adversarial Networks (GANs), introduced in 2014, are commonly used in the field of image recognition. A GAN generally consists of two neural networks that are trained against each other iteratively. The generator network produces synthetic images, and the discriminator network tries to tell them apart from real images.
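The interplay between generator and discriminator is usually written as a two-player minimax game. The formula below is a sketch in the notation of the original GAN formulation, not something defined in this text: G denotes the generator, D the discriminator, p_data the distribution of real samples, and p_z the noise distribution fed to the generator.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator raises V by assigning a high probability D(x) to real samples and a low probability to generated ones, while the generator lowers V by producing samples G(z) that the discriminator scores as real.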
A privacy assurance assessment should be performed to ensure that the resulting synthetic data is not actual personal data. This privacy assurance evaluates the extent to which data subjects can be identified in the synthetic data and how much new data about those data subjects would be revealed upon successful identification.
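One simple heuristic sometimes used as part of such an assessment is the distance to the closest record: synthetic rows that coincide with, or lie very close to, an original row may point to a data subject. The sketch below assumes purely numeric records on comparable scales. The records, the function names, and the threshold are hypothetical. It covers only the identification aspect and does not measure how much new information would be revealed.

```python
import math

def distance_to_closest_record(synthetic, original):
    """For each synthetic row, the Euclidean distance to its nearest original row."""
    return [min(math.dist(s, o) for o in original) for s in synthetic]

def share_too_close(synthetic, original, threshold):
    """Fraction of synthetic rows lying within `threshold` of some original row."""
    distances = distance_to_closest_record(synthetic, original)
    return sum(d <= threshold for d in distances) / len(distances)

# Hypothetical records: (age, income in thousands). Values are illustrative only.
original = [(34.0, 52.0), (45.0, 61.0), (29.0, 38.0), (61.0, 75.0)]
synthetic = [(34.0, 52.0), (40.0, 58.5), (55.0, 70.0)]

distances = distance_to_closest_record(synthetic, original)
leak_share = share_too_close(synthetic, original, threshold=0.5)
print(distances, leak_share)
```

Here the first synthetic row is an exact copy of an original row, so it would be flagged for review. In practice, features would need scaling and the threshold would need justification.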
Synthetic data is gaining traction within the machine learning domain. It helps train machine learning algorithms that need an immense amount of labeled training data, which can be costly or come with data usage restrictions. Moreover, manufacturers can use synthetic data for software testing and quality assurance. Synthetic data can help companies and researchers build the data repositories needed to train, or even pre-train, machine learning models; reusing a pre-trained model for a new task is referred to as transfer learning.
Positive foreseen impacts on data protection:
- Enhancing privacy in technologies: from a data protection by design approach, this technology could provide, upon a privacy assurance assessment, an added value for the privacy of individuals, whose personal data does not have to be disclosed.
- Improved fairness: synthetic data might help mitigate bias when fair synthetic datasets are used to train artificial intelligence models. These datasets are manipulated to represent the world less as it is and more as society would like it to be, for instance without gender-based or racial discrimination.
Negative foreseen impacts on data protection:
- Output control could be complex: especially in complex datasets, the best way to ensure the output is accurate and consistent is by comparing synthetic data with original data, or human-annotated data. However, this comparison again requires access to the original data.
- Difficulty in mapping outliers: synthetic data can only mimic real-world data; it is not a replica. Therefore, synthetic data may not cover some outliers present in the original data. However, outliers in the data can be more important than regular data points for some applications.
- Quality of the model depends on the data source: the quality of synthetic data is highly correlated with the quality of the original data and the data generation model. Synthetic data may reflect the biases in original data. Also, the manipulation of datasets to create fair synthetic datasets might result in inaccurate data.
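The point about outliers can be made concrete with a small sketch. It assumes a deliberately naive synthesizer (a single normal distribution fitted to the whole dataset) and hypothetical data with a handful of extreme values. More capable models may preserve tails better, so this shows the failure mode rather than a general result.

```python
import random
import statistics

rng = random.Random(1)
# Hypothetical original data: 1000 regular values plus 10 extreme outliers.
original = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [15.0] * 10

# Simple parametric synthesizer (an assumption): one normal distribution
# fitted to the whole dataset, outliers included.
mu, sigma = statistics.fmean(original), statistics.stdev(original)
synthetic = [rng.gauss(mu, sigma) for _ in range(len(original))]

def count_above(values, threshold):
    """Number of values strictly greater than threshold."""
    return sum(v > threshold for v in values)

threshold = 10.0
orig_outliers = count_above(original, threshold)
syn_outliers = count_above(synthetic, threshold)
print(f"values above {threshold}: original={orig_outliers}, synthetic={syn_outliers}")
```

The fitted model absorbs the outliers into a somewhat larger variance instead of reproducing them, so the synthetic sample will typically contain few or no values in the extreme region.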
Our three picks of suggested readings:
- T. E. Raghunathan, Synthetic data, Annual Review of Statistics and Its Application, 8, 129-140, 2021.
- F. K. Dankar, M. Ibrahim, Fake it till you make it: guidelines for effective synthetic data generation, Applied Sciences, 11(5), 2158, 2021.
- J. Hradec, M. Craglia, M. Di Leo, S. De Nigris, N. Ostlaender, N. Nicholson, Multipurpose synthetic population for policy applications, EUR 31116 EN, Publications Office of the European Union, Luxembourg, ISBN 978-92-76-53478-5 (online), doi:10.2760/50072 (online), JRC128595, 2022.