# The Synthetic Data Vault

Authors: Patki, N. and Wedge, R. and Veeramachaneni, K.

Abstract: The goal of this paper is to build a system that automatically creates synthetic data to enable data science endeavors. To achieve this, we present the Synthetic Data Vault (SDV), a system that builds generative models of relational databases. We are able to sample from the model and create synthetic data, hence the name SDV. When implementing the SDV, we also developed an algorithm that computes statistics at the intersection of related database tables. We then used a state-of-the-art multivariate modeling approach to model this data. The SDV iterates through all possible relations, ultimately creating a model for the entire database. Once this model is computed, the same relational information allows the SDV to synthesize data by sampling from any part of the database. After building the SDV, we used it to generate synthetic data for five different publicly available datasets. We then published these datasets, and asked data scientists to develop predictive models for them as part of a crowdsourced experiment. By analyzing the outcomes, we show that synthetic data can successfully replace original data for data science. Our analysis indicates that there is no significant difference in the work produced by data scientists who used synthetic data as opposed to real data. We conclude that the SDV is a viable solution for synthetic data generation.
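The pipeline the abstract describes (summarize child tables at the level of their parent, fit a multivariate model, then sample) can be illustrated with a minimal sketch. This is not the SDV implementation. The toy `customers` and `orders` tables, the column names, and every function below are hypothetical, and a plain multivariate Gaussian stands in for the multivariate modeling approach, which the abstract does not specify.

```python
import math
import random
import statistics

# Hypothetical toy database: a parent table and a child table linked by customer_id.
customers = [
    {"customer_id": 1, "age": 34.0}, {"customer_id": 2, "age": 51.0},
    {"customer_id": 3, "age": 27.0}, {"customer_id": 4, "age": 45.0},
    {"customer_id": 5, "age": 39.0}, {"customer_id": 6, "age": 62.0},
]
orders = [
    {"customer_id": 1, "amount": 20.0}, {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 80.0}, {"customer_id": 2, "amount": 95.0},
    {"customer_id": 2, "amount": 70.0}, {"customer_id": 3, "amount": 15.0},
    {"customer_id": 4, "amount": 60.0}, {"customer_id": 4, "amount": 55.0},
    {"customer_id": 5, "amount": 40.0}, {"customer_id": 5, "amount": 48.0},
    {"customer_id": 6, "amount": 90.0}, {"customer_id": 6, "amount": 110.0},
    {"customer_id": 6, "amount": 100.0}, {"customer_id": 6, "amount": 85.0},
]
COLUMNS = ["age", "n_children", "child_mean", "child_std"]


def extend_parent(parents, children, key, column):
    """Append per-parent summaries of the related child rows to each parent row."""
    extended = []
    for parent in parents:
        values = [c[column] for c in children if c[key] == parent[key]]
        row = dict(parent)
        row["n_children"] = float(len(values))
        row["child_mean"] = statistics.mean(values) if values else 0.0
        row["child_std"] = statistics.pstdev(values) if values else 0.0
        extended.append(row)
    return extended


def fit_gaussian(rows, columns):
    """Estimate the mean vector and sample covariance matrix of the given columns."""
    data = [[r[c] for c in columns] for r in rows]
    n, d = len(data), len(columns)
    means = [sum(col) / n for col in zip(*data)]
    cov = [[sum((x[i] - means[i]) * (x[j] - means[j]) for x in data) / (n - 1)
            for j in range(d)] for i in range(d)]
    return means, cov


def cholesky(matrix, jitter=1e-9):
    """Lower-triangular factor L with L @ L.T ~= matrix (jitter keeps pivots positive)."""
    n = len(matrix)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(max(matrix[i][i] - s, 0.0) + jitter)
            else:
                L[i][j] = (matrix[i][j] - s) / L[j][j]
    return L


def synthesize(means, cov, rng):
    """Sample one extended parent row, then draw its child rows from the sampled summaries."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in means]
    sampled = [means[i] + sum(L[i][k] * z[k] for k in range(i + 1))
               for i in range(len(means))]
    values = dict(zip(COLUMNS, sampled))
    n_children = max(0, round(values["n_children"]))
    spread = max(values["child_std"], 0.0)
    children = [{"amount": rng.gauss(values["child_mean"], spread)}
                for _ in range(n_children)]
    return {"age": values["age"]}, children


extended = extend_parent(customers, orders, "customer_id", "amount")
means, cov = fit_gaussian(extended, COLUMNS)
parent, children = synthesize(means, cov, random.Random(0))
print(parent, children)
```

Because the child summaries are modeled jointly with the parent's own columns, a sampled parent carries enough information to generate its related child rows. That is the sense in which the relational structure is reused at sampling time in this sketch.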

### Citation (Chicago Manual of Style 17th edition)

Patki, N., R. Wedge, and K. Veeramachaneni. 2016. “The Synthetic Data Vault.” In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 399–410. https://doi.org/10.1109/DSAA.2016.49.

### BibTeX

@inproceedings{7796926,
author = {{Patki}, N. and {Wedge}, R. and {Veeramachaneni}, K.},
booktitle = {2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
title = {The Synthetic Data Vault},
year = {2016},
pages = {399--410},
keywords = {data analysis;relational databases;synthetic data vault;SDV;generative model;relational database;multivariate modelling;predictive model;data science;Data models;Databases;Computational modeling;Predictive models;Hidden Markov models;Numerical models;Synthetic data generation;crowd sourcing;predictive modeling},
doi = {10.1109/DSAA.2016.49},
month = oct
}