SDV: An Open Source Library for Synthetic Data Generation

Authors: Montanez, Andrew

Abstract: In this thesis, I designed three open source Python libraries with the intention of creating a robust system that can accurately generate synthetic data. The goal of this thesis was to separate the different components of synthetic data generation into their own libraries. We identified these components as a way to transform the data, a way to model the data, and a way to recursively traverse the data set to model the relationships between the tables as well as the data set itself. Once the libraries were implemented and functioning, we designed a program to run the synthetic data generation process in parallel on subsets of the original data. The goal of this program was to see whether the overall modeling time could be reduced by modeling subsets in parallel and then averaging the parameters. Finally, we tested how close these averaged parameters are to the parameters fitted on the original data to determine whether this is a valid modeling technique.
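
The parallel experiment described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a single numeric column modeled as a univariate Gaussian, and the function names `fit_gaussian` and `fit_by_averaging` are hypothetical. It uses threads only to keep the example short and self-contained; a real CPU-bound workload would more likely use separate processes.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def fit_gaussian(values):
    """Fit a univariate Gaussian (assumed model); return (mean, variance)."""
    return statistics.fmean(values), statistics.pvariance(values)


def fit_by_averaging(values, n_subsets):
    """Fit each subset independently, then average the fitted parameters."""
    # Striding yields equally sized subsets when len(values) is divisible
    # by n_subsets, so a plain (unweighted) average is appropriate here.
    subsets = [values[i::n_subsets] for i in range(n_subsets)]
    with ThreadPoolExecutor(max_workers=n_subsets) as pool:
        fits = list(pool.map(fit_gaussian, subsets))
    mean = statistics.fmean([m for m, _ in fits])
    variance = statistics.fmean([v for _, v in fits])
    return mean, variance


# Synthetic stand-in for one column of the original data.
rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(10_000)]

full_mean, full_var = fit_gaussian(data)
avg_mean, avg_var = fit_by_averaging(data, n_subsets=4)

# How far the averaged parameters are from the full-data fit.
mean_gap = abs(full_mean - avg_mean)
var_gap = abs(full_var - avg_var)
print(f"mean gap: {mean_gap:.6f}, variance gap: {var_gap:.6f}")
```

With equally sized subsets the averaged mean matches the full-data mean up to floating-point error, while the averaged variance comes out slightly lower because it omits the spread between subset means. That gap is the kind of discrepancy a validity check on averaged parameters would need to measure.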

Citation (Chicago Manual of Style 17th edition)

Montanez, Andrew. 2018. “SDV: An Open Source Library for Synthetic Data Generation.” Master's thesis, Cambridge, Massachusetts: Massachusetts Institute of Technology.


  author = {Montanez, Andrew},
  title = {SDV: An Open Source Library for Synthetic Data Generation},
  school = {Massachusetts Institute of Technology},
  year = {2018},
  address = {Cambridge, Massachusetts},
  month = sep,
  x-download = {}

