Abstract:
In this thesis, we designed three open source Python libraries with the intention of
creating a robust system that can accurately generate synthetic data. The goal of
this thesis was to separate the different components of synthetic data generation
into their own libraries. We identified these components as a way to
transform the data, a way to model the data, and a way to recursively traverse the
data set to model the relationships between the tables as well as the data set itself.
Once the libraries were implemented and functioning, we designed a program to
run the synthetic data generation process in parallel on subsets of the original data.
The goal of this program was to see whether the overall modeling time could be reduced
by modeling subsets in parallel and then averaging the parameters. Finally, we
tested how close these averaged parameters were to the parameters obtained by modeling
the full data set, to see whether this is a valid modeling technique.
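The parallel experiment described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the model is a plain multivariate Gaussian whose parameters are a mean vector and a covariance matrix, and the names `fit_gaussian` and `fit_by_averaging` are hypothetical. A thread pool is used only to keep the example short; a process pool or a cluster would be the more likely choice when the aim is to reduce modeling time.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def fit_gaussian(data):
    """Fit a multivariate Gaussian; return (mean vector, covariance matrix)."""
    return data.mean(axis=0), np.cov(data, rowvar=False)


def fit_by_averaging(data, n_subsets=4):
    """Fit each subset in parallel, then average the per-subset parameters."""
    subsets = np.array_split(data, n_subsets)
    with ThreadPoolExecutor(max_workers=n_subsets) as pool:
        fits = list(pool.map(fit_gaussian, subsets))
    avg_mean = np.mean([mean for mean, _ in fits], axis=0)
    avg_cov = np.mean([cov for _, cov in fits], axis=0)
    return avg_mean, avg_cov


# Synthetic stand-in for a numeric table (assumed data, not from the thesis).
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0.0, 5.0], [[1.0, 0.6], [0.6, 2.0]], size=8000)

full_mean, full_cov = fit_gaussian(data)
avg_mean, avg_cov = fit_by_averaging(data)

# Largest absolute difference between averaged and full-data parameters.
mean_gap = np.abs(full_mean - avg_mean).max()
cov_gap = np.abs(full_cov - avg_cov).max()
print(mean_gap, cov_gap)
```

With equal-sized subsets, the averaged mean matches the full-data mean up to floating-point error, while the averaged covariance omits the variation between subset means, so a small gap is expected there.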
Citation (Chicago Manual of Style 17th edition)
Montanez, Andrew. 2018. “SDV: An Open Source Library for Synthetic Data
Generation.” Master's thesis, Cambridge, Massachusetts: Massachusetts Institute of Technology.
BibTeX
@mastersthesis{mastersthesiu,
author = {Montanez, Andrew},
title = {SDV: An Open Source Library for Synthetic Data
Generation},
school = {Massachusetts Institute of Technology},
year = {2018},
address = {Cambridge, Massachusetts},
month = sep,
x-download = {https://dai.lids.mit.edu/wp-content/uploads/2018/12/Andrew_MEng.pdf}
}