Method for generating synthetic data sets at scale with non-redundant partitioning
First Claim
1. A method comprising:
receiving, by a clustering module, a plurality of data sets, wherein each data set of the plurality of data sets includes a plurality of attributes;
partitioning, by the clustering module, the plurality of data sets into a plurality of clustered data sets including at least a first clustered data set and a second clustered data set, wherein each data set of the plurality of data sets is partitioned into one of the plurality of clustered data sets;
assigning, by a training module, a respective stochastic model to each respective clustered data set of the plurality of clustered data sets including:
assigning a first stochastic model to the first clustered data set, and
assigning a second stochastic model to the second clustered data set;
selecting, by a first machine including a first memory and one or more processors in communication with the first memory, the first clustered data set and the first stochastic model;
selecting, by a second machine that is different from the first machine, the second machine including a second memory and one or more processors in communication with the second memory, the second clustered data set and the second stochastic model;
generating, by the first machine with the first stochastic model, a first synthetic data set, wherein the first synthetic data set has generated data for each one of the plurality of attributes;
generating, by the second machine with the second stochastic model, a second synthetic data set, wherein the second synthetic data set has generated data for each one of the plurality of attributes; and
testing at least one of an application and a database using each of the first synthetic data set and the second synthetic data set.
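By way of non-limiting illustration, the steps recited in claim 1 may be sketched in Python as follows. The sketch rests on assumptions that the claim does not state: a data set is taken to be a list of records with numeric attributes, clustering is a toy nearest-centroid assignment on each data set's attribute means, and each stochastic model is an independent Gaussian per attribute. All function names, attribute names, and values are hypothetical, and the distribution of work across separate machines is omitted here.

```python
import random
import statistics

def mean_vector(data_set, attributes):
    """Summarize one data set by the mean of each attribute."""
    return [statistics.fmean([rec[a] for rec in data_set]) for a in attributes]

def partition(data_sets, attributes, centroids):
    """Place every data set into exactly one cluster (nearest centroid)."""
    clusters = [[] for _ in centroids]
    for ds in data_sets:
        vec = mean_vector(ds, attributes)
        dists = [sum((v - c) ** 2 for v, c in zip(vec, cen)) for cen in centroids]
        clusters[dists.index(min(dists))].append(ds)
    return clusters

def fit_gaussian_model(cluster, attributes):
    """Assumed stochastic model: (mean, standard deviation) per attribute."""
    records = [rec for ds in cluster for rec in ds]
    return {
        a: (statistics.fmean([r[a] for r in records]),
            statistics.pstdev([r[a] for r in records]))
        for a in attributes
    }

def generate(model, attributes, n_records, rng):
    """Draw synthetic records that carry a value for every attribute."""
    return [{a: rng.gauss(*model[a]) for a in attributes} for _ in range(n_records)]

ATTRIBUTES = ["age", "balance"]
data_sets = [
    [{"age": 20, "balance": 100.0}, {"age": 22, "balance": 120.0}],
    [{"age": 21, "balance": 90.0}, {"age": 23, "balance": 110.0}],
    [{"age": 60, "balance": 5000.0}, {"age": 64, "balance": 5200.0}],
]
centroids = [[20, 100.0], [60, 5000.0]]  # hypothetical, fixed for the example

clusters = partition(data_sets, ATTRIBUTES, centroids)
models = [fit_gaussian_model(c, ATTRIBUTES) for c in clusters]
rng = random.Random(0)
synthetic = [generate(m, ATTRIBUTES, 5, rng) for m in models]
```

In this sketch each element of `synthetic` corresponds to one clustered data set, and every generated record contains all of the attributes, which is the property the generating steps of the claim require.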
Abstract
An example system includes a first machine and a second machine, a clustering module, and a training module. The clustering module receives a plurality of data sets, each including attributes. The clustering module partitions the plurality of data sets into a first clustered data set and a second clustered data set. Each data set of the plurality of data sets is partitioned. The training module assigns a first stochastic model to the first clustered data set and a second stochastic model to the second clustered data set. The first machine selects the first clustered data set and the first stochastic model and generates a first synthetic data set having generated data for each one of the attributes. The second machine selects the second clustered data set and the second stochastic model and generates a second synthetic data set having generated data for each one of the attributes.
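As a hedged illustration of the partitioning property, the following Python sketch checks that every data set lands in one, and only one, clustered data set. Reading "partitioned into one of the plurality of clustered data sets" as "exactly one" is an assumption made for this example, and the function name and data set identifiers are hypothetical.

```python
def is_non_redundant_partition(data_set_ids, clusters):
    """True if each data set id appears in exactly one cluster."""
    assigned = [ds_id for cluster in clusters for ds_id in cluster]
    no_duplicates = len(assigned) == len(set(assigned))
    covers_all = set(assigned) == set(data_set_ids)
    return no_duplicates and covers_all

ids = ["ds_a", "ds_b", "ds_c", "ds_d"]
good = [["ds_a", "ds_c"], ["ds_b", "ds_d"]]
redundant = [["ds_a", "ds_c"], ["ds_c", "ds_b", "ds_d"]]  # ds_c appears twice
incomplete = [["ds_a"], ["ds_b", "ds_d"]]                 # ds_c is missing

print(is_non_redundant_partition(ids, good))
print(is_non_redundant_partition(ids, redundant))
print(is_non_redundant_partition(ids, incomplete))
```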
19 Claims
1. A method comprising:
receiving, by a clustering module, a plurality of data sets, wherein each data set of the plurality of data sets includes a plurality of attributes;
partitioning, by the clustering module, the plurality of data sets into a plurality of clustered data sets including at least a first clustered data set and a second clustered data set, wherein each data set of the plurality of data sets is partitioned into one of the plurality of clustered data sets;
assigning, by a training module, a respective stochastic model to each respective clustered data set of the plurality of clustered data sets including:
assigning a first stochastic model to the first clustered data set, and
assigning a second stochastic model to the second clustered data set;
selecting, by a first machine including a first memory and one or more processors in communication with the first memory, the first clustered data set and the first stochastic model;
selecting, by a second machine that is different from the first machine, the second machine including a second memory and one or more processors in communication with the second memory, the second clustered data set and the second stochastic model;
generating, by the first machine with the first stochastic model, a first synthetic data set, wherein the first synthetic data set has generated data for each one of the plurality of attributes;
generating, by the second machine with the second stochastic model, a second synthetic data set, wherein the second synthetic data set has generated data for each one of the plurality of attributes; and
testing at least one of an application and a database using each of the first synthetic data set and the second synthetic data set.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A system comprising:
a first machine including a first memory and one or more processors in communication with the first memory;
a second machine that is different from the first machine, the second machine including a second memory and one or more processors in communication with the second memory;
a clustering module, wherein the clustering module:
receives a plurality of data sets, wherein each data set of the plurality of data sets includes a plurality of attributes; and
partitions the plurality of data sets into a plurality of clustered data sets including at least a first clustered data set and a second clustered data set, wherein each data set of the plurality of data sets is partitioned into one of the plurality of clustered data sets; and
a training module, wherein the training module:
assigns a respective stochastic model to each respective clustered data set of the plurality of clustered data sets including:
assigning a first stochastic model to the first clustered data set, and
assigning a second stochastic model to the second clustered data set;
wherein the first machine:
selects the first clustered data set and the first stochastic model; and
generates, with the first stochastic model, a first synthetic data set, wherein the first synthetic data set has generated data for each one of the plurality of attributes; and
wherein the second machine:
selects the second clustered data set and the second stochastic model; and
generates, with the second stochastic model, a second synthetic data set, wherein the second synthetic data set has generated data for each one of the plurality of attributes, such that at least one of an application and a database is tested using each of the first synthetic data set and the second synthetic data set.
Dependent claims: 15, 16, 17, 18
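By way of example only, the selecting and generating performed by the two machines of claim 14 may be sketched in Python as below. Two worker threads stand in for the two physically distinct machines, which is a simplification for a single-process example and not how the claim defines them. The model parameters, attribute names, and the `machine` function are hypothetical, and the Gaussian model is an assumed choice of stochastic model.

```python
import queue
import random
from concurrent.futures import ThreadPoolExecutor

ATTRIBUTES = ("latency_ms", "payload_kb")

# Hypothetical per-cluster models: (mean, standard deviation) per attribute.
MODELS = {
    "cluster_1": {"latency_ms": (20.0, 2.0), "payload_kb": (4.0, 0.5)},
    "cluster_2": {"latency_ms": (200.0, 30.0), "payload_kb": (64.0, 8.0)},
}

def machine(name, work, n_records=3):
    """Select one unclaimed (cluster, model) pair, then generate from it."""
    cluster_id, model = work.get_nowait()  # the "selecting" step
    rng = random.Random(cluster_id)        # private generator, seeded for repeatability
    rows = [{a: rng.gauss(*model[a]) for a in ATTRIBUTES} for _ in range(n_records)]
    return name, cluster_id, rows

# Shared queue of work items; each item is claimed by exactly one worker.
work = queue.Queue()
for item in MODELS.items():
    work.put(item)

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(machine, f"machine_{i}", work) for i in (1, 2)]
    results = [f.result() for f in futures]

for name, cluster_id, rows in results:
    print(name, cluster_id, len(rows))
```

Which worker claims which cluster is not fixed in this sketch. The queue only ensures that no clustered data set is processed twice.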
19. A computer readable non-transitory storage medium comprising executable instructions that, when executed, cause
a clustering module to:
receive a plurality of data sets, wherein each data set of the plurality of data sets includes a plurality of attributes; and
partition the plurality of data sets into a plurality of clustered data sets including at least a first clustered data set and a second clustered data set, wherein each data set of the plurality of data sets is partitioned into one of the plurality of clustered data sets;
a training module to:
assign a respective stochastic model to each respective clustered data set of the plurality of clustered data sets including:
assigning a first stochastic model to the first clustered data set, and
assigning a second stochastic model to the second clustered data set;
a first machine including a first memory and one or more processors in communication with the first memory to:
select the first clustered data set and the first stochastic model; and
generate, with the first stochastic model, a first synthetic data set, wherein the first synthetic data set has generated data for each one of the plurality of attributes; and
a second machine that is different from the first machine, the second machine including a first memory and one or more processors in communication with the first memory to:
select the second clustered data set and the second stochastic model; and
generate, with the second stochastic model, a second synthetic data set, wherein the second synthetic data set has generated data for each one of the plurality of attributes, such that at least one of an application and a database is tested using each of the first synthetic data set and the second synthetic data set.
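As one possible illustration of the final limitation, in which a database is tested using each of the two synthetic data sets, the Python sketch below loads two synthetic data sets into an in-memory SQLite database and checks that the schema constraints accept them. The `orders` table, its columns, the generator parameters, and the helper names are all assumptions made for this example and do not come from the claims.

```python
import random
import sqlite3

def synthetic_rows(mean_amount, n, seed):
    """Hypothetical generator: (id, amount) rows drawn around a cluster mean."""
    rng = random.Random(seed)
    return [(i, round(abs(rng.gauss(mean_amount, mean_amount * 0.1)), 2))
            for i in range(n)]

def load_and_check(rows):
    """Insert rows into a fresh database and return (row count, total amount)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "amount REAL NOT NULL CHECK (amount >= 0))"
    )
    # A constraint violation here would raise sqlite3.IntegrityError.
    conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
    count, total = conn.execute(
        "SELECT COUNT(*), SUM(amount) FROM orders"
    ).fetchone()
    conn.close()
    return count, total

first_synthetic = synthetic_rows(50.0, 100, seed=1)     # stands in for the first data set
second_synthetic = synthetic_rows(5000.0, 100, seed=2)  # stands in for the second data set

results = [load_and_check(rows) for rows in (first_synthetic, second_synthetic)]
```

Running the same check against both data sets exercises the database with two differently distributed workloads, one per clustered data set.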
Specification