System, method, and computer-accessible medium for evaluating multi-dimensional synthetic data using integrated variants analysis
First Claim
1. A non-transitory computer-accessible medium having stored thereon computer-executable instructions for evaluating at least one synthetic dataset, wherein, when a computer arrangement executes the instructions, the computer arrangement is configured to perform procedures comprising:
receiving at least one original dataset;
receiving the at least one synthetic dataset;
training at least one model using the at least one original dataset and the at least one synthetic dataset;
generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset;
generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the at least one model, wherein the evaluation score includes (i) a statistical correlation score, (ii) a data similarity score, and (iii) a data quality score;
determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset potentially contains the synthetic data that is similar to the original data, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and
generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.
Abstract
An exemplary system, method, and computer-accessible medium can include, for example, receiving an original dataset(s), receiving a synthetic dataset(s), training a model(s) using the original dataset(s) and the synthetic dataset(s), and evaluating the synthetic dataset(s) based on the training of the model(s). The model(s) can include a first model and a second model, and the first model can be trained using the original dataset(s) and the second model can be trained using the synthetic dataset(s). The synthetic dataset(s) can be evaluated by comparing first results from the training of the first model to second results from the training of the second model.
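The two-model comparison described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the specification: the toy least-squares model, the function names `fit_line` and `compare_training_results`, and the choice of "largest parameter difference" as the comparison are all assumptions made for illustration.

```python
from statistics import mean


def fit_line(points):
    """Train a toy model: ordinary least-squares line y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def compare_training_results(original, synthetic):
    """Largest absolute parameter gap between the two trained models.

    Assumption: the "results from the training" are the fitted parameters.
    A gap of 0.0 means both datasets produced the same model.
    """
    first = fit_line(original)    # first model, trained on the original data
    second = fit_line(synthetic)  # second model, trained on the synthetic data
    return max(abs(a - b) for a, b in zip(first, second))


# Hypothetical data: both sets lie on y = 2x + 1 but share no records.
original = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
synthetic = [(0.5, 2.0), (1.5, 4.0), (2.5, 6.0), (3.5, 8.0)]
gap = compare_training_results(original, synthetic)
```

In this sketch a small gap suggests the synthetic data preserves the relationship a model would learn from the original data; any real evaluation would use whatever model and comparison metric the practitioner selects.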
20 Claims
1. A non-transitory computer-accessible medium having stored thereon computer-executable instructions for evaluating at least one synthetic dataset, wherein, when a computer arrangement executes the instructions, the computer arrangement is configured to perform procedures comprising:
receiving at least one original dataset;
receiving the at least one synthetic dataset;
training at least one model using the at least one original dataset and the at least one synthetic dataset;
generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset;
generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the at least one model, wherein the evaluation score includes (i) a statistical correlation score, (ii) a data similarity score, and (iii) a data quality score;
determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset potentially contains the synthetic data that is similar to the original data, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and
generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
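The region and suggestion steps of claim 1 could be sketched as follows. The claim does not state how the three component scores are combined or where the region boundaries lie, so the unweighted mean, the cut-offs `WARNING_AT` and `RED_FLAG_AT`, the 0-to-1 scale, and every name below are hypothetical choices for illustration only.

```python
from dataclasses import dataclass

# Assumed cut-offs on a 0..1 scale; the claim gives no numeric thresholds.
WARNING_AT = 0.5
RED_FLAG_AT = 0.8


@dataclass
class EvaluationScore:
    statistical_correlation: float
    data_similarity: float
    data_quality: float

    def overall(self) -> float:
        # Assumption: unweighted mean, with higher values meaning the
        # synthetic data is closer to the original data.
        return (
            self.statistical_correlation
            + self.data_similarity
            + self.data_quality
        ) / 3


def determine_region(score: EvaluationScore) -> str:
    """Map the evaluation score to one of the three claimed regions."""
    value = score.overall()
    if value >= RED_FLAG_AT:
        return "red flag"
    if value >= WARNING_AT:
        return "warning"
    return "normal"


def generate_suggestion(score: EvaluationScore, region: str) -> str:
    """Produce one of the two claimed suggestions."""
    if region == "normal":
        return f"Synthetic dataset is adequate (score {score.overall():.2f})."
    return (
        f"Warning: synthetic dataset potentially contains information "
        f"similar to the original dataset (region: {region}, "
        f"score {score.overall():.2f})."
    )


score = EvaluationScore(0.6, 0.6, 0.6)
region = determine_region(score)
suggestion = generate_suggestion(score, region)
```

Both the warning and red flag regions map to the warning suggestion here because the claim lists only two suggestion types; an implementation might distinguish them further.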
15. A method for evaluating at least one synthetic dataset, comprising:
(a) receiving at least one original dataset;
(b) generating the at least one synthetic dataset based on the at least one original dataset;
(c) training at least one first model using the at least one original dataset;
(d) training at least one second model using the at least one synthetic dataset;
(e) generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset using a computer arrangement, generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the at least one first model and the training of the at least one second model, wherein the evaluation score includes (i) a statistical correlation score, (ii) a data similarity score, and (iii) a data quality score;
determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset potentially contains the synthetic data that is similar to the original data, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and
generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.
Dependent claims: 16, 17, 18
19. A system, comprising:
a computer hardware arrangement configured to:
(a) receive at least one original dataset;
(b) receive at least one synthetic dataset related to the at least one original dataset;
(c) train at least one first model using the at least one original dataset;
(d) train at least one second model using the at least one synthetic dataset;
(e) generate a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset;
(f) generate an evaluation score by comparing first results from the training of the first model to second results from the training of the second model, wherein the evaluation score includes (i) a statistical correlation score, (ii) a data similarity score, and (iii) a data quality score;
(g) determine a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset potentially contains the synthetic data that is similar to the original data, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and
(h) generate a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset;
(i) modify the at least one synthetic dataset based on the comparison; and
(j) repeat procedures (d)-(f) until the comparison of the first results to the second results is less than a particular threshold.
Dependent claims: 20
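The modify-and-repeat loop of procedures (i) and (j) might look like the Python sketch below. It is a toy under stated assumptions: "training" is reduced to taking the dataset mean, the modification rule (shift every synthetic value by half the signed gap) is invented for illustration, the statistical correlation score of procedure (e) is omitted for brevity, and `train`, `refine_synthetic`, `threshold`, and `max_rounds` are hypothetical names.

```python
from statistics import mean


def train(dataset):
    """Toy stand-in for model training: the 'result' is the dataset mean."""
    return mean(dataset)


def refine_synthetic(original, synthetic, threshold=0.01, max_rounds=50):
    """Repeat train/compare/modify until the comparison falls below threshold.

    Returns the refined synthetic dataset and the number of modification
    rounds that were applied.
    """
    first_result = train(original)              # procedure (c)
    synthetic = list(synthetic)                 # work on a copy
    for round_number in range(max_rounds):
        second_result = train(synthetic)        # procedure (d)
        gap = first_result - second_result      # procedure (f), signed comparison
        if abs(gap) < threshold:                # stopping rule of procedure (j)
            return synthetic, round_number
        # Procedure (i): assumed rule, move each value halfway toward closing the gap.
        synthetic = [value + 0.5 * gap for value in synthetic]
    raise RuntimeError("comparison did not fall below the threshold")


refined, rounds = refine_synthetic([1.0, 2.0, 3.0], [5.0, 6.0, 7.0])
```

The `max_rounds` guard is a design choice of this sketch, not part of the claim; it prevents an endless loop if a modification rule fails to converge.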
Specification