System, method, and computer-accessible medium for evaluating multi-dimensional synthetic data using integrated variants analysis

US 10,635,939 B2
Filed: 10/04/2018
Issued: 04/28/2020
Est. Priority Date: 07/06/2018
Status: Active Grant

Abstract

An exemplary system, method, and computer-accessible medium can include, for example, receiving an original dataset(s), receiving a synthetic dataset(s), training a model(s) using the original dataset(s) and the synthetic dataset(s), and evaluating the synthetic dataset(s) based on the training of the model(s). The model(s) can include a first model and a second model, and the first model can be trained using the original dataset(s) and the second model can be trained using the synthetic dataset(s). The synthetic dataset(s) can be evaluated by comparing first results from the training of the first model to second results from the training of the second model.
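As a rough illustration of the comparison described in the abstract, the sketch below fits the same simple model once on an original dataset and once on a synthetic dataset, then compares the two results. The least-squares line fit, the toy data, and the names `fit_line` and `mean_squared_error` are assumptions made for this example; the abstract does not prescribe a particular model or metric.

```python
from statistics import mean


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def mean_squared_error(model, points):
    """Mean squared residual of the fitted line on the given points."""
    a, b = model
    return mean((y - (a * x + b)) ** 2 for x, y in points)


# Hypothetical toy data standing in for the original and synthetic datasets.
original = [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 9.0)]
synthetic = [(0.5, 2.1), (1.5, 3.9), (2.5, 6.2), (3.5, 7.8), (4.5, 10.1)]

first_model = fit_line(original)    # first model: trained on the original data
second_model = fit_line(synthetic)  # second model: trained on the synthetic data

# Score both models on the original data and compare the results.
first_error = mean_squared_error(first_model, original)
second_error = mean_squared_error(second_model, original)
gap = abs(first_error - second_error)
```

A small `gap` suggests the synthetic data leads to a model that behaves like the one trained on the original data; how small is small enough is a design choice the abstract leaves open.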

20 Claims

1. A non-transitory computer-accessible medium having stored thereon computer-executable instructions for evaluating at least one synthetic dataset, wherein, when a computer arrangement executes the instructions, the computer arrangement is configured to perform procedures comprising:
- receiving at least one original dataset;
  
  receiving the at least one synthetic dataset;
  
  training at least one model using the at least one original dataset and the at least one synthetic dataset;
  
  generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset;
  
  generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the at least one model, wherein the evaluation score includes (i) a statistical correlation score, (ii) a data similarity score, and (iii) a data quality score;
  
  determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset potentially contains the synthetic data that is similar to the original data, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and
  
  generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.
- - 2. The computer-accessible medium of claim 1, wherein the at least one model includes a first model and a second model, and wherein the computer arrangement is further configured to:
    - train the first model using the at least one original dataset; and
      
      train the second model using the at least one synthetic dataset.
  - 3. The computer-accessible medium of claim 2, wherein the computer arrangement is configured to evaluate the at least one synthetic dataset by comparing first results from the training of the first model to second results from the training of the second model.
  - 4. The computer-accessible medium of claim 3, wherein the computer arrangement is configured to compare the first results to the second results using an analysis of variance procedure.
  - 5. The computer-accessible medium of claim 2, wherein the computer arrangement is configured to compare the first results to the second results using a threshold procedure.
  - 6. The computer-accessible medium of claim 5, wherein the threshold procedure includes:
    - summing first errors from the first results;
      
      summing second errors from the second results; and
      
      comparing the summed first errors to the summed second errors.
  - 7. The computer-accessible medium of claim 6, wherein the computer arrangement is configured to compare the summed first errors to the summed second errors using a threshold criterion.
  - 8. The computer-accessible medium of claim 5, wherein the threshold procedure includes determining a further statistical correlation based on a plurality of covariance matrices.
  - 9. The computer-accessible medium of claim 2, wherein the first model is equivalent to the second model.
  - 10. The computer-accessible medium of claim 1, wherein the at least one model is a classification model.
  - 11. The computer-accessible medium of claim 1, wherein the computer arrangement is further configured to generate the at least one synthetic dataset.
  - 12. The computer-accessible medium of claim 11, wherein the computer arrangement is configured to generate the at least one synthetic dataset based on the at least one original dataset.
  - 13. The computer-accessible medium of claim 1, wherein the computer arrangement is further configured to generate at least one further synthetic dataset based on (i) the at least one synthetic dataset and (ii) the evaluation of the at least one synthetic dataset.
  - 14. The computer-accessible medium of claim 1, wherein the at least one original dataset and the at least one synthetic dataset include at least one of (i) biographical information regarding a plurality of customers or (ii) financial information regarding the plurality of customers.
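The threshold procedure recited in claims 5 to 7 (sum the first errors, sum the second errors, compare the sums against a threshold criterion) can be sketched as follows. The use of absolute error, the example predictions, and the names `summed_error` and `within_threshold` are assumptions for illustration only; the claims do not fix an error measure or a threshold value.

```python
def summed_error(predictions, targets):
    """Sum of absolute errors between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    return sum(abs(p - t) for p, t in zip(predictions, targets))


def within_threshold(first_sum, second_sum, threshold):
    """True when the two summed errors differ by at most `threshold`."""
    return abs(first_sum - second_sum) <= threshold


# Hypothetical held-out targets and the predictions of each model.
targets = [1.0, 2.0, 3.0, 4.0]
first_predictions = [1.1, 1.9, 3.2, 3.9]   # model trained on original data
second_predictions = [1.3, 1.6, 3.4, 3.5]  # model trained on synthetic data

first_sum = summed_error(first_predictions, targets)
second_sum = summed_error(second_predictions, targets)

# Hypothetical threshold criterion: the sums may differ by at most 0.5.
comparable = within_threshold(first_sum, second_sum, threshold=0.5)
```

With these toy numbers the summed errors differ by more than the chosen threshold, so `comparable` is `False`; a looser threshold would accept the same pair.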

15. A method for evaluating at least one synthetic dataset, comprising:
- (a) receiving at least one original dataset;
  
  (b) generating the at least one synthetic dataset based on the at least one original dataset;
  
  (c) training at least one first model using the at least one original dataset;
  
  (d) training at least one second model using the at least one synthetic dataset;
  
  (e) generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset;
  
  using a computer arrangement, generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the at least one first model and the training of the at least one second model, wherein the evaluation score includes (i) a statistical correlation score, (ii) a data similarity score, and (iii) a data quality score;
  
  determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset potentially contains the synthetic data that is similar to the original data, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and
  
  generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.
- - 16. The method of claim 15, further comprising generating at least one further synthetic dataset based on the evaluation score and the at least one synthetic dataset.
  - 17. The method of claim 16, further comprising training the at least one second model based on the at least one further synthetic dataset.
  - 18. The method of claim 17, further comprising evaluating the at least one further synthetic dataset based on the training of the at least one second model on the at least one further synthetic dataset.
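The region and suggestion steps of the independent claims can be sketched as a simple mapping from a score to one of three regions. This is a minimal sketch under stated assumptions: the score is taken to be a single number in [0, 1] where higher means the synthetic data is more likely to resemble the original data, and the cut-offs `warning_at=0.5` and `red_flag_at=0.8` are hypothetical. The claims do not say how the three component scores are combined or where the region boundaries lie.

```python
from enum import Enum


class Region(Enum):
    NORMAL = "normal"
    WARNING = "warning"
    RED_FLAG = "red flag"


def determine_region(score, warning_at=0.5, red_flag_at=0.8):
    """Map a similarity-risk score in [0, 1] to a region (hypothetical cut-offs)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= red_flag_at:
        return Region.RED_FLAG
    if score >= warning_at:
        return Region.WARNING
    return Region.NORMAL


def suggest(region):
    """Return one of the two suggestion types named in the claims."""
    if region is Region.NORMAL:
        return "The synthetic dataset is adequate."
    return ("Warning: the synthetic dataset potentially contains "
            "information similar to the original dataset.")


region = determine_region(0.65)
message = suggest(region)
```

In this sketch both the warning and red flag regions produce the warning suggestion, since the claims list only two suggestion types.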

19. A system, comprising:
- a computer hardware arrangement configured to:
  
  (a) receive at least one original dataset;
  
  (b) receive at least one synthetic dataset related to the at least one original dataset;
  
  (c) train at least one first model using the at least one original dataset;
  
  (d) train at least one second model using the at least one synthetic dataset;
  
  (e) generate a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset;
  
  (f) generate an evaluation score by comparing first results from the training of the first model to second results from the training of the second model, wherein the evaluation score includes (i) a statistical correlation score, (ii) a data similarity score, and (iii) a data quality score;
  
  (g) determine a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset potentially contains the synthetic data that is similar to the original data, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and
  
  (h) generate a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset;
  
  (i) modify the at least one synthetic dataset based on the comparison; and
  
  (j) repeat procedures (d)-(f) until the comparison of the first results to the second results is less than a particular threshold.
- - 20. The system of claim 19, wherein the computer arrangement is configured to compare the first results with the second results using an analysis of variance procedure.
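Claims 4 and 20 refer to comparing the first and second results using an analysis of variance procedure. A minimal sketch of a one-way ANOVA F statistic in plain Python follows. Treating each model's results as a group of per-run error values is an assumption for illustration, as are the example numbers and the function name `one_way_anova_f`; the claims do not specify what the results contain.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more samples than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    # Raises ZeroDivisionError if every group has zero internal variance.
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical per-run errors from each model.
first_results = [0.10, 0.12, 0.11, 0.13]   # model trained on original data
second_results = [0.20, 0.22, 0.21, 0.23]  # model trained on synthetic data

f_statistic = one_way_anova_f(first_results, second_results)
```

A large F statistic indicates that the two groups of results differ by more than their internal spread would explain. Converting F to a p-value requires the F distribution, which is outside the standard library and is omitted here.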

Current Assignee
Capital One Services LLC (Capital One Financial Corporation)
Original Assignee
Capital One Services LLC (Capital One Financial Corporation)
Inventors
Watson, Mark; Abad, Fardin Abdi Taghi; Truong, Anh; Taylor, Kenneth; Farivar, Reza; Goodsitt, Jeremy; Walters, Austin; Pham, Vincent
Primary Examiner(s)
Perveen, Rehana
Assistant Examiner(s)
Gan, Chuen-Meei

Application Number
US16/152,072
Publication Number
US 20200012891A1
Time in Patent Office
572 Days
CPC Class Codes

G06F 11/3608 : using formal methods, e.g. ...

G06F 11/3628 : of optimised code optimisat...

G06F 11/3636 : by tracing the execution of...

G06F 11/3684 : for test design, e.g. gener...

G06F 11/3688 : for test execution, e.g. sc...

G06F 16/215 : Improving data quality; Dat...

G06F 16/2237 : Vectors, bitmaps or matrices

G06F 16/2264 : Multidimensional index stru...

G06F 16/2423 : Interactive query statement...

G06F 16/24568 : Data stream processing; Con...

G06F 16/248 : Presentation of query results

G06F 16/254 : Extract, transform and load...

G06F 16/258 : Data format conversion from...

G06F 16/283 : Multi-dimensional databases...

G06F 16/285 : Clustering or classification

G06F 16/288 : Entity relationship models

G06F 16/335 : Filtering based on addition...

G06F 16/35 : Clustering; Classification

G06F 16/90332 : Natural language query form...

G06F 16/90335 : Query processing

G06F 16/9038 : Presentation of query results

G06F 16/906 : Clustering; Classification

G06F 16/93 : Document management systems

G06F 17/15 : Correlation function comput...

G06F 17/16 : Matrix or vector computatio...

G06F 17/18 : for evaluating statistical ...

G06F 18/2115 : by evaluating different sub...

G06F 18/213 : Feature extraction, e.g. by...

G06F 18/214 : Generating training pattern...

G06F 18/2148 : characterised by the proces...

G06F 18/217 : Validation; Performance eva...

G06F 18/2193 : based on specific statistic...

G06F 18/22 : Matching criteria, e.g. pro...

G06F 18/23 : Clustering techniques

G06F 18/24 : Classification techniques

G06F 18/2411 : based on the proximity to a...

G06F 18/2415 : based on parametric or prob...

G06F 18/285 : Selection of pattern recogn...

G06F 18/40 : Software arrangements speci...

G06F 21/552 : involving long-term monitor...

G06F 21/60 : Protecting data

G06F 21/6245 : Protecting personal data, e...

G06F 21/6254 : by anonymising data, e.g. d...

G06F 30/20 : Design optimisation, verifi...

G06F 40/117 : Tagging; Marking up details...

G06F 40/166 : Editing, e.g. inserting or ...

G06F 40/20 : Natural language analysis s...

G06F 8/71 : Version control security ar...

G06F 9/54 : Interprogram communication

G06F 9/541 : via adapters, e.g. between ...

G06F 9/547 : Remote procedure calls [RPC...

G06N 20/00 : Machine learning

G06N 20/10 : using kernel methods, e.g. ...

G06N 20/20 : Ensemble learning

G06N 3/04 : Architecture, e.g. intercon...

G06N 3/044 : Recurrent networks, e.g. Ho...

G06N 3/045 : Combinations of networks

G06N 3/047 : Probabilistic or stochastic...

G06N 3/06 : Physical realisation, i.e. ...

G06N 3/08 : Learning methods

G06N 3/084 : Backpropagation, e.g. using...

G06N 3/088 : Non-supervised learning, e....

G06N 3/094 : Adversarial learning

G06N 5/00 : Computing arrangements usin...

G06N 5/01 : Dynamic search techniques; ...

G06N 5/02 : Knowledge representation; S...

G06N 5/022 : Knowledge engineering; Know...

G06N 5/04 : Inference or reasoning models

G06N 7/00 : Computing arrangements base...

G06N 7/01 : Probabilistic graphical mod...

G06Q 10/04 : Forecasting or optimisation...

G06T 11/001 : Texturing; Colouring; Gener...

G06T 2207/10016 : Video; Image sequence

G06T 2207/10024 : Color image

G06T 2207/20081 : Training; Learning

G06T 2207/20084 : Artificial neural networks ...

G06T 7/194 : involving foreground-backgr...

G06T 7/246 : using feature-based methods...

G06T 7/248 : involving reference images ...

G06T 7/254 : involving subtraction of im...

G06V 10/768 : using context analysis, e.g...

G06V 10/993 : Evaluation of the quality o...

G06V 30/194 : References adjustable by an...

G06V 30/1985 : Syntactic analysis, e.g. us...

H04L 63/1416 : Event detection, e.g. attac...

H04L 63/1491 : using deception as counterm...

H04L 67/306 : User profiles

H04L 67/34 : involving the movement of s...

H04N 21/23412 : for generating or manipulat...

H04N 21/8153 : comprising still images, e....
