METHOD AND SYSTEM FOR CLEANSING AND DE-DUPLICATING DATA

US 20170308557A1
Filed: 04/14/2017
Published: 10/26/2017
Est. Priority Date: 04/21/2016
Status: Active Grant

Abstract

A method and system for cleansing and de-duplicating data in a database are provided. The method includes filtering garbage records from a plurality of records based on data fields, and applying cleansing rules to create a cleansed database. Similarity vectors are generated, where each vector corresponds to a pairwise comparison of distinct data entries in the cleansed database. Matching rules are applied to label each vector as one of matched, unmatched, and unclassified. The method analyzes the vectors labeled as matched and unmatched to train a machine learning model to identify duplicates in the cleansed database. Unclassified vectors in the cleansed database are then labeled as matched or unmatched by applying the machine learning model to them. Thereafter, the method processes all the vectors labeled as matched to create clusters of records that are duplicates of each other. Finally, the records in each cluster are merged, using predefined consolidated rules, to obtain a de-duplicated cleansed database.
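The labeling and training steps described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the thresholds `MATCH_T` and `UNMATCH_T`, the all-components matching rule, and the nearest-centroid classifier are not taken from the patent, which does not name particular rules or a particular model.

```python
from statistics import mean

# Hypothetical thresholds; the patent does not specify the matching rules.
MATCH_T = 0.9
UNMATCH_T = 0.3

def label_by_rules(vec):
    """Label one similarity vector as matched, unmatched, or unclassified."""
    if all(score >= MATCH_T for score in vec):
        return "matched"
    if all(score <= UNMATCH_T for score in vec):
        return "unmatched"
    return "unclassified"

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(labeled):
    """Fit a nearest-centroid model (an assumed stand-in for the ML model).

    `labeled` is a list of (vector, label) pairs; both classes are assumed
    to contain at least one vector.
    """
    groups = {"matched": [], "unmatched": []}
    for vec, lab in labeled:
        groups[lab].append(vec)
    return {lab: [mean(col) for col in zip(*vecs)] for lab, vecs in groups.items()}

def predict(model, vec):
    """Return the label whose centroid is closest to `vec`."""
    return min(model, key=lambda lab: sq_dist(model[lab], vec))

vectors = [
    [0.95, 1.00, 0.92],
    [0.10, 0.00, 0.25],
    [0.97, 0.93, 1.00],
    [0.05, 0.20, 0.00],
    [0.85, 0.95, 0.60],  # left unclassified by the rules
    [0.40, 0.10, 0.35],  # left unclassified by the rules
]
labels = [label_by_rules(v) for v in vectors]
model = train([(v, l) for v, l in zip(vectors, labels) if l != "unclassified"])
final = [l if l != "unclassified" else predict(model, v)
         for v, l in zip(vectors, labels)]
```

In this sketch the rules settle the clear-cut pairs, and the model trained on those pairs decides only the ambiguous ones, which mirrors the order of steps in the abstract.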

20 Claims

1. A computer-implemented method for cleansing and de-duplicating data in a database, the computer-implemented method comprising:
- filtering garbage records from a plurality of records in the database based on data fields;
  
  creating a cleansed database by applying cleansing rules to remove the garbage records;
  
  generating similarity vectors, wherein each similarity vector corresponds to a pairwise comparison of distinct data entries in the cleansed database;
  
  labeling each vector of the similarity vectors as one of matched vector, unmatched vector and unclassified vector by applying matching rules to find matched data and unmatched data in the cleansed database;
  
  training a machine learning model to identify duplicates in the cleansed database based on analyzing vectors labeled as matched vector and unmatched vectors;
  
  labeling unclassified vector in the cleansed database as matched vector or unmatched vector by applying the machine learning model on the unclassified vectors;
  
  processing all of the vectors labeled as matched vectors to create clusters of records that are duplicates of each other in the cleansed database; and
  
  merging records in each cluster to obtain a de-duplicated cleansed database using one or more predefined consolidated rules.
  - 2. The computer implemented method of claim 1 further comprising extracting the data fields, prior to the step of filtering the garbage records, based on profiling of data sets in the database, wherein the profiling of data sets includes analyzing at least one of statistical properties of data, format of data, quality of data, quantity of data, and data pattern.
  - 3. The computer implemented method of claim 1, wherein creating the cleansed database comprises standardizing data in the data fields in a predefined format.
  - 4. The computer implemented method of claim 3, wherein standardizing the data comprises performing at least one of converting the data to upper case, converting the data to lower case, removing special characters from the data, arranging the data in a predefined pattern, replacing abbreviations with expanded data, adding of data in the data fields, and removing the data from the data fields.
  - 5. The computer implemented method of claim 1, wherein the garbage records are at least one of duplicate records and records that are non-useful for de-duplication.
  - 6. The computer implemented method of claim 1, wherein generating the similarity vectors comprises identifying a score for each component of the similarity vector based on a predefined scoring algorithm.
  - 7. The computer implemented method of claim 1, wherein generating the similarity vectors comprises using a Jaro-Winkler string matching algorithm to generate components of the similarity vector.
  - 8. The computer implemented method of claim 1, wherein merging the records comprises identifying a master record in each cluster of records.
  - 9. The computer implemented method of claim 1 wherein labeling each vector of the similarity vector comprises performing a human assisted analysis of the similarity vector.
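As an illustration of claims 3, 4, 6 and 7, the following Python sketch standardizes field values and then scores each field of a record pair with a Jaro-Winkler similarity, producing one similarity vector per pair. The abbreviation table, the field names, and the helper names `standardize` and `similarity_vector` are hypothetical; the Jaro-Winkler code follows the commonly published definition (prefix scale 0.1, prefix capped at four characters) and is not taken from the patent.

```python
import re

# Hypothetical abbreviation table used only for this example.
ABBREVIATIONS = {"ST": "STREET", "AVE": "AVENUE"}

def standardize(value):
    """Upper-case, strip special characters, and expand abbreviations."""
    value = re.sub(r"[^A-Z0-9 ]", " ", value.upper())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in value.split())

def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(m1, m2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)

def similarity_vector(rec_a, rec_b, fields):
    """One Jaro-Winkler score per field for a pair of records."""
    return [jaro_winkler(standardize(rec_a[f]), standardize(rec_b[f]))
            for f in fields]

a = {"name": "Martha Jones", "address": "12 Main St."}
b = {"name": "martha jones", "address": "12 MAIN STREET"}
vec = similarity_vector(a, b, ["name", "address"])
```

Because both records standardize to the same strings, `vec` contains a score of 1.0 for each field; records that differ after standardization would yield lower component scores.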

10. A system for cleansing and de-duplicating data in a database, the system comprising:
- a memory to store instructions for cleansing and de-duplicating data; and
  
  a processor configured to execute instructions stored in the memory, to cause the system to:
  
  filter garbage records from a plurality of records in the database based on data fields;
  
  create a cleansed database by applying cleansing rules to remove the garbage records;
  
  generate similarity vectors, wherein each similarity vector corresponds to a pairwise comparison of distinct data entries in the cleansed database;
  
  label each vector of the similarity vectors as one of matched vector, unmatched vector and unclassified vector by applying matching rules to find matched data and unmatched data in the cleansed database;
  
  train a machine learning model to identify duplicates in the cleansed database based on analyzing vectors labeled as matched vector and unmatched vectors;
  
  label unclassified vector in the cleansed database as matched vector or unmatched vector by applying the machine learning model on the unclassified vectors;
  
  process all of the vectors labeled as matched vectors to create clusters of records that are duplicates of each other in the cleansed database; and
  
  merge records in each cluster to obtain a de-duplicated cleansed database using one or more predefined consolidated rules.
  - 11. The system of claim 10, wherein the processor is further configured to cause the system to extract the data fields based on profile of data sets in the database, wherein the profile of the data sets includes at least one of statistical properties of data, format of data, quality of data, quantity of data, and data pattern.
  - 12. The system of claim 10, wherein the system is caused to apply the cleansing rules by standardizing data in the data fields in a predefined format.
  - 13. The system of claim 12, wherein the system is caused to standardize the data by performing at least one of converting the data to upper case, converting the data to lower case, removing special characters from the data, arranging the data in a predefined pattern, replacing abbreviations with expanded data, adding of data in the data fields, and removing of data from the data fields.
  - 14. The system of claim 10, wherein the garbage records are at least one of duplicate records and records that are non-useful for de-duplication.
  - 15. The system of claim 10, wherein the processor is further configured to identify a score for each component of the similarity vector based on a predefined scoring algorithm.
  - 16. The system of claim 10, wherein the processor is further configured to use a Jaro-Winkler string matching algorithm to generate components of the similarity vector.
  - 17. The system of claim 10, wherein the processor is further configured to identify a master record in each cluster of records.
  - 18. The system of claim 10, wherein the processor is further configured to perform a human assisted analysis of similarity vector to label a vector as matched or unmatched based on a predefined condition.
  - 19. The system of claim 10, wherein the merged records are stored in an external database.
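The clustering and merging steps, including the master record of claim 17, might be sketched as follows. This is a sketch under stated assumptions: grouping matched pairs into connected components with a union-find structure, choosing the record with the most non-empty fields as the master, and filling the master's gaps from the other records are all illustrative choices, since the claims do not fix a clustering method or the content of the consolidation rules.

```python
from collections import defaultdict

def cluster(record_ids, matched_pairs):
    """Group record ids into connected components of matched pairs."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for r in record_ids:
        groups[find(r)].append(r)
    return sorted(sorted(g) for g in groups.values())

def merge(cluster_records):
    """Pick a master record and fill its empty fields from the others.

    Hypothetical consolidation rule: the master is the record with the
    most non-empty fields (ties go to the earliest record).
    """
    def filled(rec):
        return sum(1 for v in rec.values() if v)

    master = dict(max(cluster_records, key=filled))
    for rec in cluster_records:
        for key, value in rec.items():
            if value and not master.get(key):
                master[key] = value
    return master

records = {
    1: {"name": "ANN LEE", "phone": ""},
    2: {"name": "ANN LEE", "phone": "555-0100"},
    3: {"name": "BOB RAY", "phone": ""},
    4: {"name": "ANN LEE", "phone": "", "email": "ann@example.com"},
}
matched_pairs = [(1, 2), (2, 4)]  # pairs whose vectors were labeled matched

clusters = cluster(list(records), matched_pairs)
deduplicated = [merge([records[i] for i in ids]) for ids in clusters]
```

Records 1, 2 and 4 end up in one cluster even though 1 and 4 were never compared directly, because matches are treated as transitive in this sketch; record 3 stays in a cluster of its own.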

20. A computer program product comprising at least one computer-readable storage medium, the computer-readable storage medium comprising a set of instructions which, when executed by one or more processors of a system, cause the system to:
- filter garbage records from a plurality of records in the database based on data fields;
  
  create a cleansed database by applying cleansing rules to remove the garbage records;
  
  generate similarity vectors, wherein each similarity vector corresponds to a pairwise comparison of distinct data entries in the cleansed database;
  
  label each vector of the similarity vectors as one of matched vector, unmatched vector and unclassified vector by applying matching rules to find matched data and unmatched data in the cleansed database;
  
  train a machine learning model to identify duplicates in the cleansed database based on analyzing vectors labeled as matched vector and unmatched vectors;
  
  label unclassified vector in the cleansed database as matched vector or unmatched vector by applying the machine learning model on the unclassified vectors;
  
  process all of the vectors labeled as matched vectors to create clusters of records that are duplicates of each other in the cleansed database; and
  
  merge records in each cluster to obtain a de-duplicated cleansed database using one or more predefined consolidated rules.
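The first two steps shared by the independent claims, filtering garbage records and creating the cleansed database, could look like the short Python sketch below. It assumes, following claims 5 and 14, that garbage records are either exact duplicates or records that are not useful for de-duplication; the field list `KEY_FIELDS` and the "all key fields empty" test are hypothetical choices for this example.

```python
# Hypothetical data fields used to judge whether a record is usable.
KEY_FIELDS = ("name", "phone")

def normalized_key(record):
    """Trimmed, upper-cased key-field values used to spot exact duplicates."""
    return tuple((record.get(f) or "").strip().upper() for f in KEY_FIELDS)

def is_garbage(record, seen):
    """Return True for records with no usable key data or exact duplicates."""
    key = normalized_key(record)
    if not any(key):      # every key field is empty or missing
        return True
    if key in seen:       # exact duplicate of an earlier record
        return True
    seen.add(key)
    return False

def cleanse(records):
    """Return the records that survive garbage filtering, in order."""
    seen = set()
    return [r for r in records if not is_garbage(r, seen)]

raw = [
    {"name": "Ann Lee", "phone": "555-0100"},
    {"name": "", "phone": " "},                # nothing to compare on
    {"name": "ann lee", "phone": "555-0100"},  # exact duplicate after normalizing
    {"name": "Bob Ray", "phone": None},
]
cleansed = cleanse(raw)
```

Only the first and last records remain in `cleansed`; near-duplicates that differ after normalization are left for the similarity-vector and machine learning steps to resolve.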

Current Assignee: LeanTaaS, Inc.
Original Assignee: Leantaas
Inventors: CASSIDY, Hugh; DeMARCO, Sofia; LAKSHMIKANTHAN, Jayant
Granted Patent: US 10,558,627 B2

CPC Class Codes

G06F 12/0253   Garbage collection, i.e. re...

G06F 16/215   Improving data quality; Dat...

G06F 16/24556   Aggregation; Duplicate elim...

G06N 20/00   Machine learning
