METHOD AND SYSTEM FOR CLEANSING AND DE-DUPLICATING DATA
First Claim
1. A computer-implemented method for cleansing and de-duplicating data in a database, the computer-implemented method comprising:
filtering garbage records from a plurality of records in the database based on data fields;
creating a cleansed database by applying cleansing rules to remove the garbage records;
generating similarity vectors, wherein each similarity vector corresponds to a pairwise comparison of distinct data entries in the cleansed database;
labeling each vector of the similarity vectors as one of matched vector, unmatched vector and unclassified vector by applying matching rules to find matched data and unmatched data in the cleansed database;
training a machine learning model to identify duplicates in the cleansed database based on analyzing vectors labeled as matched vector and unmatched vectors;
labeling unclassified vector in the cleansed database as matched vector or unmatched vector by applying the machine learning model on the unclassified vectors;
processing all of the vectors labeled as matched vectors to create clusters of records that are duplicates of each other in the cleansed database; and
merging records in each cluster to obtain a de-duplicated cleansed database using one or more predefined consolidated rules.
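The first three steps of the claim (filtering, cleansing, and generating similarity vectors) can be illustrated with a minimal sketch. The field names, the set of placeholder values treated as garbage, the normalization rule, and the use of `difflib.SequenceMatcher` as the per-field similarity measure are all assumptions made for illustration; the claim does not specify any of them.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field names and garbage markers; not taken from the claim.
FIELDS = ("name", "email", "city")
GARBAGE_VALUES = {"", "n/a", "test", "xxx"}


def is_garbage(record):
    """Treat a record as garbage when its name is empty or a placeholder."""
    return record.get("name", "").strip().lower() in GARBAGE_VALUES


def cleanse(records):
    """Drop garbage records; lowercase and collapse whitespace in the rest."""
    cleansed = []
    for record in records:
        if is_garbage(record):
            continue
        cleansed.append(
            {f: " ".join(record.get(f, "").lower().split()) for f in FIELDS}
        )
    return cleansed


def similarity_vectors(records):
    """Return {(i, j): vector}: one per-field score for each distinct pair."""
    vectors = {}
    for i, j in combinations(range(len(records)), 2):
        vectors[(i, j)] = tuple(
            SequenceMatcher(None, records[i][f], records[j][f]).ratio()
            for f in FIELDS
        )
    return vectors


raw = [
    {"name": "Acme Corp", "email": "info@acme.com", "city": "Boston"},
    {"name": "ACME  Corp", "email": "info@acme.com", "city": "boston"},
    {"name": "test", "email": "", "city": ""},
    {"name": "Globex", "email": "hello@globex.org", "city": "Austin"},
]
clean = cleanse(raw)
vectors = similarity_vectors(clean)
```

Comparing every pair is quadratic in the number of records, so a practical system would likely restrict comparisons to candidate pairs; that refinement is outside what the claim text states.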
Abstract
A method and system for cleansing and de-duplicating data in a database are provided. The method includes filtering garbage records from a plurality of records based on data fields, and applying cleansing rules to create a cleansed database. Similarity vectors are generated, where each vector corresponds to a pairwise comparison of distinct data entries in the cleansed database. Matching rules are applied to label each vector as matched, unmatched, or unclassified. The method analyzes the vectors labeled as matched and unmatched to train a machine learning model to identify duplicates in the cleansed database. Unclassified vectors are then labeled as matched or unmatched by applying the machine learning model to them. Thereafter, the method processes all the vectors labeled as matched to create clusters of records that are duplicates of each other. Finally, the records in each cluster are merged, using predefined consolidated rules, to obtain a de-duplicated cleansed database.
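The labeling and training steps described in the abstract can be sketched as follows. The thresholds in `label_by_rules` are hypothetical matching rules, and the nearest-centroid classifier is only a simple stand-in: the source does not name a particular machine learning model.

```python
def label_by_rules(vector, high=0.9, low=0.3):
    """Hypothetical rules: all scores high is a match, all low is a non-match."""
    if all(score >= high for score in vector):
        return "matched"
    if all(score <= low for score in vector):
        return "unmatched"
    return "unclassified"


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def train(labeled):
    """Learn one centroid per class from the rule-labeled vectors."""
    by_label = {"matched": [], "unmatched": []}
    for vector, label in labeled:
        if label in by_label:
            by_label[label].append(vector)
    return {label: centroid(vs) for label, vs in by_label.items() if vs}


def predict(model, vector):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(vector, model[label]))
    return min(model, key=distance)


# Illustrative two-field similarity vectors (made-up values).
vectors = [(1.0, 0.95), (0.92, 1.0), (0.1, 0.2), (0.0, 0.3), (0.8, 0.6), (0.4, 0.2)]
labeled = [(v, label_by_rules(v)) for v in vectors]
model = train(labeled)
final = [
    (v, predict(model, v) if label == "unclassified" else label)
    for v, label in labeled
]
```

Here the rules settle the four clear-cut vectors, and the trained model decides the two that the rules left unclassified.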
20 Claims
1. A computer-implemented method for cleansing and de-duplicating data in a database, the computer-implemented method comprising:
filtering garbage records from a plurality of records in the database based on data fields;
creating a cleansed database by applying cleansing rules to remove the garbage records;
generating similarity vectors, wherein each similarity vector corresponds to a pairwise comparison of distinct data entries in the cleansed database;
labeling each vector of the similarity vectors as one of matched vector, unmatched vector and unclassified vector by applying matching rules to find matched data and unmatched data in the cleansed database;
training a machine learning model to identify duplicates in the cleansed database based on analyzing vectors labeled as matched vector and unmatched vectors;
labeling unclassified vector in the cleansed database as matched vector or unmatched vector by applying the machine learning model on the unclassified vectors;
processing all of the vectors labeled as matched vectors to create clusters of records that are duplicates of each other in the cleansed database; and
merging records in each cluster to obtain a de-duplicated cleansed database using one or more predefined consolidated rules.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
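One way to picture the clustering step is to treat each matched vector as an edge between two records and take the connected components. The sketch below does this with a union-find structure; treating matches as transitive, and the function name `cluster_matches`, are assumptions for illustration rather than details given in the claim.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group records into clusters connected by matched pairs (union-find)."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


# Records 1-3 are linked by matches, as are 4 and 5; record 6 has no duplicate.
clusters = cluster_matches([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
```

Note that records 1 and 3 land in the same cluster even though no vector directly labels that pair as matched; whether that behavior is desirable depends on how strict the matching is.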
10. A system for cleansing and de-duplicating data in a database, the system comprising:
a memory to store instructions for cleansing and de-duplicating data; and
a processor configured to execute instructions stored in the memory, to cause the system to:
filter garbage records from a plurality of records in the database based on data fields;
create a cleansed database by applying cleansing rules to remove the garbage records;
generate similarity vectors, wherein each similarity vector corresponds to a pairwise comparison of distinct data entries in the cleansed database;
label each vector of the similarity vectors as one of matched vector, unmatched vector and unclassified vector by applying matching rules to find matched data and unmatched data in the cleansed database;
train a machine learning model to identify duplicates in the cleansed database based on analyzing vectors labeled as matched vector and unmatched vectors;
label unclassified vector in the cleansed database as matched vector or unmatched vector by applying the machine learning model on the unclassified vectors;
process all of the vectors labeled as matched vectors to create clusters of records that are duplicates of each other in the cleansed database; and
merge records in each cluster to obtain a de-duplicated cleansed database using one or more predefined consolidated rules.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19
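The final merge step can be sketched as applying a per-field rule to the values found in a cluster. The rules shown here (keep the longest name, keep the most common value elsewhere) and the field names are hypothetical examples of consolidated rules; the claim only requires that one or more predefined rules be used.

```python
from collections import Counter


def longest(values):
    """Keep the most detailed (longest) value."""
    return max(values, key=len)


def most_common(values):
    """Keep the value that most records in the cluster agree on."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical per-field consolidation rules.
RULES = {"name": longest, "city": most_common}


def merge_cluster(records, rules=RULES, default=most_common):
    """Merge duplicate records into one, ignoring empty field values."""
    merged = {}
    for field in sorted({f for record in records for f in record}):
        values = [r[field] for r in records if r.get(field)]
        if values:
            merged[field] = rules.get(field, default)(values)
    return merged


cluster = [
    {"name": "Acme", "city": "Boston", "phone": ""},
    {"name": "Acme Corporation", "city": "Boston", "phone": "555-0100"},
    {"name": "Acme Corp", "city": "Bostn", "phone": "555-0100"},
]
golden = merge_cluster(cluster)
```

Running `merge_cluster` once per cluster and keeping unclustered records as they are would yield the de-duplicated data set.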
20. A computer program product comprising at least one computer-readable storage medium, the computer-readable storage medium comprising a set of instructions which, when executed by one or more processors of a system, cause the system to:
filter garbage records from a plurality of records in the database based on data fields;
create a cleansed database by applying cleansing rules to remove the garbage records;
generate similarity vectors, wherein each similarity vector corresponds to a pairwise comparison of distinct data entries in the cleansed database;
label each vector of the similarity vectors as one of matched vector, unmatched vector and unclassified vector by applying matching rules to find matched data and unmatched data in the cleansed database;
train a machine learning model to identify duplicates in the cleansed database based on analyzing vectors labeled as matched vector and unmatched vectors;
label unclassified vector in the cleansed database as matched vector or unmatched vector by applying the machine learning model on the unclassified vectors;
process all of the vectors labeled as matched vectors to create clusters of records that are duplicates of each other in the cleansed database; and
merge records in each cluster to obtain a de-duplicated cleansed database using one or more predefined consolidated rules.
Specification