METHOD AND SYSTEM FOR CLEANSING AND DE-DUPLICATING DATA

  • US 20170308557A1
  • Filed: 04/14/2017
  • Published: 10/26/2017
  • Est. Priority Date: 04/21/2016
  • Status: Active Grant
First Claim

1. A computer-implemented method for cleansing and de-duplicating data in a database, the computer-implemented method comprising:

    filtering garbage records from a plurality of records in the database based on data fields;

    creating a cleansed database by applying cleansing rules to remove the garbage records;

    generating similarity vectors, wherein each similarity vector corresponds to a pairwise comparison of distinct data entries in the cleansed database;

    labeling each vector of the similarity vectors as one of matched vector, unmatched vector and unclassified vector by applying matching rules to find matched data and unmatched data in the cleansed database;

    training a machine learning model to identify duplicates in the cleansed database based on analyzing vectors labeled as matched vectors and unmatched vectors;

    labeling each unclassified vector in the cleansed database as a matched vector or an unmatched vector by applying the machine learning model to the unclassified vectors;

    processing all of the vectors labeled as matched vectors to create clusters of records that are duplicates of each other in the cleansed database; and

    merging records in each cluster to obtain a de-duplicated cleansed database using one or more predefined consolidated rules.
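
The steps recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation. The field names in FIELDS, the garbage rule, the 0.9 and 0.4 thresholds, the nearest-centroid classifier standing in for the machine learning model, and the "keep the longest value" consolidation rule are all assumptions chosen to keep the example short and dependent only on the standard library.

```python
from difflib import SequenceMatcher
from itertools import combinations

FIELDS = ("name", "email", "city")  # hypothetical schema, not from the claim


def is_garbage(rec):
    # Assumed cleansing rule: any missing or blank field marks a garbage record.
    return any(not str(rec.get(f, "")).strip() for f in FIELDS)


def similarity_vector(a, b):
    # One string-similarity score in [0, 1] per field for a pair of records.
    return tuple(
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in FIELDS
    )


def rule_label(vec):
    # Assumed matching rules; the thresholds are illustrative only.
    if min(vec) >= 0.9:
        return "matched"
    if max(vec) <= 0.4:
        return "unmatched"
    return "unclassified"


def centroid(vectors):
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def sqdist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def dedupe(records):
    # Steps 1-2: filter garbage records to obtain the cleansed data.
    clean = [r for r in records if not is_garbage(r)]

    # Step 3: one similarity vector per pair of distinct records.
    vectors = {
        (i, j): similarity_vector(clean[i], clean[j])
        for i, j in combinations(range(len(clean)), 2)
    }

    # Step 4: rule-based labeling into matched / unmatched / unclassified.
    labels = {pair: rule_label(vec) for pair, vec in vectors.items()}

    # Steps 5-6: "train" a nearest-centroid classifier (a stand-in model) on
    # the rule-labeled vectors, then use it to label the unclassified ones.
    matched = [vectors[p] for p, l in labels.items() if l == "matched"]
    unmatched = [vectors[p] for p, l in labels.items() if l == "unmatched"]
    if matched and unmatched:
        c_match, c_unmatch = centroid(matched), centroid(unmatched)
        for pair, label in list(labels.items()):
            if label == "unclassified":
                vec = vectors[pair]
                closer_to_match = sqdist(vec, c_match) <= sqdist(vec, c_unmatch)
                labels[pair] = "matched" if closer_to_match else "unmatched"

    # Step 7: cluster records connected by matched vectors (union-find).
    parent = list(range(len(clean)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (i, j), label in labels.items():
        if label == "matched":
            parent[find(i)] = find(j)

    clusters = {}
    for i, rec in enumerate(clean):
        clusters.setdefault(find(i), []).append(rec)

    # Step 8: assumed consolidation rule: keep the longest value per field.
    return [
        {f: max((r[f] for r in group), key=len) for f in FIELDS}
        for group in clusters.values()
    ]
```

Calling dedupe on a list of dictionaries keyed by FIELDS returns one consolidated record per cluster. Pairs that stay unclassified (which happens when the rules yield no matched or no unmatched examples to train on) are treated as non-duplicates in this sketch.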