
Detecting duplicate records in database

  • US 6,961,721 B2
  • Filed: 06/28/2002
  • Issued: 11/01/2005
  • Est. Priority Date: 06/28/2002
  • Status: Active Grant
First Claim

1. In a database having records stored on a computer readable medium, a computer implemented method for identifying possible duplicate data records comprising:

    a) providing multiple records in one or more tables that include multiple fields; and

    b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table;

    wherein measuring co-occurrence is performed by:

    a) identifying a candidate set of records having a first record field from records in the table or tables; and

    b) determining a commonality between records having first record fields based on tokens in a set of one or more child record fields related to the first record fields to identify possible duplicate records from the candidate set of records, which predicts a record v1 is a possible duplicate of another record v2 if a containment metric of tokens from a field of v1 in the record v2 is greater than or equal to a threshold value;

    wherein a) a textual similarity between tokens in said first field is compared in determining a token containment metric;

    b) a textual similarity between tokens in said child record fields is compared in determining a foreign key containment metric; and

    c) both the token containment metric and the foreign key containment metric are combined in identifying possible duplicate records in the candidate set of records; and

    wherein a choice is made between a test of the token containment metric and a test of the foreign key containment metric based on an information content of the tokens used to determine said token containment metric and foreign key containment metric.
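
The containment test recited in the claim can be illustrated with a minimal sketch. The claim does not give formulas, so everything below is an assumption made for illustration: tokens are matched exactly rather than by textual similarity, information content is approximated by an IDF-style weight log(N / document frequency), and the choice between the two tests simply picks whichever field of v1 carries more total weight. The names tokenize, idf_weights, info, containment and is_possible_duplicate, the sample data, and the 0.8 threshold are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase, whitespace-separated tokens.
    return set(text.lower().split())


def idf_weights(token_sets):
    # Assumed information-content measure: log(N / number of records with the token).
    n = len(token_sets)
    df = Counter(tok for toks in token_sets for tok in toks)
    return {tok: math.log(n / count) for tok, count in df.items()}


def info(tokens, weights):
    # Total weight (information content) of a token set.
    return sum(weights.get(t, 0.0) for t in tokens)


def containment(tokens_v1, tokens_v2, weights):
    # Weighted fraction of v1's tokens that also occur in v2 (directional).
    total = info(tokens_v1, weights)
    if total == 0.0:
        return 0.0
    return info(tokens_v1 & tokens_v2, weights) / total


def is_possible_duplicate(v1, v2, name_w, child_w, threshold=0.8):
    # Assumed selection rule: test whichever field of v1 is more informative.
    if info(v1["name"], name_w) >= info(v1["children"], child_w):
        # Token containment metric on the first (parent) field.
        return containment(v1["name"], v2["name"], name_w) >= threshold
    # Foreign key containment metric on the child-field tokens.
    return containment(v1["children"], v2["children"], child_w) >= threshold


# Hypothetical two-level hierarchy: a country name and the names of its child records.
raw = [
    ("United States", "Washington Oregon Texas Utah"),
    ("USA", "Washington Oregon Texas Utah Ohio"),
    ("Canada", "Ontario Quebec Alberta"),
    ("United Kingdom", "England Scotland Wales"),
]
records = [{"name": tokenize(n), "children": tokenize(c)} for n, c in raw]
name_w = idf_weights([r["name"] for r in records])
child_w = idf_weights([r["children"] for r in records])

print(is_possible_duplicate(records[0], records[1], name_w, child_w))  # True
print(is_possible_duplicate(records[0], records[2], name_w, child_w))  # False
```

In this sample the names "United States" and "USA" share no tokens, so the token containment metric alone would miss the pair; the shared child records are what flag it. The metric is directional: containment of v1 in v2 is not, in general, equal to containment of v2 in v1.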
