Robust system for interactively learning a record similarity measurement

US 20040181526A1
Filed: 03/11/2003
Published: 09/16/2004
Est. Priority Date: 03/11/2003
Status: Abandoned Application

Abstract

A system learns a record similarity measurement. The system includes a set of record clusters. Each record in each cluster may have a list of fields and data contained in each field. The system may further include a predetermined threshold score for two of the records in one of the clusters to be considered similar and at least one decision tree constructed from a portion of the set of clusters. The decision tree encodes rules for determining a field similarity score of a related set of fields. The system may further include an output set of record pairs that are determined to be duplicate records. The output set of record pairs may have a record similarity score greater than or equal to the predetermined threshold score.
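The scoring scheme in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how field similarity scores are combined into a record similarity score, so the weighted average, the `difflib`-based field score (a stand-in for the decision-tree rules), and all function names below are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def field_similarity(a: str, b: str) -> float:
    """Stand-in field score in [0, 1]; the source does not specify this measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Combine field scores into one record score (assumed: weighted average)."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total


def duplicate_pairs(cluster: list, weights: dict, threshold: float) -> list:
    """Return index pairs within one cluster whose record score meets the threshold."""
    return [
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(cluster), 2)
        if record_similarity(r1, r2, weights) >= threshold
    ]


cluster = [
    {"name": "John Smith", "city": "Boston"},
    {"name": "John Smith", "city": "Boston"},
    {"name": "Xyz", "city": "Qqq"},
]
weights = {"name": 2.0, "city": 1.0}
pairs = duplicate_pairs(cluster, weights, threshold=0.9)
```

Here the first two records score 1.0 against each other and are reported as a duplicate pair, while the third record shares no characters with them and falls below the threshold.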

19 Claims

1. A system for learning a record similarity measurement, said system comprising:
- a set of record clusters, each record in each cluster having a list of fields and data contained in each said field;
  
  a predetermined threshold score for two of said records in one of said clusters to be considered similar;
  
  at least one decision tree constructed from a predetermined portion of said set of clusters, said decision tree encoding rules for determining a field similarity score of a related set of said fields; and
  
  a set of record pairs that may be determined to be duplicate records, said set of record pairs each having a record similarity score determined by said field similarity scores, said record pairs having a record similarity score greater than or equal to said predetermined threshold score being determined to be duplicate records.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The system as set forth in claim 1 further including a select group of record pairs that are used to interactively determine the accuracy of said at least one decision tree.
  - 3. The system as set forth in claim 2 wherein said select group of record pairs are outputted to a user for interactively determining the accuracy of said at least one decision tree.
  - 4. The system as set forth in claim 3 wherein said similarity scores are modified by the user subsequent to the user reviewing said select group of record pairs.
  - 5. The system as set forth in claim 4 wherein said system outputs a record similarity function improved by the input of the user.
  - 6. The system as set forth in claim 5 wherein said system comprises part of a matching step in a data cleansing application.
  - 7. The system as set forth in claim 1 wherein a record in at least one said record cluster has no record similarity score greater than or equal to said predetermined threshold score, said one record having data pertaining to an entity other than the other records in said record cluster.
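The interactive review described in claims 2 through 5 might look roughly like the sketch below. The claims do not state how the select group of record pairs is chosen or how the user's changes are stored, so picking the pairs closest to the threshold and overwriting scores from a corrections mapping are assumptions made for illustration; `select_for_review` and `apply_corrections` are hypothetical names.

```python
def select_for_review(scored_pairs: dict, threshold: float, k: int) -> list:
    """Pick the k pairs whose scores lie closest to the threshold.

    Assumed heuristic: borderline pairs are the most informative to show a user.
    """
    return sorted(scored_pairs, key=lambda p: abs(scored_pairs[p] - threshold))[:k]


def apply_corrections(scored_pairs: dict, corrections: dict) -> dict:
    """Return a copy of the scores with the user's modified scores applied."""
    updated = dict(scored_pairs)
    updated.update(corrections)
    return updated


scores = {("a", "b"): 0.95, ("a", "c"): 0.81, ("b", "c"): 0.20}
to_review = select_for_review(scores, threshold=0.8, k=1)
# Suppose the user decides the borderline pair is not a duplicate.
revised = apply_corrections(scores, {("a", "c"): 0.10})
```

Returning a new dictionary instead of mutating the input keeps the pre-review scores available for comparison with the revised ones.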

8. A method for learning a record similarity measurement, said method comprising the steps of:
- providing a set of record clusters, each record in each cluster having a list of fields and data contained in each field;
  
  providing a predetermined threshold score for two of the records in one of the clusters to be considered similar;
  
  providing at least one decision tree constructed from a portion of the set of clusters, the decision tree encoding rules for determining a field similarity score of a related set of fields;
  
  determining a record similarity score from the field similarity scores; and
  
  outputting a set of record pairs that are determined to be duplicate records, the output set of record pairs having a record similarity score greater than or equal to the predetermined threshold score.
- Dependent claims (9, 10, 11, 12, 13)
  - 9. The method as set forth in claim 8 further including the step of selecting a group of record pairs that are used to interactively determine the accuracy of the at least one decision tree.
  - 10. The method as set forth in claim 8 further including the step of outputting the selected group of record pairs to a user for interactively determining the accuracy of the at least one decision tree.
  - 11. The method as set forth in claim 8 further including the step of modifying the field similarity scores by the user subsequent to the user reviewing the selected group of record pairs.
  - 12. The method as set forth in claim 8 further including the step of outputting a record similarity function improved by the input from the user.
  - 13. The method as set forth in claim 8 wherein said method is conducted as part of a matching step in a data cleansing application.
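A decision tree that encodes rules for a field similarity score, as recited in claim 8, could be represented along these lines. This is a hand-written toy tree, not one constructed from record clusters as the claim describes; the node classes, the specific tests, and the leaf scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    score: float  # field similarity score assigned at this leaf


@dataclass
class Node:
    test: Callable[[str, str], bool]  # rule applied to the two field values
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def evaluate(tree: Union[Node, Leaf], a: str, b: str) -> float:
    """Walk the tree for one pair of field values and return the leaf score."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.test(a, b) else tree.if_false
    return tree.score


def same_initial_and_last_token(a: str, b: str) -> bool:
    """Illustrative rule: same first letter and same final word."""
    return a[:1].lower() == b[:1].lower() and a.split()[-1:] == b.split()[-1:]


# Toy rules for a "name" field (illustrative values only).
name_tree = Node(
    test=lambda a, b: a.lower() == b.lower(),
    if_true=Leaf(1.0),
    if_false=Node(
        test=same_initial_and_last_token,
        if_true=Leaf(0.8),
        if_false=Leaf(0.0),
    ),
)

exact = evaluate(name_tree, "John Smith", "john smith")
abbreviated = evaluate(name_tree, "J. Smith", "John Smith")
unrelated = evaluate(name_tree, "Ann Lee", "Bob Ray")
```

A tree in this form can be read rule by rule, which fits the review and correction steps in claims 9 through 12.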

14. A computer program product for interactively learning a record similarity measurement, said product comprising:
- an input set of record clusters, each record in each cluster having a list of fields and data contained in each field;
  
  a predetermined input threshold score for two of the records in one of the clusters to be considered similar;
  
  an input decision tree constructed from a portion of the set of clusters, the decision tree encoding rules for determining a field similarity score of a related set of fields;
  
  an output set of record pairs that are determined to be duplicate records, the output set of record pairs having a record similarity score greater than or equal to the predetermined threshold score; and
  
  a set of record pairs determined to be non-duplicate records.
- Dependent claims (15, 16, 17, 18, 19)
  - 15. The computer program product as set forth in claim 14 further including a selected group of record pairs that are used to determine the accuracy of the decision tree.
  - 16. The computer program product as set forth in claim 15 wherein the selected group of record pairs are outputted to a user for determining the accuracy of the decision tree.
  - 17. The computer program product as set forth in claim 16 wherein the record similarity score is modified by the user subsequent to the user reviewing the selected group of record pairs.
  - 18. The computer program product as set forth in claim 17 wherein said computer program product outputs a record similarity function improved by the input from the user.
  - 19. The computer program product as set forth in claim 18 wherein said computer program product comprises part of a matching step in a data cleansing application.
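Claim 14 recites both an output set of duplicate pairs and a set of pairs determined to be non-duplicates. Assuming record similarity scores have already been computed for each candidate pair, the split can be sketched as below; `partition_pairs` and the sample scores are hypothetical.

```python
def partition_pairs(scored_pairs: dict, threshold: float) -> tuple:
    """Split pairs into duplicates (score >= threshold) and non-duplicates."""
    duplicates, non_duplicates = [], []
    for pair, score in scored_pairs.items():
        if score >= threshold:
            duplicates.append(pair)
        else:
            non_duplicates.append(pair)
    return duplicates, non_duplicates


scores = {("a", "b"): 0.95, ("a", "c"): 0.80, ("b", "c"): 0.20}
dups, non_dups = partition_pairs(scores, threshold=0.8)
```

The comparison is inclusive, matching the "greater than or equal to" wording in the claims, so a pair scoring exactly at the threshold counts as a duplicate.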

Current Assignee: Lockheed Martin Corporation (Martin Marietta Corporation)
Original Assignee: Lockheed Martin Corporation (Martin Marietta Corporation)
Inventors: Szczerba, Robert J.; Burdick, Douglas R.
Application Number: US10/385,828
Publication Number: US 20040181526A1
US Class (Current): 707/6
CPC Class Codes: G06F 16/285 (Clustering or classification)
