Robust system for interactively learning a string similarity measurement

US 20040181527A1
Filed: 03/11/2003
Published: 09/16/2004
Est. Priority Date: 03/11/2003
Status: Abandoned Application

Abstract

A system learns a string similarity measurement. The system includes a set of record clusters. Each record in each cluster has a list of fields and data contained in each field. The system further includes a set of initial weights for determining edit distance measurements and an initial field similarity function for assigning a field similarity score to each pair of field values in each cluster. The set of initial weights and the field similarity function are modified by user feedback to produce an optimal set of edit-distance weights and an optimal field similarity function.
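The abstract names two building blocks: a set of weights for edit-distance measurements and a field similarity function that scores a pair of field values. The listing does not give their concrete form, so the following is only a minimal sketch under stated assumptions: `EditWeights`, `weighted_edit_distance`, and `field_similarity` are hypothetical names, the weights are assumed to be one cost per operation type (insert, delete, substitute), and the normalization to a score in [0, 1] is an illustrative choice rather than the applicants' method.

```python
from dataclasses import dataclass


@dataclass
class EditWeights:
    """Hypothetical per-operation costs; the source does not specify their form."""
    insert: float = 1.0
    delete: float = 1.0
    substitute: float = 1.0


def weighted_edit_distance(a: str, b: str, w: EditWeights) -> float:
    """Dynamic-programming edit distance with a separate cost per operation."""
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * w.insert for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * w.delete]
        for j, cb in enumerate(b, start=1):
            sub = prev[j - 1] + (0.0 if ca == cb else w.substitute)
            curr.append(min(prev[j] + w.delete, curr[j - 1] + w.insert, sub))
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str, w: EditWeights) -> float:
    """Map the distance to a score in [0, 1]; 1.0 means identical values.

    The normalization below is an assumption made for illustration.
    """
    worst = max(len(a), len(b)) * max(w.insert, w.delete, w.substitute)
    if worst == 0:
        return 1.0  # both strings empty, or all costs are zero
    return max(0.0, 1.0 - weighted_edit_distance(a, b, w) / worst)
```

With the default unit costs, `weighted_edit_distance("kitten", "sitting", EditWeights())` evaluates to `3.0`. Lowering one cost, for example `EditWeights(substitute=0.5)`, is the kind of adjustment that user feedback could drive in a system of this sort.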

18 Claims

1. A system for learning a string similarity measurement, said system comprising:
- a set of record clusters, each record in each cluster having a list of fields and data contained in each said field;
  
  a set of initial weights for determining edit-distance measurements;
  
  an initial field similarity function for assigning a field similarity score to each pair of field values in each cluster;
  
  said set of initial weights and said field similarity function being modified by user feedback to produce an optimal set of edit-distance weights and an optimal field similarity function.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The system as set forth in claim 1 further including a select group of record pairs that are used to interactively determine said optimal set of edit-distance weights.
  - 3. The system as set forth in claim 2 wherein said select group of record pairs are outputted to a user for interactively determining said optimal set of edit-distance weights.
  - 4. The system as set forth in claim 3 wherein said initial field similarity function is modified by the user subsequent to the user reviewing said select group of record pairs.
  - 5. The system as set forth in claim 4 wherein said system outputs a record similarity function improved by the input of the user.
  - 6. The system as set forth in claim 5 wherein said system comprises part of a matching step in a data cleansing application.
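Claims 2 and 3 describe a select group of record pairs that is shown to the user, but the listing does not say how that group is chosen. One possible reading, offered only as a sketch, is to pick the within-cluster pairs whose current score is least decisive. Everything here is an assumption: the function names, the dictionary record layout, the `threshold` and `k` parameters, and the use of `difflib.SequenceMatcher` as a stand-in scorer.

```python
from difflib import SequenceMatcher
from itertools import combinations


def placeholder_similarity(a: str, b: str) -> float:
    """Stand-in scorer (difflib ratio); not the similarity function of the claims."""
    return SequenceMatcher(None, a, b).ratio()


def select_pairs_for_review(clusters, field, k=2, threshold=0.5,
                            score=placeholder_similarity):
    """Return the k within-cluster pairs whose scores lie closest to threshold.

    clusters: list of clusters, each a list of records (dicts of field -> value).
    The closest-to-threshold rule is an assumed heuristic, not from the source.
    """
    scored = []
    for cluster in clusters:
        for r1, r2 in combinations(cluster, 2):
            s = score(r1[field], r2[field])
            scored.append((abs(s - threshold), s, r1, r2))
    scored.sort(key=lambda t: t[0])  # most ambiguous pairs first
    return [(r1, r2, s) for _, s, r1, r2 in scored[:k]]
```

A caller would pass the clusters and a field name, for example `select_pairs_for_review(clusters, "name", k=5)`, and present the returned pairs to the reviewer.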

7. A method for learning a string similarity measurement, said method comprising the steps of:
- providing a set of record clusters, each record in each cluster having a list of fields and data contained in each field;
  
  providing a set of initial weights for determining edit-distance measurements;
  
  providing an initial field similarity function for assigning a field similarity score to each pair of field values in each cluster;
  
  modifying the set of initial weights and the field similarity function by user feedback to produce an optimal set of edit-distance weights and an optimal field similarity function.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The method as set forth in claim 7 further including the step of selecting a group of record pairs that are used to interactively determine the optimal field similarity function.
  - 9. The method as set forth in claim 7 further including the step of outputting the selected group of record pairs to a user for interactively determining the optimal field similarity function.
  - 10. The method as set forth in claim 7 further including the step of modifying the initial field similarity function by the user subsequent to the user reviewing the selected group of record pairs.
  - 11. The method as set forth in claim 7 further including the step of outputting a record similarity function improved by the input from the user.
  - 12. The method as set forth in claim 7 wherein said method is conducted as part of a matching step in a data cleansing application.
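The method claims end with a modifying step driven by user feedback and, in claim 11, a record similarity function improved by that input. The listing gives no update rule, so the sketch below is purely illustrative: it assumes the record score is a weighted average of per-field scores and that a match or non-match label from the user nudges each field's weight up or down. The names `record_similarity` and `apply_feedback`, the `rate` parameter, and the multiplicative update are all hypothetical.

```python
def record_similarity(field_scores, field_weights):
    """Weighted average of per-field similarity scores (assumed combination rule)."""
    total = sum(field_weights.values())
    return sum(field_weights[f] * s for f, s in field_scores.items()) / total


def apply_feedback(field_weights, field_scores, is_match, rate=0.1):
    """Return new weights, raised for fields that agree with the user's label.

    is_match: the user's judgment that the two records refer to the same entity.
    The update rule is an illustrative assumption, not taken from the source.
    """
    target = 1.0 if is_match else 0.0
    updated = {}
    for f, w in field_weights.items():
        agreement = 1.0 - abs(target - field_scores[f])  # 1.0 = field agrees fully
        updated[f] = max(1e-6, w * (1.0 + rate * (agreement - 0.5)))
    return updated
```

Repeating `apply_feedback` over the reviewed pairs and then calling `record_similarity` with the updated weights gives one simple way such a loop could be wired together.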

13. A computer program product for interactively learning a string similarity measurement, said product comprising:
- an input set of record clusters, each record in each cluster having a list of fields and data contained in each field;
  
  a set of initial weights for determining edit-distance measurements;
  
  an initial field similarity function for assigning a field similarity score to each pair of field values in each cluster;
  
  said set of initial weights and said field similarity function being modified by user feedback to produce an optimal set of edit-distance weights and an optimal field similarity function.
- Dependent claims (14, 15, 16, 17, 18):
  - 14. The computer program product as set forth in claim 13 further including a selected group of record pairs that are used to determine said optimal set of edit-distance weights and said optimal field similarity function.
  - 15. The computer program product as set forth in claim 14 wherein the selected group of record pairs are outputted to a user for determining said optimal set of edit-distance weights and said optimal field similarity function.
  - 16. The computer program product as set forth in claim 15 wherein a record similarity score is modified by the user subsequent to the user reviewing the selected group of record pairs.
  - 17. The computer program product as set forth in claim 16 wherein said computer program product outputs a record similarity function improved by the input from the user.
  - 18. The computer program product as set forth in claim 17 wherein said computer program product comprises part of a matching step in a data cleansing application.

Current Assignee: Lockheed Martin Corporation (Martin Marietta Corporation)
Original Assignee: Lockheed Martin Corporation (Martin Marietta Corporation)
Inventors: Szczerba, Robert J.; Burdick, Douglas R.
Application Number: US10/385,897
Publication Number: US 20040181527A1
Current US Class: 707/6
CPC Class Codes: G06F 16/285 Clustering or classification
