Robust system for interactively learning a record similarity measurement
Abstract
A system learns a record similarity measurement. The system includes a set of record clusters. Each record in each cluster may have a list of fields and data contained in each field. The system may further include a predetermined threshold score for two of the records in one of the clusters to be considered similar and at least one decision tree constructed from a portion of the set of clusters. The decision tree encodes rules for determining a field similarity score of a related set of fields. The system may further include an output set of record pairs that are determined to be duplicate records. The output set of record pairs may have a record similarity score greater than or equal to the predetermined threshold score.
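As an illustration of a decision tree that "encodes rules for determining a field similarity score of a related set of fields", the following is a minimal sketch only. The field names (`first_name`, `last_name`), the rules, and the leaf scores are hypothetical and hand-written here; the abstract states that the tree is constructed from a portion of the record clusters, and that construction step is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Union

Record = dict[str, str]


@dataclass
class Leaf:
    score: float  # field similarity score for the related set of fields


@dataclass
class Node:
    test: Callable[[Record, Record], bool]  # rule evaluated on a record pair
    if_true: "Tree"
    if_false: "Tree"


Tree = Union[Leaf, Node]


def field_similarity(tree: Tree, a: Record, b: Record) -> float:
    """Follow the rules from the root until a leaf is reached; return its score."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.test(a, b) else tree.if_false
    return tree.score


def same(field: str) -> Callable[[Record, Record], bool]:
    """Rule: the field matches after trimming whitespace and ignoring case."""
    return lambda a, b: a[field].strip().lower() == b[field].strip().lower()


def same_initial(field: str) -> Callable[[Record, Record], bool]:
    """Rule: the field starts with the same letter, ignoring case."""
    return lambda a, b: a[field][:1].lower() == b[field][:1].lower()


# Hypothetical tree for the related fields (first_name, last_name).
NAME_TREE = Node(
    test=same("last_name"),
    if_true=Node(
        test=same("first_name"),
        if_true=Leaf(1.0),
        if_false=Node(
            test=same_initial("first_name"),
            if_true=Leaf(0.8),
            if_false=Leaf(0.4),
        ),
    ),
    if_false=Leaf(0.0),
)

a = {"first_name": "Jon", "last_name": "Smith"}
b = {"first_name": "Jonathan", "last_name": "smith "}
name_score = field_similarity(NAME_TREE, a, b)
```

In this sketch each leaf holds the score for one combination of rule outcomes, so a single tree yields one field similarity score for its group of related fields.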
19 Claims
1. A system for learning a record similarity measurement, said system comprising:
a set of record clusters, each record in each cluster having a list of fields and data contained in each said field;
a predetermined threshold score for two of said records in one of said clusters to be considered similar;
at least one decision tree constructed from a predetermined portion of said set of clusters, said decision tree encoding rules for determining a field similarity score of a related set of said fields; and
a set of record pairs that may be determined to be duplicate records, said set of record pairs each having a record similarity score determined by said field similarity scores, said record pairs having a record similarity score greater than or equal to said predetermined threshold score being determined to be duplicate records.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method for learning a record similarity measurement, said method comprising the steps of:
providing a set of record clusters, each record in each cluster having a list of fields and data contained in each field;
providing a predetermined threshold score for two of the records in one of the clusters to be considered similar;
providing at least one decision tree constructed from a portion of the set of clusters, the decision tree encoding rules for determining a field similarity score of a related set of fields;
determining a record similarity score from the field similarity scores; and
outputting a set of record pairs that are determined to be duplicate records, the output set of record pairs having a record similarity score greater than or equal to the predetermined threshold score.
Dependent claims: 9, 10, 11, 12, 13
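The steps of the method can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the claim does not say how field similarity scores are combined into a record similarity score, so a weighted average is assumed here; the field scorers are plain functions standing in for the decision trees; and the record layout, weights, and threshold value are hypothetical.

```python
from itertools import combinations


def record_similarity(a, b, field_scorers, weights):
    """Combine field similarity scores into one record similarity score.

    Assumption: a weighted average. The claim only requires that the record
    score be determined from the field scores.
    """
    total = sum(weights[group] for group in field_scorers)
    weighted = sum(
        weights[group] * scorer(a, b) for group, scorer in field_scorers.items()
    )
    return weighted / total


def find_duplicates(clusters, field_scorers, weights, threshold):
    """Return (id_a, id_b, score) for every pair in a cluster with score >= threshold."""
    duplicates = []
    for cluster in clusters:
        # Only records that share a cluster are compared with each other.
        for a, b in combinations(cluster, 2):
            score = record_similarity(a, b, field_scorers, weights)
            if score >= threshold:
                duplicates.append((a["id"], b["id"], score))
    return duplicates


def exact(field):
    """Stand-in field scorer: 1.0 on a case-insensitive match, else 0.0."""
    return lambda a, b: 1.0 if a[field].lower() == b[field].lower() else 0.0


# Hypothetical input: one cluster of three records.
clusters = [[
    {"id": 1, "name": "Ann Lee", "city": "Akron"},
    {"id": 2, "name": "ann lee", "city": "Akron"},
    {"id": 3, "name": "Bob Ray", "city": "Akron"},
]]
scorers = {"name": exact("name"), "city": exact("city")}
weights = {"name": 3.0, "city": 1.0}

pairs = find_duplicates(clusters, scorers, weights, threshold=0.75)
```

With these hypothetical inputs, records 1 and 2 agree on both fields and meet the threshold, while each pairing with record 3 agrees only on city and falls below it.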
14. A computer program product for interactively learning a record similarity measurement, said product comprising:
an input set of record clusters, each record in each cluster having a list of fields and data contained in each field;
a predetermined input threshold score for two of the records in one of the clusters to be considered similar;
an input decision tree constructed from a portion of the set of clusters, the decision tree encoding rules for determining a field similarity score of a related set of fields;
an output set of record pairs that are determined to be duplicate records, the output set of record pairs having a record similarity score greater than or equal to the predetermined threshold score; and
a set of record pairs determined to be non-duplicate records.
Dependent claims: 15, 16, 17, 18, 19
Specification