Duplicate data elimination system
Abstract
A process for finding similar data records from a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.
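The scoring and grouping steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not prescribe a similarity formula or a grouping algorithm, so the sketch assumes `difflib.SequenceMatcher` as a stand-in score and treats connected components of the thresholded graph as the groups. The function names, sample records, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; the abstract does not prescribe a formula."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_edges(records, threshold):
    """Return {(i, j): score} for record pairs scoring above the threshold."""
    edges = {}
    for i, j in combinations(range(len(records)), 2):
        score = similarity(records[i], records[j])
        if score > threshold:
            edges[(i, j)] = score
    return edges


def group_records(n, edges):
    """Group node indices into connected components using union-find.

    Assumption: records linked by a chain of above-threshold edges
    belong to the same group.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical sample data: records 0 and 1 are near-duplicates.
records = [
    "Microsoft Corp, Redmond WA",
    "Microsoft Corporation, Redmond WA",
    "Boeing Company, Seattle WA",
]
edges = build_edges(records, threshold=0.8)
groups = group_records(len(records), edges)
print(groups)
```

Here the records are the graph nodes and each retained edge carries its similarity score, so a later step can inspect the scores inside each group when choosing a canonical record.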
21 Claims
1. A process for finding similar data records from a set of data records comprising:
providing a number of data records from which one or more canonical data records are identified;
determining a similarity score for data records based on the contents of the records;
grouping together data records whose similarity score with respect to each other is greater than a threshold to form one or more groups of data records that form nodes of a graph wherein edges between nodes represent a similarity score between records of a group; and
within each said group, identifying a canonical record based on the similarity of data records to each other within the group.
7. A process for finding similar data records from a set of data records comprising:
providing a number of data records from which one or more canonical data records are identified;
identifying tokens contained within the data records and classifying the tokens according to attribute field; and
preparing a graph by assigning a similarity score to data records in the reference table in relation to other records based on a similarity between tokens of said data records and assigning records to nodes of said graph and similarity scores to edges that interconnect the nodes;
partitioning the graph by grouping together data records whose similarity score with respect to each other is greater than a threshold to form one or more groups of data records; and
within a group, identifying a canonical record based on the similarity of data records to each other within the group.
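The token step recited in claim 7 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: tokens are taken to be whitespace-separated words, each token is tagged with the attribute field it came from, and similarity is assumed to be the Jaccard overlap of those tagged tokens. The helper names and sample records are hypothetical.

```python
def tokenize(record: dict) -> set:
    """Split each attribute value into lowercase tokens tagged with its field."""
    return {
        (field, token)
        for field, value in record.items()
        for token in str(value).lower().split()
    }


def token_similarity(a: dict, b: dict) -> float:
    """Jaccard overlap of (field, token) pairs; an assumed scoring rule."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical records; r3 reuses the words of r1 in different fields.
r1 = {"name": "Boeing Company", "city": "Seattle", "state": "WA"}
r2 = {"name": "Boeing Corporation", "city": "Seattle", "state": "WA"}
r3 = {"name": "Seattle Coffee", "city": "Boeing", "state": "WA"}

print(token_similarity(r1, r2))
print(token_similarity(r1, r3))
```

Because tokens are classified by attribute field, "Seattle" in a name field does not match "Seattle" in a city field, so r1 scores much closer to r2 than to r3 even though r1 and r3 share several words.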
11. A system for removing duplicate data records from a set of data records comprising:
a database management system containing a number of data records contained in one or more tables from which one or more data records are removed;
a processor for identifying tokens contained within the data records and classifying the tokens according to attribute field and wherein said processor assigns a similarity score to data records in the reference table in relation to other data records based on a similarity between tokens of said data records; and
wherein said processor groups together data records whose similarity score with respect to each other is greater than a threshold to form one or more groups of data records that form nodes of a graph wherein edges between nodes represent a similarity score between records of a group; and
then identifies a canonical record based on the similarity of data records to each other within the group.
15. Apparatus for finding a canonical data record for a set of two or more data records comprising:
means for providing a reference table having a number of reference records from which canonical data records are identified;
means for identifying reference table tokens contained within the reference records of the reference table and classifying the reference table tokens according to attribute field; and
means for assigning a similarity score to evaluation data records in the reference table in relation to other records based on a similarity between tokens of said evaluation data records;
means for grouping together evaluation records whose similarity score with respect to each other is greater than a threshold to form groups of records that form nodes of a graph wherein edges between nodes represent a similarity between records of a group wherein each said group identifying a canonical record based on the similarity of data records to each other within the group.
16. A machine readable medium including instructions for executing on a computer for finding similar data records from a set of data records comprising instructions for:
obtaining a number of data records from which one or more canonical data records are identified;
determining a similarity score for data records based on the contents of the records;
grouping together data records whose similarity score with respect to each other is greater than a threshold to form one or more groups of data records that form nodes of a graph wherein edges between nodes represent a similarity score between records of a group; and
identifying a canonical record within each group based on the similarity of data records to each other within the group.
19. The machine readable medium of claim 16 wherein the canonical record is chosen from data records in a group by summing the similarity scores of records found to be similar to each other based on the threshold.
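The selection rule in claim 19 can be illustrated with a short sketch. The claim states only that similarity scores of records found similar under the threshold are summed; choosing the record with the largest sum is an assumption made here, as are the function name, the score layout, and the sample values.

```python
def pick_canonical(group, scores, threshold):
    """Return the member with the largest sum of above-threshold scores.

    `scores` maps an unordered pair frozenset({a, b}) to a similarity value.
    Picking the maximum sum is an assumed reading of the claim.
    """
    def total(member):
        pair_scores = (
            scores.get(frozenset((member, other)), 0.0)
            for other in group
            if other != member
        )
        # Only pairs deemed similar under the threshold contribute.
        return sum(s for s in pair_scores if s > threshold)

    return max(group, key=total)


# Hypothetical group of three records with pairwise scores.
group = ["r1", "r2", "r3"]
scores = {
    frozenset(("r1", "r2")): 0.9,
    frozenset(("r2", "r3")): 0.85,
    frozenset(("r1", "r3")): 0.6,
}
canonical = pick_canonical(group, scores, threshold=0.8)
print(canonical)
```

In this example r2 is similar to both other records, so its summed score is highest and it is selected; ties resolve to the first member in group order.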
Specification