Duplicate data elimination system
Abstract
A process for finding similar data records in a set of data records. A database table or tables provide a number of data records from which one or more canonical data records are identified. Tokens are identified within the data records and classified according to attribute field. A similarity score is assigned to data records in relation to other data records based on a similarity between tokens of the data records. Data records whose similarity score with respect to each other is greater than a threshold form one or more groups of data records. The records or tuples form nodes of a graph wherein edges between nodes represent a similarity score between records of a group. Within each group a canonical record is identified based on the similarity of data records to each other within the group.
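The scoring and grouping steps described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the Jaccard overlap used as the similarity measure, the use of connected components to form groups, and the names `field_tokens`, `similarity`, and `group_records` are all assumptions made for illustration. Tokens are tagged with their attribute field, so the same token appearing under two different fields does not count as a match.

```python
from itertools import combinations

def field_tokens(record):
    """Return the set of (field, token) pairs for a record given as a dict."""
    return {
        (field, token)
        for field, value in record.items()
        for token in str(value).lower().split()
    }

def similarity(a, b):
    """Jaccard overlap of field-tagged tokens (an assumed measure)."""
    ta, tb = field_tokens(a), field_tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def group_records(records, threshold):
    """Join records scoring above the threshold with graph edges, then
    return the connected components (as index lists) and the edge scores."""
    edges = {}
    neighbors = {i: set() for i in range(len(records))}
    for i, j in combinations(range(len(records)), 2):
        score = similarity(records[i], records[j])
        if score > threshold:
            edges[(i, j)] = score
            neighbors[i].add(j)
            neighbors[j].add(i)
    groups, seen = [], set()
    for start in neighbors:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # depth-first walk over one component
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        groups.append(sorted(component))
    return groups, edges

# Hypothetical sample rows; record 2 reuses the tokens of record 0
# under different attribute fields.
records = [
    {"name": "John Smith", "city": "Seattle"},
    {"name": "John Smith", "city": "Seattle WA"},
    {"name": "Seattle Smith", "city": "John"},
    {"name": "Mary Jones", "city": "Boston"},
]
groups, edges = group_records(records, threshold=0.5)
print(groups)  # [[0, 1], [2], [3]]
```

Records 0 and 2 contain the same three words, but only "smith" sits in the same field in both, so their score is 0.2 and they are not grouped.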
12 Claims
1. A computer implemented method for removing similar data records from a set of data records comprising:
providing a number of data records from the set of data records, from which one or more canonical data records are to be identified;
determining similarity scores for the number of data records based on the contents of the data records;
grouping together data records whose similarity score with respect to each other is greater than a threshold, wherein one or more groups of data records form nodes of a graph, and further wherein edges between nodes represent a similarity score between data records of a group; and
within each said group of data records, identifying a canonical record based on the similarity of data records to each other within each group,
wherein the set of data records is from a database table and within each group of data records, data records similar to the canonical record of the group are removed from the database table while leaving the canonical record stored in the database table,
wherein the similarity scores are determined by identifying tokens contained within data records and distinguishing the tokens according to respective attribute fields associated with the tokens and assigning a similarity score to data records in relation to other data records based on a similarity between tokens of said data records and said other data records, wherein a same token in two different data records is evaluated differently when assigning similarity score if the same token in the two different data records is associated with two different attribute fields respectively.
Dependent claims: 2, 3, 4.
5. A system for removing duplicate data records from a set of data records comprising:
a processor;
an application program executed by the processor for:
providing a number of data records from the set of data records, from which one or more canonical data records are to be identified;
determining similarity scores for the number of data records based on the contents of the data records;
grouping together data records whose similarity score with respect to each other is greater than a threshold, wherein one or more groups of data records form nodes of a graph, and further wherein edges between nodes represent a similarity score between data records of a group; and
within each said group of data records, identifying a canonical record based on the similarity of data records to each other within each group,
wherein the set of data records is from a database table and within each group of data records, data records similar to the canonical record of the group are removed from the database table while leaving the canonical record stored in the database table,
wherein the similarity scores are determined by identifying tokens contained within data records and distinguishing the tokens according to respective attribute fields associated with the tokens and assigning a similarity score to data records in relation to other data records based on a similarity between tokens of said data records and said other data records, wherein a same token in two different data records is evaluated differently when assigning similarity score if the same token in the two different data records is associated with two different attribute fields respectively.
Dependent claims: 6, 7, 8.
9. A machine readable medium including instructions for executing on a computer for finding and removing similar data records from a set of data records comprising instructions for:
providing a number of data records from the set of data records, from which one or more canonical data records are to be identified;
determining similarity scores for the number of data records based on the contents of the data records;
grouping together data records whose similarity score with respect to each other is greater than a threshold, wherein one or more groups of data records form nodes of a graph, and further wherein edges between nodes represent a similarity score between data records of a group; and
within each said group of data records, identifying a canonical record based on the similarity of data records to each other within each group,
wherein the set of data records is from a database table and within each group of data records, data records similar to the canonical record of the group are removed from the database table while leaving the canonical record stored in the database table,
wherein the similarity scores are determined by identifying tokens contained within data records and distinguishing the tokens according to respective attribute fields associated with the tokens and assigning a similarity score to data records in relation to other data records based on a similarity between tokens of said data records and said other data records, wherein a same token in two different data records is evaluated differently when assigning similarity score if the same token in the two different data records is associated with two different attribute fields respectively.
Dependent claims: 10, 11.
12. The machine readable medium of claim 9 wherein the canonical record is chosen from data records in a group by summing the similarity scores of records found to be similar to each other based on the threshold.
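The canonical-record selection of claim 12 and the removal step of the independent claims can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the claims do not say how the summed scores are compared, so choosing the record with the largest sum and breaking ties by smallest id are assumptions, as are the hypothetical names `choose_canonical` and `remove_duplicates`, the `customers` table, and the sample scores.

```python
import sqlite3

def choose_canonical(group, edges):
    """Return the record id with the largest summed similarity to its group.

    `edges` maps (id_a, id_b) to a score already known to exceed the
    threshold. Ties go to the smallest id (an arbitrary, assumed rule).
    """
    totals = {rid: 0.0 for rid in group}
    for (a, b), score in edges.items():
        if a in totals and b in totals:
            totals[a] += score
            totals[b] += score
    return min(group, key=lambda rid: (-totals[rid], rid))

def remove_duplicates(conn, table, group, canonical):
    """Delete every row of the group except the canonical one.

    The table name is interpolated directly, so it must come from
    trusted code, never from user input.
    """
    doomed = [(rid,) for rid in group if rid != canonical]
    conn.executemany(f"DELETE FROM {table} WHERE id = ?", doomed)
    conn.commit()

# Hypothetical table and above-threshold scores for one group.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "John Smith"), (2, "Jon Smith"), (3, "John Smyth"), (4, "Mary Jones")],
)
edges = {(1, 2): 0.9, (1, 3): 0.8}
group = [1, 2, 3]

canonical = choose_canonical(group, edges)
remove_duplicates(conn, "customers", group, canonical)
remaining = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
print(canonical, remaining)  # 1 [1, 4]
```

Record 1 is similar to both other members of its group, so its summed score is the highest and it is the row left in the table; record 4 belongs to no group and is untouched.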
Specification