Declarative framework for deduplication
Abstract
A system, framework, and algorithms for data deduplication are described. A declarative language, such as a Datalog-type logic language, is provided. Programs in the language describe the data to be deduplicated and the hard and soft constraints that the deduplicated data must or should satisfy, respectively. Graph-clustering algorithms for executing the programs are also described.
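As an illustration only, the interplay of soft and hard constraints with graph clustering can be sketched in Python. This is not the algorithm described in the patent: the greedy merge strategy, the function name cluster, the similarity scores, the cannot-link pairs, and the 0.5 threshold are all assumptions made for the example. Soft constraints are modeled as pairwise similarity scores that favor a merge, and hard constraints as pairs that must never share a cluster.

```python
from itertools import combinations


def cluster(refs, soft_score, cannot_link, threshold=0.5):
    """Greedy clustering sketch (hypothetical, not the patented algorithm).

    soft_score(a, b) returns a similarity in [0, 1] (soft constraint).
    cannot_link is a set of frozenset pairs that must stay apart (hard constraint).
    """
    parent = {r: r for r in refs}       # union-find forest
    members = {r: {r} for r in refs}    # cluster contents, keyed by root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Visit candidate pairs from most to least similar.
    pairs = sorted(combinations(refs, 2), key=lambda p: soft_score(*p), reverse=True)
    for a, b in pairs:
        if soft_score(a, b) < threshold:
            break  # remaining pairs are too dissimilar to merge
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Hard constraint: refuse a merge that would co-cluster a forbidden pair.
        if any(frozenset((x, y)) in cannot_link
               for x in members[ra] for y in members[rb]):
            continue
        parent[rb] = ra
        members[ra] |= members.pop(rb)
    return sorted(sorted(m) for m in members.values())


# Made-up example data.
refs = ["J. Smith", "John Smith", "Jane Smith", "A. Gupta"]
scores = {
    frozenset(("J. Smith", "John Smith")): 0.9,
    frozenset(("J. Smith", "Jane Smith")): 0.8,
    frozenset(("John Smith", "Jane Smith")): 0.6,
}
cannot_link = {frozenset(("John Smith", "Jane Smith"))}

clusters = cluster(refs, lambda a, b: scores.get(frozenset((a, b)), 0.0), cannot_link)
print(clusters)
```

In this sketch "J. Smith" is merged with "John Smith" first; the later merge with "Jane Smith" is rejected because it would violate the hard constraint, even though the soft score alone would allow it.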
20 Claims
1. A method for collective deduplication of entity references in data records stored in a database, the method comprising:
executing, on a processor, an execution unit that implements a declarative deduplication language using a clustering algorithm and by accessing the database through a database server, wherein the declarative deduplication language is not a Structured Query Language, the execution unit receiving and executing arbitrary programs in the declarative deduplication language;
accessing one or more relational tables of the database containing data records, where the data records contain references to varying real-world entities, and where the references include a plurality of sets of two or more entity references that are duplicates, wherein duplicates comprise references that have different respective textual representations of a same real-world entity;
receiving entity-reference declarative program code of the declarative deduplication language that specifies entity references in the relational tables that are to be deduplicated;
receiving constraint-specifying declarative program code of the declarative deduplication language that specifies one or more constraints that a deduplication of the entity references should satisfy; and
generating output by the execution unit executing the entity-reference declarative program code and the constraint-specifying declarative program code, the output comprising one or more deduplication relations that identify whether or not two entity references are duplicates, and which satisfy the one or more constraints specified in the constraint-specifying declarative program code, wherein each output deduplication relation is an equivalence relation, wherein each equivalence relation partitions the output into corresponding disjoint subsets.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
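The closing limitation of claim 1 requires each output deduplication relation to be an equivalence relation that partitions the output into disjoint subsets. The following Python sketch, offered only as an illustration, checks that property on a small relation and derives the partition. The function names is_equivalence and partition and the reference identifiers r1, r2, r3 are hypothetical and do not appear in the patent.

```python
def is_equivalence(refs, dup):
    """Return True if dup (a set of ordered pairs over refs) is an equivalence relation."""
    reflexive = all((r, r) in dup for r in refs)
    symmetric = all((b, a) in dup for (a, b) in dup)
    transitive = all((a, d) in dup
                     for (a, b) in dup for (c, d) in dup if b == c)
    return reflexive and symmetric and transitive


def partition(refs, dup):
    """Group refs into the equivalence classes of dup (assumes is_equivalence holds)."""
    classes, seen = [], set()
    for r in refs:
        if r in seen:
            continue
        cls = {s for s in refs if (r, s) in dup}
        classes.append(sorted(cls))
        seen |= cls
    return classes


refs = ["r1", "r2", "r3"]
reflexive_pairs = {(r, r) for r in refs}

# r1 and r2 are duplicates; r3 stands alone.
dup = reflexive_pairs | {("r1", "r2"), ("r2", "r1")}

# Not transitive: r1~r2 and r2~r3 hold, but r1~r3 is missing.
bad = dup | {("r2", "r3"), ("r3", "r2")}

print(is_equivalence(refs, dup), partition(refs, dup))
print(is_equivalence(refs, bad))
```

The second relation fails the check because pairwise duplicate decisions made independently need not be transitive, which is one reason a clustering step is a natural way to produce such output.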
9. A computer-readable storage media storing information to enable a computer to perform a method of interactive deduplication of data records in a database, the method comprising:
executing, on a processor, an execution unit that implements a declarative deduplication language by accessing the database through a database server, wherein the declarative deduplication language is not a Structured Query Language, the execution unit receiving and executing arbitrary programs in the declarative deduplication language;
accessing one or more relational tables of the database containing data records, where the data records contain references to varying real-world entities, and where the references include a plurality of sets of two or more entity references that are duplicates, wherein duplicates comprise references that have different respective textual representations of a same real-world entity;
receiving entity-reference declarative program code of the declarative deduplication language that specifies entity references in the relational tables that are to be deduplicated;
receiving constraint-specifying declarative program code of the declarative deduplication language that specifies one or more constraints that a deduplication of the entity references should satisfy; and
generating output by the execution unit executing the entity-reference declarative program code and the constraint-specifying declarative program code, the output comprising one or more deduplication relations that identify whether or not two entity references are duplicates, and which satisfy the one or more constraints specified in the constraint-specifying declarative program code, wherein each output deduplication relation is an equivalence relation, wherein each equivalence relation partitions the output into corresponding disjoint subsets.
Dependent claims: 10, 11, 12, 13, 14
15. A computer-implemented method of deduplicating data records, the method, performed by a processor and memory of one or more computers, comprising:
executing an implementation of a deduplication language, the deduplication language not comprising a Structured Query Language, the implementation executing arbitrary strings in the deduplication language using a clustering algorithm, the implementation:
accessing stored data records in one or more relational tables, the data records representing respective real world entities, wherein some of the data records comprise duplicates that mutually represent same respective real world entities;
receiving strings, in electronic form, constructed by one or more users, each string forming a valid program of the deduplication language, the strings specifying, in accordance with the deduplication language, entity references that are to be deduplicated and specifying constraints that corresponding data records, when deduplicated, must or should satisfy;
executing one of the strings to generate a deduplication of the data records, the deduplication comprising deduplication relations that identify pairs of entity references among the data records that satisfy the constraints of the executed string; and
storing in electronic form indicia of the deduplication.
Dependent claims: 16, 17, 18, 19, 20
Specification