DECLARATIVE FRAMEWORK FOR DEDUPLICATION
First Claim
1. A method for collective deduplication of entity references in data records stored in a database, the method comprising:
- accessing one or more relational tables of the database containing data records, where the data records contain references to varying real-world entities, and where the references include a plurality of sets of two or more entity references that are duplicates, wherein duplicates comprise references that have different respective textual representations of a same real-world entity;
- receiving entity-reference declarative program code that declaratively specifies entity references in the relational tables that are to be deduplicated;
- receiving constraint-specifying declarative program code that declaratively specifies one or more constraints that a deduplication of the entity references should satisfy; and
- generating output by executing on a processor the entity-reference declarative program code and the constraint-specifying declarative program code, the output comprising one or more deduplication relations that identify whether or not two entity references are duplicates, and which satisfy the one or more constraints specified in the constraint-specifying declarative program code, wherein each output deduplication relation is an equivalence relation, wherein each equivalence relation partitions the output into corresponding disjoint subsets.
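The closing limitation of claim 1 requires each output deduplication relation to be an equivalence relation, which is what makes it partition the references into disjoint subsets. The following is a minimal illustrative sketch of that property only, not the claimed method: it closes a set of duplicate pairs under reflexivity, symmetry, and transitivity using union-find. The function name `partition_references` and the sample references are hypothetical.

```python
from collections import defaultdict


def partition_references(references, duplicate_pairs):
    """Return the disjoint subsets induced by the equivalence closure of duplicate_pairs."""
    # Every reference starts as its own representative (reflexivity).
    parent = {r: r for r in references}

    def find(r):
        # Follow parent links to the representative, halving the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    # Merging representatives makes the relation symmetric and transitive.
    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for r in references:
        groups[find(r)].add(r)
    # Sorted output keeps the result deterministic.
    return sorted(sorted(group) for group in groups.values())


# Hypothetical entity references with different textual representations.
refs = ["J. Smith", "John Smith", "Smith, John", "A. Jones", "Alice Jones", "B. Lee"]
pairs = [
    ("J. Smith", "John Smith"),
    ("John Smith", "Smith, John"),
    ("A. Jones", "Alice Jones"),
]
classes = partition_references(refs, pairs)
print(classes)
```

Here "J. Smith" and "Smith, John" end up in the same subset even though no input pair links them directly, because transitivity is implied by the equivalence requirement.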
2 Assignments
0 Petitions
Abstract
A system, framework, and algorithms for data deduplication are described. A declarative language, such as a Datalog-type logic language, is provided. Programs in the language describe data to be deduplicated and soft and hard constraints that must/should be satisfied by data deduplicated according to the program. To execute the programs, algorithms for performing graph clustering are described.
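To make the distinction between hard and soft constraints concrete, here is a small sketch under stated assumptions. It is not the clustering algorithm of the specification, which is not reproduced on this page; it is a simple greedy heuristic chosen for illustration. Hard "equal" and "not equal" constraints are always enforced, while weighted soft "equal" hints are applied in descending weight order and skipped when they would break a hard constraint. The function `deduplicate`, its parameters, and the sample data are all hypothetical.

```python
def deduplicate(references, hard_equal, hard_not_equal, soft_equal):
    """Greedy clustering: hard constraints must hold, soft ones are applied when possible."""
    # Map every reference to the cluster (a frozenset) that currently contains it.
    cluster_of = {r: frozenset([r]) for r in references}

    def violates(cluster):
        # A cluster is invalid if it contains both ends of a hard "not equal" pair.
        return any(a in cluster and b in cluster for a, b in hard_not_equal)

    def merge(a, b):
        merged = cluster_of[a] | cluster_of[b]
        for r in merged:
            cluster_of[r] = merged
        return merged

    # Hard "equal" constraints are mandatory; a conflict means the input is contradictory.
    for a, b in hard_equal:
        if violates(merge(a, b)):
            raise ValueError("hard constraints are contradictory")

    # Soft constraints: strongest evidence first, skipped if a hard constraint would break.
    for a, b, _weight in sorted(soft_equal, key=lambda edge: -edge[2]):
        if not violates(cluster_of[a] | cluster_of[b]):
            merge(a, b)

    return sorted(sorted(c) for c in set(cluster_of.values()))


result = deduplicate(
    references=["p1", "p2", "p3", "p4"],
    hard_equal=[("p1", "p2")],
    hard_not_equal=[("p2", "p3")],
    soft_equal=[("p1", "p3", 0.9), ("p3", "p4", 0.6)],
)
print(result)
```

In this run the soft hint linking p1 and p3 is dropped despite its high weight, because accepting it would place p2 and p3 in the same cluster and violate a hard constraint.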
32 Citations
20 Claims
1. A method for collective deduplication of entity references in data records stored in a database, the method comprising:
- accessing one or more relational tables of the database containing data records, where the data records contain references to varying real-world entities, and where the references include a plurality of sets of two or more entity references that are duplicates, wherein duplicates comprise references that have different respective textual representations of a same real-world entity;
- receiving entity-reference declarative program code that declaratively specifies entity references in the relational tables that are to be deduplicated;
- receiving constraint-specifying declarative program code that declaratively specifies one or more constraints that a deduplication of the entity references should satisfy; and
- generating output by executing on a processor the entity-reference declarative program code and the constraint-specifying declarative program code, the output comprising one or more deduplication relations that identify whether or not two entity references are duplicates, and which satisfy the one or more constraints specified in the constraint-specifying declarative program code, wherein each output deduplication relation is an equivalence relation, wherein each equivalence relation partitions the output into corresponding disjoint subsets.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
9. A computer-readable storage media storing information to enable a computer to perform a method of interactive deduplication of data records in a database, where input to the method comprises a single table containing data records, where each data record in the table corresponds to some real-world entity, and two data records can be duplicates in that they correspond to a same real-world entity, the method comprising receiving interactive input specifying deduplication constraints and data records to be constrained thereby, wherein output of the method comprises a deduplication relation that identifies pairs of input records that are duplicates.
15. A computer-implemented method of deduplicating data records, the method, performed by a processor and memory of one or more computers, comprising:
- accessing stored data records in one or more relational tables, the data records representing respective real world entities, wherein some of the data records comprise duplicates that mutually represent same respective real world entities;
- receiving strings, in electronic form, constructed by one or more users, each string forming a valid program of a declarative deduplication language, each string specifying, in accordance with the deduplication language, entity references that are to be deduplicated and specifying constraints that corresponding data records, when deduplicated, must or should satisfy;
- executing one of the strings to generate a deduplication of the data records, the deduplication comprising deduplication relations that identify pairs of entity references among the data records that satisfy the constraints of the executed string; and
- storing in electronic form indicia of the deduplication.
Dependent claims: 16, 17, 18, 19, 20.
Specification