Declarative framework for deduplication

US 8,200,640 B2
Filed: 06/15/2009
Issued: 06/12/2012
Est. Priority Date: 06/15/2009
Status: Active Grant

Abstract

A system, framework, and algorithms for data deduplication are described. A declarative language, such as a Datalog-type logic language, is provided. Programs in the language describe data to be deduplicated and soft and hard constraints that must/should be satisfied by data deduplicated according to the program. To execute the programs, algorithms for performing graph clustering are described.

20 Claims

1. A method for collective deduplication of entity references in data records stored in a database, the method comprising:
- executing, on a processor, an execution unit that implements a declarative deduplication language using a clustering algorithm and by accessing the database through a database server, wherein the declarative deduplication language is not a Structured Query Language, the execution unit receiving and executing arbitrary programs in the declarative deduplication language;
  
  accessing one or more relational tables of the database containing data records, where the data records contain references to varying real-world entities, and where the references include a plurality of sets of two or more entity references that are duplicates, wherein duplicates comprise references that have different respective textual representations of a same real-world entity;
  
  receiving entity-reference declarative program code of the declarative deduplication language that specifies entity references in the relational tables that are to be deduplicated;
  
  receiving constraint-specifying declarative program code of the declarative deduplication language that specifies one or more constraints that a deduplication of the entity references should satisfy; and
  
  generating output by the execution unit executing the entity-reference declarative program code and the constraint-specifying declarative program code, the output comprising one or more deduplication relations that identify whether or not two entity references are duplicates, and which satisfy the one or more constraints specified in the constraint-specifying declarative program code, wherein each output deduplication relation is an equivalence relation, wherein each equivalence relation partitions the output into corresponding disjoint subsets.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. A method according to claim 1, wherein the entity-reference declarative program code specifies the entity references by defining one or more relational views over the relational tables using an arbitrary view definition language.
  - 3. A method according to claim 2, wherein each relational view identifies one class of entity references that need to be deduplicated and wherein there can be more than one class of entities.
  - 4. A method according to claim 2, wherein the output comprises one deduplication relation for each entity reference relational view, respectively, wherein a deduplication relation contains all pairs of entity references in the corresponding view that are duplicates referring to a same real-world entity.
  - 5. A method according to claim 1, wherein the constraints specify properties that the deduplication of entity references should satisfy to the extent possible, wherein the constraints comprise both hard constraints and soft constraints, wherein hard constraints must be satisfied by the output deduplication, and wherein soft constraints are hints and the executing comprises attempting to satisfy as many of the soft constraints as possible.
  - 6. A method according to claim 2, wherein the constraints comprise statements in first order logic expressed using Datalog rules, where each Datalog rule comprises a head and a body, where the head comprises an intensional database predicate that corresponds to one of the deduplication relations that forms the output, where the body comprises a conjunction of one or more extensional and intensional database predicates, and where an intensional database predicate corresponds to one of the deduplication relations and an extensional database predicate corresponds to one of the relational tables or an entity reference view.
  - 7. A method according to claim 1 wherein the generating is performed with a clustering algorithm that clusters complete undirected graphs where each edge of a graph is marked with one of four labels comprising a first, second, third, and fourth label, and where the algorithm produces a clustering such that for any edge labeled with the third label the vertices of the edge are placed in the same cluster, for any edge labeled with the fourth label, the vertices of the edge are placed in different clusters, and the sum of number of edges labeled with the first label whose vertices are in different clusters and the number of edges labeled with the second label whose vertices are in the same cluster is minimized.
  - 8. A method according to claim 7 wherein the generating further comprises using an algorithm comprising forward voting and backward propagation to transform the deduplication problem to one or more instances of a clustering problem to be solved by the clustering algorithm.
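
The clustering objective recited in claim 7 can be illustrated with a small sketch. The label names `FIRST` through `FOURTH` mirror the claim wording; reading the first two as soft preferences and the last two as hard requirements follows from how the claim treats them. The exhaustive search over partitions is an assumption made only so the example is checkable on a tiny graph; it is not the clustering algorithm of the patent, and the vertex names are hypothetical.

```python
from enum import Enum


class Label(Enum):
    FIRST = 1   # counted as a violation when endpoints land in different clusters
    SECOND = 2  # counted as a violation when endpoints land in the same cluster
    THIRD = 3   # endpoints must be in the same cluster
    FOURTH = 4  # endpoints must be in different clusters


def clustering_cost(labels, cluster_of):
    """Count violated FIRST/SECOND edges; return None if a THIRD/FOURTH edge is violated."""
    cost = 0
    for (u, v), label in labels.items():
        same = cluster_of[u] == cluster_of[v]
        if (label is Label.THIRD and not same) or (label is Label.FOURTH and same):
            return None
        if (label is Label.FIRST and not same) or (label is Label.SECOND and same):
            cost += 1
    return cost


def assignments(n):
    """Yield every partition of n vertices as a list of cluster ids (restricted growth)."""
    def extend(prefix, used):
        if len(prefix) == n:
            yield list(prefix)
            return
        for c in range(used + 1):
            prefix.append(c)
            yield from extend(prefix, max(used, c + 1))
            prefix.pop()
    yield from extend([], 0)


def best_clustering(vertices, labels):
    """Brute-force search (illustrative only): the feasible clustering with the lowest cost."""
    best = None
    for ids in assignments(len(vertices)):
        cluster_of = dict(zip(vertices, ids))
        cost = clustering_cost(labels, cluster_of)
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, cluster_of)
    return best


# Hypothetical complete graph on four entity references.
labels = {
    ("a", "b"): Label.THIRD,
    ("a", "c"): Label.FIRST,
    ("b", "c"): Label.FIRST,
    ("a", "d"): Label.FOURTH,
    ("b", "d"): Label.SECOND,
    ("c", "d"): Label.FIRST,
}
cost, cluster_of = best_clustering(["a", "b", "c", "d"], labels)
```

In this toy graph the only way to honor both hard edges while violating a single soft edge is to group `a`, `b`, and `c` together and leave `d` alone. The search enumerates every partition, so it is usable only for a handful of vertices.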

9. A computer-readable storage media storing information to enable a computer to perform a method of interactive deduplication of data records in a database, the method comprising:
- executing, on a processor, an execution unit that implements a declarative deduplication language by accessing the database through a database server, wherein the declarative deduplication language is not a Structured Query Language, the execution unit receiving and executing arbitrary programs in the declarative deduplication language;
  
  accessing one or more relational tables of the database containing data records, where the data records contain references to varying real-world entities, and where the references include a plurality of sets of two or more entity references that are duplicates, wherein duplicates comprise references that have different respective textual representations of a same real-world entity;
  
  receiving entity-reference declarative program code of the declarative deduplication language that specifies entity references in the relational tables that are to be deduplicated;
  
  receiving constraint-specifying declarative program code of the declarative deduplication language that specifies one or more constraints that a deduplication of the entity references should satisfy; and
  
  generating output by the execution unit executing the entity-reference declarative program code and the constraint-specifying declarative program code, the output comprising one or more deduplication relations that identify whether or not two entity references are duplicates, and which satisfy the one or more constraints specified in the constraint-specifying declarative program code, wherein each output deduplication relation is an equivalence relation, wherein each equivalence relation partitions the output into corresponding disjoint subsets.
- Dependent claims (10, 11, 12, 13, 14)
  - 10. A computer-readable storage media according to claim 9, wherein potential duplicates are identified based on string similarity thereof.
  - 11. A computer-readable storage media according to claim 9, wherein a user interacts with a user-interface module to label selected pairs of data records as duplicates or not.
  - 12. A computer-readable storage media according to claim 9, wherein the method combines interactive user labeling and the output of record matching to produce a deduplication using an algorithm for clustering complete undirected graphs.
  - 13. A computer-readable storage media according to claim 9, wherein the constraints comprise statements in first order logic expressed using Datalog rules.
  - 14. A computer-readable storage media according to claim 13, wherein each Datalog rule comprises a head and a body, where the head comprises an intensional database predicate that corresponds to one of the deduplication relations that forms the output, where the body comprises a conjunction of one or more extensional and intensional database predicates, and where an intensional database predicate corresponds to one of the deduplication relations and an extensional database predicate corresponds to one of the relational tables or an entity reference view.
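
The head/body shape that claims 6 and 14 describe for a Datalog rule can be sketched as a small data structure with a shape check. Everything named here (`Atom`, `Rule`, `check_rule`, and the predicates `PublisherDup`, `PaperDup`, `Publishes`) is hypothetical and chosen for illustration; the listing does not give the concrete syntax of the declarative language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    predicate: str
    args: tuple


@dataclass(frozen=True)
class Rule:
    head: Atom
    body: tuple  # conjunction of Atoms


def check_rule(rule, idb, edb):
    """Return a list of problems; an empty list means the rule has the claimed shape.

    idb: names of intensional predicates (the output deduplication relations).
    edb: names of extensional predicates (tables or entity-reference views).
    """
    problems = []
    if rule.head.predicate not in idb:
        problems.append(f"head {rule.head.predicate} is not a deduplication relation")
    if not rule.body:
        problems.append("body must contain at least one predicate")
    for atom in rule.body:
        if atom.predicate not in idb and atom.predicate not in edb:
            problems.append(f"unknown body predicate {atom.predicate}")
    return problems


# Hypothetical schema: one table and two deduplication relations.
idb = {"PaperDup", "PublisherDup"}
edb = {"Publishes"}

# "If two papers are duplicates, their publishers are duplicates."
good = Rule(
    head=Atom("PublisherDup", ("x", "y")),
    body=(
        Atom("Publishes", ("x", "p1")),
        Atom("Publishes", ("y", "p2")),
        Atom("PaperDup", ("p1", "p2")),
    ),
)

# Malformed: the head names a table rather than a deduplication relation.
bad = Rule(head=Atom("Publishes", ("x", "p")), body=(Atom("PaperDup", ("p", "p")),))
```

The check covers only the structural requirement stated in the claims: an intensional predicate in the head, and a body built from known extensional and intensional predicates. It does not evaluate the rule.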

15. A computer-implemented method of deduplicating data records, the method, performed by a processor and memory of one or more computers, comprising:
- executing an implementation of a deduplication language, the deduplication language not comprising a Structured Query Language, the implementation executing arbitrary strings in the deduplication language using a clustering algorithm, the implementation;
  
  accessing stored data records in one or more relational tables, the data records representing respective real world entities, wherein some of the data records comprise duplicates that mutually represent same respective real world entities;
  
  receiving strings, in electronic form, constructed by one or more users, each string forming a valid program of the deduplication language, the strings specifying, in accordance with the deduplication language, entity references that are to be deduplicated and specifying constraints that corresponding data records, when deduplicated, must or should satisfy;
  
  executing one of the strings to generate a deduplication of the data records, the deduplication comprising deduplication relations that identify pairs of entity references among the data records that satisfy the constraints of the executed string; and
  
  storing in electronic form indicia of the deduplication.
- Dependent claims (16, 17, 18, 19, 20)
  - 16. A computer-implemented method according to claim 15, wherein the executed string specifies a first table of data records to be deduplicated, and the string also specifies a constraint on a second table related to the first table, wherein when the string is executed the constraint on the second table is satisfied in the resulting deduplicated set of data records.
  - 17. A computer-implemented method according to claim 15, wherein the deduplication language allows for specification of constraints in the form of rules, the rules including at least:
    - a soft complete rule which causes the module to give bias to deduplication matches if and only if they satisfy the soft complete rule;
      
      a soft-incomplete rule which causes the module to give bias against deduplication matches that violate the soft-incomplete rule;
      
      a hard rule comprising a rule that must be satisfied by any data records deemed to be duplicates; and
      
      a complex hard rule specifying that data records in the second table that are related to deduplicate-matching data records in the first table must be matching deduplicates.
  - 18. A computer-implemented method according to claim 15, wherein one of the received strings that conforms to the declarative deduplication language comprises a plurality of predicates of the language, a predicate comprising a logic operator having as parameters information identifying respective columns of a database in which the data records are stored.
  - 19. A computer-implemented method according to claim 15, wherein the string is executed by minimizing the number of soft constraints that are violated by deduplicate match clusters.
  - 20. A computer-implemented method according to claim 15, wherein the executing is performed with a clustering algorithm that clusters complete undirected graphs.
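
The cross-table constraint of claim 16, and the complex hard rule of claim 17, can be pictured as propagation: whenever two records in the first table are deemed duplicates, the related records in the second table are forced into the same group. The sketch below is a minimal illustration under assumptions: the union-find structure, the function `propagate`, and the paper/publisher data are all hypothetical, and the listing does not say the patented method works this way.

```python
class DisjointSet:
    """Union-find; its groups always form an equivalence relation over the items seen."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def propagate(first_matches, related):
    """Merge second-table records whose related first-table records are duplicates.

    first_matches: iterable of (id1, id2) pairs from the first table deemed duplicates.
    related: dict mapping each first-table id to its related second-table id.
    """
    second = DisjointSet()
    for s in related.values():
        second.find(s)  # register every second-table record, even unmatched ones
    for r1, r2 in first_matches:
        second.union(related[r1], related[r2])
    return second


# Hypothetical data: papers (first table) and their publishers (second table).
publisher_of = {
    "paper1": "ICDE",
    "paper2": "Intl. Conf. on Data Engineering",
    "paper3": "VLDB",
}
publishers = propagate([("paper1", "paper2")], publisher_of)
```

Because the second-table groups are kept in a union-find structure, the result is reflexive, symmetric, and transitive by construction.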

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Arasu, Arvind; Re, Christopher; Suciu, Dan
Primary Examiner(s): NGUYEN, PHONG H
Application Number: US12/484,406
Publication Number: US 20100318499A1
Time in Patent Office: 1,093 Days
Field of Search: 707/664, 707/692, 707/798, 707/805
US Class Current: 707/692
CPC Class Codes:
G06F 16/215 Improving data quality; Dat...
G06F 16/24556 Aggregation; Duplicate elim...
