Example-driven design of efficient record matching queries

US 8,046,339 B2
Filed: 06/05/2007
Issued: 10/25/2011
Est. Priority Date: 06/05/2007
Status: Expired due to Fees

Abstract

Example-driven creation of record matching queries. The disclosed architecture exploits the availability of positive (matching) and negative (non-matching) examples to search through the space of possible record matching queries and suggest an initial query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators, which ensures that record matching programs can be executed efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). It exploits the monotonicity property of similarity functions for record matching: any pair of matching records has a higher similarity value than non-matching record pairs on at least one similarity function.
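
The mapping of labeled record pairs into a similarity space can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the two similarity functions (`jaccard`, `edit_ratio`), the attribute names, the sample records, and the helper `higher_on_some_function`, which checks the monotonicity condition described above for one positive and one negative point.

```python
from difflib import SequenceMatcher


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word tokens (illustrative choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def edit_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical similarity functions; each one is a dimension of the space.
SIMILARITY_FUNCTIONS = [("name", jaccard), ("city", edit_ratio)]


def to_point(r: dict, s: dict) -> tuple:
    """Map a record pair (r from R, s from S) to a point in similarity space."""
    return tuple(func(r[attr], s[attr]) for attr, func in SIMILARITY_FUNCTIONS)


def higher_on_some_function(positive: tuple, negative: tuple) -> bool:
    """True if the positive point beats the negative one on at least one dimension."""
    return any(p > n for p, n in zip(positive, negative))


# Made-up labeled examples: one matching pair and one non-matching pair.
matching = [({"name": "Acme Corp", "city": "Seattle"},
             {"name": "Acme Corp Inc", "city": "Seattle"})]
non_matching = [({"name": "Acme Corp", "city": "Seattle"},
                 {"name": "Zenith Labs", "city": "Boston"})]

positive_points = [to_point(r, s) for r, s in matching]
negative_points = [to_point(r, s) for r, s in non_matching]
```

In this toy data the matching pair scores higher than the non-matching pair on the name dimension, so `higher_on_some_function(positive_points[0], negative_points[0])` holds.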

20 Claims

1. A computer-implemented query system, comprising:
- an input component configured to receive an example set of records from two input relations, the example set comprising:
  - pairs of matching records that are labeled as examples of records that are considered a match between the two input relations; and
  - pairs of non-matching records that are labeled as examples of records that are not considered a match between the two input relations; and
- a modeling component configured to:
  - generate an operator tree based on the example set of records from the two input relations, wherein, to generate the operator tree, the modeling component is further configured to:
    - map the pairs of matching records to positive points in a similarity space based on a similarity function;
    - map the pairs of non-matching records to negative points in the similarity space based on the similarity function;
    - generate one or more similarity joins of the operator tree, based on the positive points and the negative points in the similarity space;
    - limit the operator tree to a maximum number of the similarity joins; and
    - limit individual similarity joins of the operator tree to a maximum number of similarity function predicates; and
  - generate a query based on the operator tree, the query being configured to identify individual matching records between the two input relations; and
- one or more processors configured to execute the input component or the modeling component.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The system of claim 1, wherein the operator tree employs an attribute value transformation operator.
  - 3. The system of claim 1, wherein the modeling component is configured to generate the operator tree based on a monotonic property of the similarity function.
  - 4. The system of claim 1, further comprising a quality component configured to quantify a quality of the operator tree based on the matching and non-matching records.
  - 5. The system of claim 4, wherein a quality of the query, as represented by the quality of the operator tree, is based on the non-matching records being less than a user-specified fraction of matching records in a result of the query.
  - 6. The system of claim 1, wherein the example set is provided manually.
  - 7. The system of claim 1, wherein the operator tree is configured to be modified manually.
  - 8. The system of claim 1, wherein the two input relations are joined based on a join predicate that is a conjunction of thresholded similarity values.
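
As a rough illustration of claims 1, 5, and 8, the sketch below models an operator tree as a small set of similarity joins, each a conjunction of thresholded similarity predicates, and enforces the two size limits. The class names, the default limits, the sample points, and the choice to combine joins as a union are assumptions made for this example, not details confirmed by the claims. `meets_quality` is one possible reading of the claim 5 criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    dimension: int    # index of the similarity function in the similarity space
    threshold: float  # a pair passes when its similarity is >= threshold

    def holds(self, point: tuple) -> bool:
        return point[self.dimension] >= self.threshold


@dataclass
class SimilarityJoin:
    predicates: list  # conjunction of thresholded similarity predicates

    def matches(self, point: tuple) -> bool:
        return all(p.holds(point) for p in self.predicates)


@dataclass
class OperatorTree:
    joins: list
    max_joins: int = 2       # hypothetical limit on similarity joins
    max_predicates: int = 2  # hypothetical limit on predicates per join

    def __post_init__(self):
        if len(self.joins) > self.max_joins:
            raise ValueError("too many similarity joins")
        for join in self.joins:
            if len(join.predicates) > self.max_predicates:
                raise ValueError("too many predicates in a similarity join")

    def matches(self, point: tuple) -> bool:
        # Assumption: a pair is returned if any one of the joins accepts it.
        return any(join.matches(point) for join in self.joins)


def meets_quality(tree, positive_points, negative_points, max_fraction):
    """Non-matching pairs in the result must be fewer than max_fraction of matching ones."""
    true_matches = sum(tree.matches(p) for p in positive_points)
    false_matches = sum(tree.matches(n) for n in negative_points)
    return false_matches < max_fraction * true_matches


# Made-up points in a two-dimensional similarity space.
positives = [(0.9, 0.8), (0.3, 0.95)]
negatives = [(0.2, 0.4), (0.85, 0.1)]

tree = OperatorTree(joins=[
    SimilarityJoin([Predicate(0, 0.8), Predicate(1, 0.5)]),
    SimilarityJoin([Predicate(1, 0.9)]),
])
```

With these sample points the tree accepts both positive points and rejects both negative points, so `meets_quality(tree, positives, negatives, 0.1)` is true. Because the comparison is strict, a tree that returns nothing at all does not meet the criterion.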

9. One or more computer-readable storage devices comprising executable instructions that, when executed, cause at least one processor to perform:
- receiving an example set of records from two input relations, the example set comprising:
  - pairs of matching records that are labeled as examples of records that are considered a match between the two input relations; and
  - pairs of non-matching records that are labeled as examples of records that are not considered a match between the two input relations;
- generating an operator tree based on the example set of records from the two input relations, wherein the generating the operator tree comprises:
  - mapping the pairs of matching records to positive points in a similarity space based on a similarity function;
  - mapping the pairs of non-matching records to negative points in the similarity space based on the similarity function;
  - generating one or more similarity joins of the operator tree based on the positive points and the negative points in the similarity space;
  - limiting the operator tree to a maximum number of the similarity joins; and
  - limiting individual similarity joins of the operator tree to a maximum number of similarity function predicates; and
- generating a query based on the operator tree, the query identifying individual matching records between the two input relations.
- Dependent claims (10, 11, 12, 13, 14):
  - 10. The one or more computer-readable storage devices of claim 9, wherein the operator tree employs an attribute value transformation operator.
  - 11. The one or more computer-readable storage devices of claim 9, wherein the operator tree is generated based on a monotonic property of the similarity function.
  - 12. The one or more computer-readable storage devices of claim 9, further comprising executable instructions that, when executed, cause the at least one processor to perform:
    - quantifying a quality of the operator tree based on the matching and non-matching records.
  - 13. The one or more computer-readable storage devices of claim 12, wherein a quality of the query, as represented by the quality of the operator tree, is based on the non-matching records being less than a user-specified fraction of matching records in a result of the query.
  - 14. The one or more computer-readable storage devices of claim 9, wherein the two input relations are joined based on a join predicate that is a conjunction of thresholded similarity values.

15. A method comprising steps of:
- receiving an example set of records from two input relations, the example set comprising:
  - pairs of matching records that are labeled as examples of records that are considered a match between the two input relations; and
  - pairs of non-matching records that are labeled as examples of records that are not considered a match between the two input relations; and
- generating an operator tree based on the example set of records from the two input relations, wherein the generating the operator tree comprises:
  - mapping the pairs of matching records to positive points in a similarity space based on a similarity function;
  - mapping the pairs of non-matching records to negative points in the similarity space based on the similarity function;
  - generating one or more similarity joins of the operator tree based on the positive points and the negative points in the similarity space;
  - limiting the operator tree to a maximum number of the similarity joins; and
  - limiting individual similarity joins of the operator tree to a maximum number of similarity function predicates; and
- generating a query based on the operator tree, the query identifying individual matching records between the two input relations,
- wherein at least one of the steps is performed by one or more computing devices.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The method of claim 15, wherein the operator tree employs an attribute value transformation operator.
  - 17. The method of claim 15, wherein the operator tree is generated based on a monotonic property of the similarity function.
  - 18. The method of claim 15, further comprising:
    - quantifying a quality of the operator tree based on the matching and non-matching records.
  - 19. The method of claim 18, wherein a quality of the query, as represented by the quality of the operator tree, is based on the non-matching records being less than a user-specified fraction of matching records in a result of the query.
  - 20. The method of claim 15, wherein the two input relations are joined based on a join predicate that is a conjunction of thresholded similarity values.
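
The final step of the claimed method, generating a query from the operator tree, might look like the following sketch, which renders each similarity join as a SQL-style join and combines the joins with `UNION`. The SQL dialect, the similarity functions `jaccard` and `edit_sim` (imagined as user-defined functions in the database), the `id` column, and the use of `UNION` are all hypothetical choices for illustration. The claims do not specify a query language.

```python
def render_join(left: str, right: str, predicates: list) -> str:
    """Render one similarity join as a conjunction of thresholded predicates."""
    conditions = " AND ".join(
        f"{func}({left}.{attr}, {right}.{attr}) >= {threshold}"
        for func, attr, threshold in predicates
    )
    return (
        f"SELECT {left}.id AS left_id, {right}.id AS right_id "
        f"FROM {left} JOIN {right} ON {conditions}"
    )


def render_query(left: str, right: str, joins: list) -> str:
    """Combine the rendered joins; UNION is an assumed combination operator."""
    return "\nUNION\n".join(render_join(left, right, preds) for preds in joins)


# Each join is a list of (similarity function, attribute, threshold) triples.
joins = [
    [("jaccard", "name", 0.8), ("edit_sim", "city", 0.5)],
    [("edit_sim", "name", 0.9)],
]

query = render_query("R", "S", joins)
print(query)
```

The generated text is only a string. Running it would require a database in which functions with these names are actually defined.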

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Kaushik, Shriraghav; Chen, Bee Chung; Chaudhuri, Surajit; Ganti, Venkatesh
Primary Examiner(s): Alam, Shahid
Assistant Examiner(s): Willis, Amanda Lynn
Application Number: US11/758,202
Publication Number: US 20080306945A1
Time in Patent Office: 1,603 Days
Field of Search: 707/714, 707/692, 707/797, 707/E17.002
US Class Current: 707/692
CPC Class Codes:
G06F 16/24558 Binary matching operations
G06F 16/2458 Special types of queries, e...
