Example-driven design of efficient record matching queries
First Claim
1. A computer-implemented query system, comprising:
- an input component configured to receive an example set of records from two input relations, the example set comprising;
pairs of matching records that are labeled as examples of records that are considered a match between the two input relations; and
pairs of non-matching records that are labeled as examples of records that are not considered a match between the two input relations; and
a modeling component configured to;
generate an operator tree based on the example set of records from the two input relations, wherein, to generate the operator tree, the modeling component is further configured to;
map the pairs of matching records to positive points in a similarity space based on a similarity function;
map the pairs of non-matching records to negative points in the similarity space based on the similarity function;
generate one or more similarity joins of the operator tree, based on the positive points and the negative points in the similarity space;
limit the operator tree to a maximum number of the similarity joins; and
limit individual similarity joins of the operator tree to a maximum number of similarity function predicates; and
generate a query based on the operator tree, the query being configured to identify individual matching records between the two input relations; and
one or more processors configured to execute the input component or the modeling component.
2 Assignments
0 Petitions
Accused Products
Abstract
Example-driven creation of record matching queries. The disclosed architecture employs techniques that exploit the availability of positive (or matching) and negative (non-matching) examples to search through this space and suggest an initial record matching query. The record matching task is modeled as that of designing an operator tree obtained by composing a few primitive operators. This ensures that record matching programs be executable efficiently and scalably over large input relations. The architecture joins records across multiple (e.g., two) relations (e.g., R and S). The architecture exploits the monotonicity property of similarity functions for record matching in the relations, in that, any pair of matching records have a higher similarity value than non-matching record pairs on at least one similarity function.
-
Citations
20 Claims
-
1. A computer-implemented query system, comprising:
-
an input component configured to receive an example set of records from two input relations, the example set comprising; pairs of matching records that are labeled as examples of records that are considered a match between the two input relations; and pairs of non-matching records that are labeled as examples of records that are not considered a match between the two input relations; and a modeling component configured to; generate an operator tree based on the example set of records from the two input relations, wherein, to generate the operator tree, the modeling component is further configured to; map the pairs of matching records to positive points in a similarity space based on a similarity function; map the pairs of non-matching records to negative points in the similarity space based on the similarity function; generate one or more similarity joins of the operator tree, based on the positive points and the negative points in the similarity space; limit the operator tree to a maximum number of the similarity joins; and limit individual similarity joins of the operator tree to a maximum number of similarity function predicates; and generate a query based on the operator tree, the query being configured to identify individual matching records between the two input relations; and one or more processors configured to execute the input component or the modeling component. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. One or more computer-readable storage devices comprising executable instructions that, when executed, cause at least one processor to perform:
-
receiving an example set of records from two input relations, the example set comprising; pairs of matching records that are labeled as examples of records that are considered a match between the two input relations; and pairs of non-matching records that are labeled as examples of records that are not considered a match between the two input relations; generating an operator tree based on the example set of records from the two input relations, wherein the generating the operator tree comprises; mapping the pairs of matching records to positive points in a similarity space based on a similarity function; mapping the pairs of non-matching records to negative points in the similarity space based on the similarity function; generating one or more similarity joins of the operator tree based on the positive points and the negative points in the similarity space; limiting the operator tree to a maximum number of the similarity joins; and limiting individual similarity joins of the operator tree to a maximum number of similarity function predicates; and generating a query based on the operator tree, the query identifying individual matching records between the two input relations. - View Dependent Claims (10, 11, 12, 13, 14)
-
-
15. A method comprising steps of:
-
receiving an example set of records from two input relations, the example set comprising; pairs of matching records that are labeled as examples of records that are considered a match between the two input relations; and pairs of non-matching records that are labeled as examples of records that are not considered a match between the two input relations; and generating an operator tree based on the example set of records from the two input relations, wherein the generating the operator tree comprises; mapping the pairs of matching records to positive points in a similarity space based on a similarity function; mapping the pairs of non-matching records to negative points in the similarity space based on the similarity function; generating one or more similarity joins of the operator tree based on the positive points and the negative points in the similarity space; limiting the operator tree to a maximum number of the similarity joins; and limiting individual similarity joins of the operator tree to a maximum number of similarity function predicates; and generating a query based on the operator tree, the query identifying individual matching records between the two input relations, wherein at least one of the steps is performed by one or more computing devices. - View Dependent Claims (16, 17, 18, 19, 20)
-
Specification