Efficient fuzzy match for evaluating data records
First Claim
1. A process for testing an evaluation data record having attribute fields containing data comprising:
- providing a reference table having a number of reference records against which a evaluation data record is tested;
identifying reference table tokens contained within the reference records of the reference table and determining a count of tokens in the reference table classified according to attribute field; and
assigning a similarity score to said evaluation data record in relation to a reference record within the reference table based on a combination of;
the number of common tokens of an evaluation field of the input data record and a corresponding field within a reference record from the reference table;
the similarity of the tokens that are not the same in the evaluation field of the input data record and the corresponding field of the reference record from the reference table; and
a weight of the tokens of the evaluation data record that is based on a count of the tokens from a corresponding field contained within the reference table.
2 Assignments
0 Petitions
Accused Products
Abstract
To help ensure high data quality, data warehouses validate and clean, if needed incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
-
Citations
46 Claims
-
1. A process for testing an evaluation data record having attribute fields containing data comprising:
-
providing a reference table having a number of reference records against which a evaluation data record is tested;
identifying reference table tokens contained within the reference records of the reference table and determining a count of tokens in the reference table classified according to attribute field; and
assigning a similarity score to said evaluation data record in relation to a reference record within the reference table based on a combination of;
the number of common tokens of an evaluation field of the input data record and a corresponding field within a reference record from the reference table;
the similarity of the tokens that are not the same in the evaluation field of the input data record and the corresponding field of the reference record from the reference table; and
a weight of the tokens of the evaluation data record that is based on a count of the tokens from a corresponding field contained within the reference table. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
-
-
16. A system for evaluating an input data record having fields containing data comprising:
-
a database for storing a reference table having a number of records against which an input data record is evaluated;
a preprocessor component for evaluating records in the reference table to identify tokens and determining a count of tokens in the reference table classified according to record field; and
a matching component for assigning a score to an input data record in relation to a reference record within the reference table based, on a combination of;
i) the number of common tokens of an evaluation field of the input data record and a corresponding field within a reference record from the reference table;
ii) the similarity of the tokens that are not the same in the evaluation field of the input data record and the corresponding field of the reference record from the reference table; and
iii) a weight of the tokens of the evaluation data record that is based on a count of the tokens from the corresponding field contained within the reference table. - View Dependent Claims (17, 18)
-
-
19. A process for evaluating an input data record having attribute fields containing data comprising:
-
providing a number of reference records organized into attribute fields against which an input data record is evaluated;
evaluating reference records to identify tokens from said attribute fields and then evaluating each token to build a vector of token substrings that represent the token;
building an index table wherein entries of the index table contains a token substring and a list of reference records that contain a token that maps to the token substring; and
looking up reference records in the index table based on the contents of the input record and selecting a number of candidate records from the reference records in the index table for comparing to said input data record. - View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33)
-
-
34. A system for evaluating an input data record having fields containing data comprising:
-
a database for storing a reference table having a number of reference records against which an input data record is evaluated;
a preprocessor component for evaluating reference records in the reference table to identify tokens and determining a count of tokens in the reference table classified according to record field;
said preprocessor evaluating reference records to identify tokens from said attribute fields and then evaluating each token to build a H dimensional vector of token substrings that represent the token; and
building an index table wherein entries of the index table contains a token substring, an attribute field, a position within the H dimensional vector, and a list of reference records; and
a matching component for assigning a score to an input data record in relation to a reference record within the reference table by building a candidate record table of candidate records from the index table based on an H dimensional vector of token substrings determined from tokens contained in the input record and assigning a score to said candidate records based on a weight of the tokens of the input data record that is based on a count of the tokens from the corresponding field contained within the reference table.
-
-
35. A data structure for use in evaluating an input data record having fields containing data comprising:
-
a reference table organized in attribute columns having a number of records against which an input data record is evaluated; and
an index table wherein each entry of the index table contains a token substring from a token in the reference table, a column of the reference table having said token from which the token substring is derived, a position within a H dimensional vector based on said token, and a list of records contained within the reference table. - View Dependent Claims (36)
-
-
37. A machine readable medium including instructions for evaluating an input data record having attribute fields containing by steps of:
-
accessing a reference table having a number of records organized into attribute fields against which an input data record is evaluated;
evaluating records in the reference table to identify tokens from said attribute fields and then evaluating each token with a function to build a vector of token substrings that serve as a signature of the token;
building an index table wherein each entry of the index table contains a token substring, a column of the reference table, a position within the vector, and a list of records contained within the reference table; and
looking up records in the index table based on the contents of the input record. - View Dependent Claims (38, 39, 40, 41, 42, 43, 44, 45, 46)
-
Specification