Efficient fuzzy match for evaluating data records
First Claim
1. A process for testing an evaluation data record having attribute fields containing data comprising:
- providing a reference table having a number of reference records against which a evaluation data record is tested;
identifying reference table tokens contained within the reference records of the reference table and determining a count of tokens in the reference table classified according to attribute field; and
assigning a similarity score to said evaluation data record in relation to a reference record within the reference table based on a combination of;
the number of common tokens of an evaluation field of the evaluation data record and a corresponding field within a reference record from the reference table;
the similarity of the tokens that are not the same in the evaluation field of the evaluation data record and the corresponding field of the reference record from the reference table; and
a weight of the tokens of the evaluation data record that is based on a count of the tokens from a corresponding field contained within the reference table; and
wherein once a likely reference record that matches the evaluation data record with a specified degree of certainty is found, further searching for records in the reference table is stopped.
2 Assignments
0 Petitions
Accused Products
Abstract
To help ensure high data quality, data warehouses validate and clean, if needed incoming data tuples from external sources. In many situations, input tuples or portions of input tuples must match acceptable tuples in a reference table. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A disclosed system implements an efficient and accurate approximate or fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any of the multiple tuples in the reference relation. A disclosed similarity function that utilizes token substrings referred to as q-grams overcomes limitations of prior art similarity functions while efficiently performing a fuzzy match process.
-
Citations
42 Claims
-
1. A process for testing an evaluation data record having attribute fields containing data comprising:
-
providing a reference table having a number of reference records against which a evaluation data record is tested; identifying reference table tokens contained within the reference records of the reference table and determining a count of tokens in the reference table classified according to attribute field; and assigning a similarity score to said evaluation data record in relation to a reference record within the reference table based on a combination of; the number of common tokens of an evaluation field of the evaluation data record and a corresponding field within a reference record from the reference table; the similarity of the tokens that are not the same in the evaluation field of the evaluation data record and the corresponding field of the reference record from the reference table; and a weight of the tokens of the evaluation data record that is based on a count of the tokens from a corresponding field contained within the reference table; and wherein once a likely reference record that matches the evaluation data record with a specified degree of certainty is found, further searching for records in the reference table is stopped. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
-
-
12. A process for testing an evaluation data record having attribute fields containing data comprising:
-
providing a reference table having a number of reference records against which a evaluation data record is tested; identifying reference table tokens contained within the reference records of the reference table and determining a count of tokens in the reference table classified according to attribute field; assigning a similarity score to said evaluation data record in relation to a reference record within the reference table based on a combination of; the number of common tokens of an evaluation field of the input data record and a corresponding field within a reference record from the reference table; the similarity of the tokens that are not the same in the evaluation field of the input data record and the corresponding field of the reference record from the reference table; and a weight of the tokens of the evaluation data record that is based on a count of the tokens from a corresponding field contained within the reference table; wherein a look-up table based on contents of reference records in the reference table is prepared before evaluation of the evaluation data record and wherein the tokens of the evaluation data record are evaluated by comparing the contents of the look-up table with contents of the tokens of said evaluation data record to prepare a candidate set of reference records for which a similarity score is assigned; evaluating tokens in the reference table by; breaking tokens in the reference table up into sets of substrings having a length q; applying a function to the set of substrings for a token to provide a vector representative of a token; and building a lookup table for substrings found within the tokens that make up the reference table, wherein the process of building the lookup table creates an entry for each substring comprising;
an attribute field for said substring, a co-ordinate within a vector for said substring, a frequency of said substring, and a list of reference records where said substring appears in the specified attribute field and vector co-ordinate position;wherein a candidate record table is built and records listed in the lookup table are added to the candidate record table based on vector representations of the tokens of the input record; and wherein once a likely reference record that matches the evaluation data record with a specified degree of certainty is found, further searching for records in the reference table is stopped. - View Dependent Claims (13)
-
-
14. A process for testing an evaluation data record having attribute fields containing data comprising:
-
providing a reference table having a number of reference records against which a evaluation data record is tested; identifying reference table tokens contained within the reference records of the reference table and determining a count of tokens in the reference table classified according to attribute field; assigning a similarity score to said evaluation data record in relation to a reference record within the reference table based on a combination of; the number of common tokens of an evaluation field of the input data record and a corresponding field within a reference record from the reference table; the similarity of the tokens that are not the same in the evaluation field of the input data record and the corresponding field of the reference record from the reference table; and a weight of the tokens of the evaluation data record that is based on a count of the tokens from a corresponding field contained within the reference table; and maintaining a token frequency cache in a high speed access memory for use in assigning weights to said tokens.
-
-
15. A system for evaluating an input data record having fields containing data comprising:
-
a database for storing a reference table having a number of records against which an input data record is evaluated; a preprocessor component for evaluating records in the reference table to identify tokens and determining a count of tokens in the reference table classified according to record field; and a matching component for assigning a score to an input data record in relation to a reference record within the reference table based on a combination of; i) the number of common tokens of an evaluation field of the input data record and a corresponding field within a reference record from the reference table; ii) the similarity of the tokens that are not the same in the evaluation field of the input data record and the corresponding field of the reference record from the reference table; and iii) a weight of the tokens of the input data record that is based on a count of the tokens from the corresponding field contained within the reference table; and wherein once a likely reference record that matches the input data record with a specified degree of certainty is found, further searching for records in the reference table is stopped. - View Dependent Claims (16, 17)
-
-
18. A process for evaluating an input data record having attribute fields containing data comprising:
-
providing a number of reference records organized into attribute fields against which an input data record is evaluated; evaluating reference records to identify tokens from said attribute fields and then evaluating each token to build a vector of token substrings that represent the token; building an index table wherein entries of the index table contains a token substring and a list of reference records that contain a token that maps to the token substring; and looking up reference records in the index table based on the contents of the input record and selecting a number of candidate records from the reference records in the index table for comparing to said input data record, wherein once a likely reference record that matches the input data record with a specified degree of certainty is found, further searching for reference records in the index table is stopped. - View Dependent Claims (19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
-
-
29. A process for evaluating an input data record having attribute fields containing data comprising:
-
providing a number of reference records organized into attribute fields against which an input data record is evaluated; evaluating reference records to identify tokens from said attribute fields and then evaluating each token to build a vector of token substrings that represent the token; building an index table wherein entries of the index table contains a token substring and a list of reference records that contain a token that maps to the token substring; looking up reference records in the index table based on the contents of the input record and selecting a number of candidate records from the reference records in the index table for comparing to said input data record; wherein a candidate record table is built and candidate records from the index table are added to a candidate record table based on an H dimensional vector of token substrings determined from tokens contained in the input record; wherein a candidate record is added to the candidate record table only if a possible score assigned to the reference record in the reference table can exceed a threshold based on an already evaluated substring; and wherein once a likely reference record that matches the input data record with a specified degree of certainty is found, further searching for reference records in the index table is stopped.
-
-
30. A process for evaluating an input data record having attribute fields containing data comprising:
-
providing a number of reference records organized into attribute fields against which an input data record is evaluated; evaluating reference records to identify tokens from said attribute fields and then evaluating each token to build a vector of token substrings that represent the token; building an index table wherein entries of the index table contains a token substring and a list of reference records that contain a token that maps to the token substring; looking up reference records in the index table based on the contents of the input record and selecting a number of candidate records from the reference records in the index table for comparing to said input data record; assigning a similarity score to said input data record in relation to a candidate set of reference records based on a combination of; the number of common tokens of an evaluation field of the input data record and a corresponding field within a reference record; the similarity of the tokens that are not the same in the evaluation field of the input data record and the corresponding field of the reference record; and a weight of the tokens in the evaluation field of the input data record based on a count of the tokens from the corresponding field contained within the reference records; and maintaining a token frequency cache in a high speed access memory for use in assigning weights to said tokens.
-
-
31. A system for evaluating an input data record having fields containing data comprising:
-
a database for storing a reference table having a number of reference records against which an input data record is evaluated; a preprocessor component for evaluating reference records in the reference table to identify tokens and determining a count of tokens in the reference table classified according to record field;
said preprocessor evaluating reference records to identify tokens from said attribute fields and then evaluating each token to build a H dimensional vector of token substrings that represent the token; and
building an index table wherein entries of the index table contains a token substring , an attribute field, a position within the H dimensional vector, and a list of reference records; anda matching component for assigning a score to an input data record in relation to a reference record within the reference table by building a candidate record table of candidate records from the index table based on an H dimensional vector of token substrings determined from tokens contained in the input record and assigning a score to said candidate records based on a weight of the tokens of the input data record that is based on a count of the tokens from the corresponding field contained within the reference table; and wherein once a likely reference record that matches the input data record with a specified degree of certainty is found, further searching for reference records in the reference table is stopped.
-
-
32. A data structure encoded on a computer readable medium for use in evaluating an input data record having fields containing data comprising:
-
a reference table organized in attribute columns having a number of records against which an input data record is evaluated; and an index table wherein each entry of the index table contains a token substring from a token in the reference table, a column of the reference table having said token from which the token substring is derived, a position within a H dimensional vector based on said token, and a list of records contained within the reference table, wherein once a likely record in the reference table matches the input data record with a specified degree of certainty is found, further searching for reference records in the reference table is stopped. - View Dependent Claims (33)
-
-
34. A machine readable medium including instructions for evaluating an input data record having attribute fields containing data by steps of:
-
accessing a reference table having a number of records organized into attribute fields against which an input data record is evaluated; evaluating records in the reference table to identify tokens from said attribute fields and then evaluating each token with a function to build a vector of token substrings that serve as a signature of the token; building an index table wherein each entry of the index table contains a token substring, a column of the reference table, a position within the vector, and a list of records contained within the reference table; and looking up records in the index table based on the contents of the input record, wherein once a likely reference record that matches the input data record with a specified degree of certainty is found, further searching for records in the reference table is stopped. - View Dependent Claims (35, 36, 37, 38, 39, 40)
-
-
41. A machine readable medium including instructions for evaluating an input data record having attribute fields containing data by steps of:
-
accessing a reference table having a number of records organized into attribute fields against which an input data record is evaluated; evaluating records in the reference table to identify tokens from said attribute fields and then evaluating each token with a function to build a vector of token substrings that serve as a signature of the token; building an index table wherein each entry of the index table contains a token substring, a column of the reference table, a position within the vector, and a list of records contained within the reference table; looking up records in the index table based on the contents of the input record; wherein a candidate record table is built and records from the index table are added to a candidate record table based on vector substring representations of the tokens of the input record; and wherein once a likely reference record that matches the input data record with a specified degree of certainty is found, further searching for records in the reference table is stopped.
-
-
42. A machine readable medium including instructions for evaluating an input data record having attribute fields containing data by steps of:
-
accessing a reference table having a number of records organized into attribute fields against which an input data record is evaluated; evaluating records in the reference table to identify tokens from said attribute fields and then evaluating each token with a function to build a vector of token substrings that serve as a signature of the token; building an index table wherein each entry of the index table contains a token substring, a column of the reference table, a position within the vector, and a list of records contained within the reference table; looking up records in the index table based on the contents of the input record; assigning a similarity score to said input data record in relation to a candidate set of reference records within the reference table based on a combination of; the number of common tokens of an evaluation field of the input data record and a corresponding field within a reference record from the reference table; the similarity of the tokens that are not the same in the evaluation field of the input data record and the corresponding field of the reference record from the reference table; and a weight of the tokens in the evaluation field of the input data record that is based on a count of the tokens from the corresponding field contained within the reference table; and maintaining a token frequency cache in a high speed access memory for use in assigning weights to said tokens.
-
Specification