Reducing comparisons for token-based entity resolution
Abstract
A token-based database management system described herein may reduce an amount of comparisons during entity resolution of records. The system includes a token creator configured to create tokens from records, a token-record mapping creator configured to create a token-record mapping of tokens to records, a token importance calculator configured to calculate token importance values for the tokens, a token pruner configured to identify a token of the current record as unimportant based on token importance values of the tokens of the current record, and to remove the unimportant token from the token-record mapping, a record selector configured to select only records sharing at least one common token with the current record, and a record comparator configured to compare the current record with each of the selected records to determine whether the current record matches any of the selected records.
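The token-record mapping and record selector described in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization and integer record ids; the function names, the sample records, and the tokenization scheme are hypothetical and are not taken from the patent. Token pruning is left out here to keep the selection step isolated.

```python
from collections import defaultdict


def create_tokens(record: str) -> set[str]:
    """Token creator: lowercase, whitespace-split tokens (an assumed scheme)."""
    return set(record.lower().split())


def build_token_record_mapping(records: dict[int, str]) -> dict[str, set[int]]:
    """Token-record mapping: token -> ids of the records that contain it."""
    mapping: dict[str, set[int]] = defaultdict(set)
    for record_id, text in records.items():
        for token in create_tokens(text):
            mapping[token].add(record_id)
    return dict(mapping)


def select_records(current_id: int, records: dict[int, str],
                   mapping: dict[str, set[int]]) -> set[int]:
    """Record selector: ids of other records sharing at least one token."""
    selected: set[int] = set()
    for token in create_tokens(records[current_id]):
        selected |= mapping.get(token, set())
    selected.discard(current_id)  # a record is never compared with itself
    return selected


# Hypothetical sample data.
records = {
    1: "Acme Corp Berlin",
    2: "ACME Corporation Berlin",
    3: "Globex Inc Paris",
    4: "Initech LLC Austin",
}
mapping = build_token_record_mapping(records)
candidates = select_records(1, records, mapping)
print(sorted(candidates))  # [2]
```

In this toy data set, record 1 is compared with one record instead of the three that an all-pairs comparison would require.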
17 Claims
1. A system for reducing an amount of comparisons during entity resolution of records, the system comprising:
an in-memory database system configured to store a plurality of records; and
token-based entity resolution circuitry configured to determine whether a current record is similar to one or more other records in the database, the token-based entity resolution circuitry including:
a token creator configured to create tokens from the plurality of records;
a token-record mapping creator configured to create a token-record mapping of tokens to records;
a token importance calculator configured to calculate token importance values for the tokens, each token importance value representing a level of amount of information contained within a respective token;
a token pruner configured to identify a token of the current record as unimportant based on token importance values of the tokens of the current record, the token pruner configured to remove the unimportant token from the token-record mapping, the identification and removal of the unimportant token comprising:
identifying a token having a highest token importance value within the current record;
marking at least one token as unimportant when a token importance value of the at least one token is less than a predetermined threshold relative to the highest token importance value in the current record; and
removing the at least one unimportant token from the token-record mapping such that records linked to the at least one unimportant token are not selected for comparison with the current record; and
a record selector configured to select only records sharing at least one common token with the current record such that the at least one common token does not include the token identified as unimportant; and
a record comparator configured to compare the current record with each of the selected records to determine whether the current record matches any of the selected records.
Dependent claims: 2, 3, 4, 5, 6, 7
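The pruning rule recited in claim 1 can be sketched as follows. The claim does not fix a formula for token importance or for the "predetermined threshold relative to the highest token importance value", so this sketch assumes inverse document frequency as the importance measure and a fixed ratio of the highest importance as the cutoff. All names, the ratio of 0.3, and the sample mapping are hypothetical.

```python
import math


def token_importance(mapping: dict[str, set[int]], total_records: int) -> dict[str, float]:
    """Assumed importance measure: IDF = log(N / number of records with the token)."""
    return {tok: math.log(total_records / len(ids)) for tok, ids in mapping.items()}


def find_unimportant(current_tokens: set[str], importance: dict[str, float],
                     ratio: float = 0.3) -> set[str]:
    """Mark tokens whose importance is below ratio * highest importance in the record."""
    highest = max(importance[t] for t in current_tokens)
    cutoff = ratio * highest  # assumed reading of "threshold relative to the highest value"
    return {t for t in current_tokens if importance[t] < cutoff}


def remove_from_mapping(mapping: dict[str, set[int]], unimportant: set[str]) -> dict[str, set[int]]:
    """Return a copy of the token-record mapping without the unimportant tokens."""
    return {tok: ids for tok, ids in mapping.items() if tok not in unimportant}


def select_records(current_id: int, current_tokens: set[str],
                   mapping: dict[str, set[int]]) -> set[int]:
    """Select other records that share at least one token still present in the mapping."""
    selected: set[int] = set()
    for tok in current_tokens:
        selected |= mapping.get(tok, set())
    selected.discard(current_id)
    return selected


# Hypothetical mapping over 10 records; "gmbh" is common and carries little information.
mapping = {
    "acme": {1, 2},
    "berlin": {1, 2, 3, 4, 5},
    "gmbh": {1, 2, 3, 4, 5, 6, 7, 8, 9},
    "paris": {10},
}
current_tokens = {"acme", "berlin", "gmbh"}  # tokens of record 1

importance = token_importance(mapping, total_records=10)
unimportant = find_unimportant(current_tokens, importance)
pruned_mapping = remove_from_mapping(mapping, unimportant)

before = select_records(1, current_tokens, mapping)
after = select_records(1, current_tokens, pruned_mapping)
print(sorted(unimportant))     # ['gmbh']
print(len(before), len(after))  # 8 4
```

Because the cutoff is tied to the highest importance within the current record, a record made up only of common tokens would still keep its most informative token under this reading.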
8. A non-transitory computer-readable medium storing executable instructions that when executed cause at least one processor to:
create tokens from a plurality of records stored in a relational database comprising an in-memory database system;
create a token-record mapping of tokens to records;
calculate token importance values for the tokens, each token importance value representing a level of amount of information contained within a respective token;
identify a token of a current record as unimportant based on token importance values of the tokens of the current record;
remove the unimportant token from the token-record mapping, the identification and removal of the unimportant token comprising:
identifying a token having a highest token importance value within the current record;
marking at least one token as unimportant when a token importance value of the at least one token is less than a predetermined threshold relative to the highest token importance value in the current record; and
removing the at least one unimportant token from the token-record mapping such that records linked to the at least one unimportant token are not selected for comparison with the current record;
select only records sharing at least one common token with the current record such that the at least one common token does not include the token identified as unimportant; and
compare the current record with each of the selected records to determine whether the current record matches any of the selected records.
Dependent claims: 9, 10, 11, 12
13. A computer-implemented method for entity resolution, the method comprising:
creating tokens from a plurality of records stored in a relational database comprising an in-memory database system;
creating a token-record mapping of tokens to records;
calculating token importance values for the tokens, each token importance value representing a level of amount of information contained within a respective token;
identifying a token of a current record as unimportant based on token importance values of the tokens of the current record;
removing the unimportant token from the token-record mapping, the identification and removal of the unimportant token comprising:
identifying a token having a highest token importance value within the current record;
marking at least one token as unimportant when a token importance value of the at least one token is less than a predetermined threshold relative to the highest token importance value in the current record; and
removing the at least one unimportant token from the token-record mapping such that records linked to the at least one unimportant token are not selected for comparison with the current record;
selecting only records sharing at least one common token with the current record such that the at least one common token does not include the token identified as unimportant; and
comparing the current record with each of the selected records to determine whether the current record matches any of the selected records.
Dependent claims: 14, 15, 16, 17
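The final comparing step of the method is not tied to a particular similarity measure in the claims shown here. As one possible illustration, the sketch below assumes Jaccard similarity over token sets and a match cutoff of 0.5; the measure, the cutoff, the function names, and the sample data are all hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (assumed comparison measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def compare_with_selected(current_id: int, selected_ids: set[int],
                          tokens_by_record: dict[int, set[str]],
                          match_threshold: float = 0.5) -> list[tuple[int, float]]:
    """Compare the current record only with the selected records; return the matches."""
    current = tokens_by_record[current_id]
    matches = []
    for record_id in sorted(selected_ids):
        score = jaccard(current, tokens_by_record[record_id])
        if score >= match_threshold:
            matches.append((record_id, score))
    return matches


# Hypothetical token sets; records 2 and 3 are the ones chosen by the selection step.
tokens_by_record = {
    1: {"acme", "corp", "berlin"},
    2: {"acme", "corporation", "berlin"},
    3: {"berlin", "bakery", "mitte"},
}
matches = compare_with_selected(1, {2, 3}, tokens_by_record)
print(matches)  # [(2, 0.5)]
```

The comparator only ever sees the ids passed in by the selector, so any reduction achieved by pruning carries through directly to the number of similarity computations.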
Specification