Parsing information in data records and in different languages
Abstract
Embodiments of systems and methods for comparing attributes of a data record are presented herein. In some embodiments, a weight is based on a comparison of the name (or other) attributes of data records. In some embodiments, an information score may be calculated for each of two name attributes to be compared to get an average information score for the two name attributes. The two name attributes may then be compared against one another to generate a weight between the two attributes. This weight can then be normalized to generate a final weight between the two business name attributes. Comparing attributes according to embodiments disclosed herein can facilitate linking data records even if they comprise attributes in languages which do not use the Latin alphabet.
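The flow described in the abstract (score each attribute, average the two scores, compare the attributes to get a weight, then normalize) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the method of the embodiments: the frequency-based scoring (`-log2` of a token's relative frequency), the whitespace tokenizer, the exact-match comparison, the division by the average score, and every function and parameter name are hypothetical choices made for the example.

```python
import math
from collections import Counter


def tokenize(attribute: str) -> list[str]:
    # Assumption: case-fold and split on whitespace. This works for any
    # Unicode script that separates words with spaces (e.g. Cyrillic).
    return attribute.casefold().split()


def information_score(tokens: list[str], token_freq: dict[str, int], total: int) -> float:
    # Assumption: rarer tokens carry more information, measured as
    # -log2 of the token's relative frequency. Unseen tokens count as 1.
    return sum(-math.log2(token_freq.get(t, 1) / total) for t in tokens)


def normalized_weight(attr_a: str, attr_b: str, token_freq: dict[str, int], total: int) -> float:
    tokens_a, tokens_b = tokenize(attr_a), tokenize(attr_b)

    # Average information score of the two attributes.
    avg_info = (
        information_score(tokens_a, token_freq, total)
        + information_score(tokens_b, token_freq, total)
    ) / 2

    # Assumption: the raw weight is the information carried by the
    # tokens that the two attributes share (multiset intersection).
    shared = Counter(tokens_a) & Counter(tokens_b)
    weight = information_score(list(shared.elements()), token_freq, total)

    # Normalize the weight by the average information score.
    return weight / avg_info if avg_info else 0.0
```

Under these assumptions, two attributes with identical tokens score 1.0, attributes with no shared tokens score 0.0, and a partial overlap falls in between, so a linking threshold can be applied regardless of how long or how rare the names are.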
24 Claims
1. A computer-implemented method for comparing a first data record and a second data record, wherein the first and second data records are located in one or more data sources, the first data record comprises a first attribute and the second data record comprises a second attribute, the method comprising:
parsing the first and second attributes to produce a set of tokens for each of those attributes, wherein the data sources employ at least two different languages and at least one of the first and second attributes is expressed in a language employing other than a Latin alphabet;
calculating an average information score for the first attribute and the second attribute, wherein the average information score is calculated based upon a matching of tokens for each of the first and second attributes;
generating a weight for the first attribute and the second attribute; and
normalizing the weight based on the average information score;
wherein generating the weight comprises comparing each of a set of tokens of the first attribute to each of a set of tokens of the second attribute such that pairs of tokens are compared, and comparing each pair of tokens comprises:
determining a current match weight for a pair of tokens;
determining a first previous match weight corresponding to the pair of tokens;
determining a second previous match weight corresponding to the pair of tokens;
setting the weight to the current match weight in response to the current match weight being greater than the first previous match weight or the second previous match weight; and
setting the weight to the greater of the first previous match weight or the second previous match weight in response to either the first previous match weight or the second previous match weight being greater than the current match weight; and
linking the first data record and the second data record based on the normalized weight between the two attributes.
Dependent claims: 2-12
13. A computer program product for comparing a first data record and a second data record, wherein the first and second data records are located in one or more data sources, the first data record comprises a first attribute and the second data record comprises a second attribute, the computer program product comprising:
a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to:
parse the first and second attributes to produce a set of tokens for each of those attributes, wherein the data sources employ at least two different languages and at least one of the first and second attributes is expressed in a language employing other than a Latin alphabet;
calculate an average information score for the first attribute and the second attribute, wherein the average information score is calculated based upon a matching of tokens for each of the first and second attributes;
generate a weight for the first attribute and the second attribute; and
normalize the weight based on the average information score;
wherein generating the weight comprises comparing each of a set of tokens of the first attribute to each of a set of tokens of the second attribute such that pair of tokens are compared, and comparing each pair of tokens comprises:
determining a current match weight for a pair of tokens;
determining a first previous match weight corresponding to the pair of tokens;
determining a second previous match weight corresponding to the pair of tokens;
setting the weight to the current match weight in response to the current match weight being greater than the first previous match weight or the second previous match weight; and
setting the weight to the greater of the first previous match weight or the second previous match weight in response to either the first previous match weight or the second previous match weight being greater than the current match weight; and
link the first data record and the second data record based on the normalized weight between the two attributes.
Dependent claims: 14-23
24. A system for comparing data records, the system comprising:
at least one data source comprising a first data record and a second data record, wherein the first data record comprises a first attribute and the second data record comprises a second attribute, and wherein the at least one data source employs at least two different languages; and
a hub coupled with the at least one data source, the hub comprising a processor configured with logic to:
parse the first and second attributes to produce a set of tokens for each of those attributes, wherein at least one of the first and second attributes is expressed in a language employing other than a Latin alphabet;
calculate an average information score for the first attribute and the second attribute, wherein the average information score is calculated based upon a matching of tokens for each of the first and second attributes;
generate a weight for the first attribute and the second attribute; and
normalize the weight based on the average information score;
wherein generating the weight comprises comparing each of a set of tokens of the first attribute to each of a set of tokens of the second attribute such that pair of tokens are compared, and comparing each pair of tokens comprises:
determining a current match weight for a pair of tokens;
determining a first previous match weight corresponding to the pair of tokens;
determining a second previous match weight corresponding to the pair of tokens;
setting the weight to the current match weight in response to the current match weight being greater than the first previous match weight or the second previous match weight; and
setting the weight to the greater of the first previous match weight or the second previous match weight in response to either the first previous match weight or the second previous match weight being greater than the current match weight; and
link the first data record and the second data record based on the normalized weight between the two attributes.
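The pairwise token comparison recited in the independent claims (a current match weight and two previous match weights per token pair, with the weight set to the greatest of them) resembles a dynamic-programming table fill. The sketch below is one possible reading, not the claimed implementation. It assumes that the current match weight is the diagonal predecessor plus a score for the pair, and that the two previous match weights are the cells above and to the left. The names `pair_score` and `attribute_weight` and the exact-match scorer are hypothetical.

```python
from typing import Callable, Sequence


def pair_score(a: str, b: str) -> float:
    # Hypothetical scorer: 1.0 for an exact token match, otherwise 0.0.
    # A real scorer might weight tokens by rarity or allow near matches.
    return 1.0 if a == b else 0.0


def attribute_weight(
    tokens_a: Sequence[str],
    tokens_b: Sequence[str],
    score: Callable[[str, str], float] = pair_score,
) -> float:
    rows, cols = len(tokens_a), len(tokens_b)
    # table[i][j] holds the best weight using the first i tokens of
    # tokens_a and the first j tokens of tokens_b; row 0 and column 0
    # stay at 0.0 (nothing compared yet).
    table = [[0.0] * (cols + 1) for _ in range(rows + 1)]

    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # Assumed reading of "current match weight": the weight so
            # far on the diagonal plus the score of this token pair.
            current = table[i - 1][j - 1] + score(tokens_a[i - 1], tokens_b[j - 1])
            # Assumed reading of the two "previous match weights".
            first_previous = table[i - 1][j]
            second_previous = table[i][j - 1]
            # Keep whichever of the three is greatest.
            table[i][j] = max(current, first_previous, second_previous)

    return table[rows][cols]
```

With the exact-match scorer, the token lists `["東京", "銀行", "株式会社"]` and `["東京", "中央", "銀行"]` share two tokens in the same order, so the sketch returns 2.0. Taking the maximum of the three values also settles the case the claim wording leaves open, where the current match weight exceeds one previous weight but not the other.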
Specification