Parsing information in data records and in different languages

US 8,321,393 B2
Filed: 12/31/2007
Issued: 11/27/2012
Est. Priority Date: 03/29/2007
Status: Expired due to Fees

Abstract

Embodiments of systems and methods for comparing attributes of a data record are presented herein. In some embodiments, a weight is based on a comparison of the name (or other) attributes of data records. In some embodiments, an information score may be calculated for each of two name attributes to be compared, yielding an average information score for the two name attributes. The two name attributes may then be compared against one another to generate a weight between the two attributes. This weight can then be normalized to generate a final weight between the two name attributes. Comparing attributes according to embodiments disclosed herein can facilitate linking data records even if they comprise attributes in languages which do not use the Latin alphabet.
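The scoring flow in the abstract (an information score per attribute, their average, then normalization of the comparison weight) can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how an information score is computed, so the sketch assumes the negative base-2 logarithm of a smoothed token frequency, and it assumes normalization is a simple ratio. The names `token_information`, `information_score`, `average_information_score`, and `normalize_weight` are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def token_information(token: str, freq: Counter, total: int) -> float:
    """Assumed information content: -log2 of the token's smoothed relative frequency."""
    # Add-one smoothing (an assumption) keeps unseen tokens finite and high-scoring.
    p = (freq[token] + 1) / (total + len(freq) + 1)
    return -math.log2(p)


def information_score(tokens: Sequence[str], freq: Counter, total: int) -> float:
    """Sum of the per-token information for one attribute."""
    return sum(token_information(t, freq, total) for t in tokens)


def average_information_score(first: Sequence[str], second: Sequence[str],
                              freq: Counter, total: int) -> float:
    """Mean of the two attributes' information scores."""
    return (information_score(first, freq, total)
            + information_score(second, freq, total)) / 2


def normalize_weight(raw_weight: float, avg_info: float) -> float:
    """Assumed ratio form: the raw comparison weight divided by the average score."""
    return raw_weight / avg_info if avg_info > 0 else 0.0
```

Under these assumptions a rare token such as a distinctive company name contributes more information than a common suffix, and two identical attributes whose raw weight equals their information score normalize to 1.0.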

24 Claims

1. A computer-implemented method for comparing a first data record and a second data record, wherein the first and second data records are located in one or more data sources, the first data record comprises a first attribute and the second data record comprises a second attribute, the method comprising:
- parsing the first and second attributes to produce a set of tokens for each of those attributes, wherein the data sources employ at least two different languages and at least one of the first and second attributes is expressed in a language employing other than a Latin alphabet;
- calculating an average information score for the first attribute and the second attribute, wherein the average information score is calculated based upon a matching of tokens for each of the first and second attributes;
- generating a weight for the first attribute and the second attribute; and
- normalizing the weight based on the average information score;
- wherein generating the weight comprises comparing each of a set of tokens of the first attribute to each of a set of tokens of the second attribute such that pairs of tokens are compared, and comparing each pair of tokens comprises:
  - determining a current match weight for a pair of tokens;
  - determining a first previous match weight corresponding to the pair of tokens;
  - determining a second previous match weight corresponding to the pair of tokens;
  - setting the weight to the current match weight in response to the current match weight being greater than the first previous match weight or the second previous match weight; and
  - setting the weight to the greater of the first previous match weight or the second previous match weight in response to either the first previous match weight or the second previous match weight being greater than the current match weight; and
- linking the first data record and the second data record based on the normalized weight between the two attributes.
- - 2. The method of claim 1, wherein the pairs of tokens are compared in an order of the set of tokens of the first attribute.
  - 3. The method of claim 2, wherein comparing each pair of tokens further comprises determining whether a first token or a second token of the pair of tokens is an acronym.
  - 4. The method of claim 3, wherein comparing each pair of tokens further comprises determining a set of pairs of tokens corresponding to the acronym in response to the first token or the second token being determined as an acronym and comparing each of the set of pairs before comparing any other pair of tokens.
  - 5. The method of claim 4, wherein determining a current match weight for each comparison of a pair of tokens comprises:
    - determining whether there is a match between the pair of tokens;
    - determining a current match weight in response to a determination that there is a match between the pair of tokens;
    - setting the current match weight to zero in response to a determination that there is not a match between the pair of tokens; and
    - adjusting the current match weight for the pair of tokens by a third previous match weight.
  - 6. The method of claim 5, wherein the first previous match weight is generated based on a first previous token corresponding to a first token of the pair of tokens.
  - 7. The method of claim 6, wherein the second previous match weight is generated based on a second previous token corresponding to a second token of the pair of tokens.
  - 8. The method of claim 7, wherein determining whether a match exists comprises determining whether there is an exact match, an initial match, a phonetic match, a nickname match, a nickname-phonetic match or an edit distance match.
  - 9. The method of claim 8, wherein determining a current match weight for each comparison of a pair of tokens further comprises:
    - determining a first exact match weight for a first token of the pair of tokens;
    - using the first exact match weight as the current match weight in response to a determination of an exact match between the pair of tokens; and
    - in response to a determination of an initial match, a phonetic match or an edit distance match between the pair of tokens, determining the second exact match weight for the second token of the pair, taking the lesser of the first exact match weight and the second exact match weight as an initial match weight and applying a penalty to the initial match weight to generate the current match weight.
  - 10. The method of claim 9, wherein determining a current match weight further comprises determining whether to apply a distance penalty to the current match weight.
  - 11. The method of claim 10, wherein the determination of whether to apply a distance penalty is based on a difference between a last match position and a position of the current tokens.
  - 12. The method of claim 11, wherein the distance penalty is based on the difference.
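The pairwise comparison recited in claims 1, 5 through 7, and 9 reads like a dynamic-programming table, and can be sketched as below. This is a hedged illustration, not the patented implementation. It assumes the "first previous" weight is the cell for the preceding token of the first attribute, the "second previous" weight is the cell for the preceding token of the second attribute, and the "third previous" weight is the diagonal cell. Only exact and initial matches are sketched, the penalty is assumed to be multiplicative, and `exact_weight`, `current_match_weight`, and `attribute_weight` are hypothetical names.

```python
from typing import Dict, Sequence


def exact_weight(token: str, weights: Dict[str, float]) -> float:
    """Hypothetical lookup of a token's exact-match weight (1.0 if unknown)."""
    return weights.get(token, 1.0)


def current_match_weight(a: str, b: str, weights: Dict[str, float],
                         penalty: float = 0.5) -> float:
    """Weight for one token pair; phonetic, nickname and edit-distance matches are omitted."""
    if a == b:
        # Exact match: use the first token's exact-match weight.
        return exact_weight(a, weights)
    # Initial match: one token is a single character equal to the other's first character.
    if (len(a) == 1 or len(b) == 1) and a[:1] == b[:1]:
        # Lesser of the two exact-match weights, reduced by an assumed multiplicative penalty.
        return min(exact_weight(a, weights), exact_weight(b, weights)) * penalty
    return 0.0  # no match


def attribute_weight(first: Sequence[str], second: Sequence[str],
                     weights: Dict[str, float]) -> float:
    """Fill the table in the order of the first attribute's tokens; return the final cell."""
    rows, cols = len(first), len(second)
    table = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # Current weight, adjusted by the assumed "third previous" (diagonal) weight.
            current = (current_match_weight(first[i - 1], second[j - 1], weights)
                       + table[i - 1][j - 1])
            first_prev = table[i - 1][j]   # previous token of the first attribute
            second_prev = table[i][j - 1]  # previous token of the second attribute
            table[i][j] = max(current, first_prev, second_prev)
    return table[rows][cols]
```

With this formulation tokens that match in the same order accumulate weight along the diagonal, while transposed tokens only retain the single best match, which is one plausible reading of why the claims keep the greater of the current and previous weights.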

13. A computer program product for comparing a first data record and a second data record, wherein the first and second data records are located in one or more data sources, the first data record comprises a first attribute and the second data record comprises a second attribute, the computer program product comprising:
- a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to:
  - parse the first and second attributes to produce a set of tokens for each of those attributes, wherein the data sources employ at least two different languages and at least one of the first and second attributes is expressed in a language employing other than a Latin alphabet;
  - calculate an average information score for the first attribute and the second attribute, wherein the average information score is calculated based upon a matching of tokens for each of the first and second attributes;
  - generate a weight for the first attribute and the second attribute; and
  - normalize the weight based on the average information score;
  - wherein generating the weight comprises comparing each of a set of tokens of the first attribute to each of a set of tokens of the second attribute such that pairs of tokens are compared, and comparing each pair of tokens comprises:
    - determining a current match weight for a pair of tokens;
    - determining a first previous match weight corresponding to the pair of tokens;
    - determining a second previous match weight corresponding to the pair of tokens;
    - setting the weight to the current match weight in response to the current match weight being greater than the first previous match weight or the second previous match weight; and
    - setting the weight to the greater of the first previous match weight or the second previous match weight in response to either the first previous match weight or the second previous match weight being greater than the current match weight; and
  - link the first data record and the second data record based on the normalized weight between the two attributes.
- - 14. The computer program product of claim 13, wherein the pairs of tokens are compared in an order of the set of tokens of the first attribute.
  - 15. The computer program product of claim 14, wherein comparing each pair of tokens further comprises determining whether a first token or a second token of the pair of tokens is an acronym.
  - 16. The computer program product of claim 15, wherein comparing each pair of tokens further comprises determining a set of pairs of tokens corresponding to the acronym in response to the first token or the second token being determined as an acronym and comparing each of the set of pairs before comparing any other pair of tokens.
  - 17. The computer program product of claim 16, wherein determining a current match weight for each comparison of a pair of tokens comprises:
    - determining whether there is a match between the pair of tokens;
    - determining a current match weight in response to a determination that there is a match between the pair of tokens;
    - setting the current match weight to zero in response to a determination that there is not a match between the pair of tokens; and
    - adjusting the current match weight for the pair of tokens by a third previous match weight.
  - 18. The computer program product of claim 17, wherein the first previous match weight is generated based on a first previous token corresponding to a first token of the pair of tokens.
  - 19. The computer program product of claim 18, wherein the second previous match weight is generated based on a second previous token corresponding to a second token of the pair of tokens.
  - 20. The computer program product of claim 19, wherein determining if a match exists comprises determining whether there is an exact match, an initial match, a phonetic match, a nickname match, a nickname-phonetic match or an edit distance match.
  - 21. The computer program product of claim 20, wherein determining a current match weight for each comparison of a pair of tokens further comprises:
    - determining a first exact match weight for a first token of the pair of tokens;
    - using the first exact match weight as the current match weight in response to a determination of an exact match between the pair of tokens; and
    - in response to a determination of an initial match, a phonetic match or an edit distance match between the pair of tokens, determining a second exact match weight for the second token of the pair, taking the lesser of the first exact match weight and the second exact match weight as an initial match weight and applying a penalty to the initial match weight to generate the current match weight.
  - 22. The computer program product of claim 21, wherein determining a current match weight further comprises determining whether to apply a distance penalty to the current match weight.
  - 23. The computer program product of claim 22, wherein the determination of whether to apply a distance penalty is based on a difference between a last match position and a position of the current tokens.
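The distance penalty of claims 22 and 23 (and of claims 10 through 12 on the method side) depends on the difference between the last match position and the position of the current tokens. A minimal sketch follows, assuming positions are (row, column) index pairs, that gaps up to `max_gap` are free, and that the penalty grows linearly with the excess gap. The function name, both parameters, and the linear form are assumptions; the claims only state that the decision, and in claim 12 the penalty itself, is based on the difference.

```python
from typing import Tuple


def apply_distance_penalty(weight: float,
                           last_match_pos: Tuple[int, int],
                           current_pos: Tuple[int, int],
                           max_gap: int = 1,
                           per_step: float = 0.25) -> float:
    """Reduce a match weight when the current pair lies far from the last matched pair."""
    # Difference between the last match position and the current position,
    # taken as the larger of the row gap and the column gap (an assumption).
    gap = max(abs(current_pos[0] - last_match_pos[0]),
              abs(current_pos[1] - last_match_pos[1]))
    if gap <= max_gap:
        return weight  # adjacent match: no penalty applied
    # Penalty proportional to the excess gap, floored at zero.
    return max(0.0, weight - per_step * (gap - max_gap))
```

For example, with the defaults a weight of 2.0 is unchanged when the current pair immediately follows the last match, and drops to 1.5 when the pair is three positions away.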

24. A system for comparing data records, the system comprising:
- at least one data source comprising a first data record and a second data record, wherein the first data record comprises a first attribute and the second data record comprises a second attribute, and wherein the at least one data source employs at least two different languages; and
- a hub coupled with the at least one data source, the hub comprising a processor configured with logic to:
  - parse the first and second attributes to produce a set of tokens for each of those attributes, wherein at least one of the first and second attributes is expressed in a language employing other than a Latin alphabet;
  - calculate an average information score for the first attribute and the second attribute, wherein the average information score is calculated based upon a matching of tokens for each of the first and second attributes;
  - generate a weight for the first attribute and the second attribute; and
  - normalize the weight based on the average information score;
  - wherein generating the weight comprises comparing each of a set of tokens of the first attribute to each of a set of tokens of the second attribute such that pairs of tokens are compared, and comparing each pair of tokens comprises:
    - determining a current match weight for a pair of tokens;
    - determining a first previous match weight corresponding to the pair of tokens;
    - determining a second previous match weight corresponding to the pair of tokens;
    - setting the weight to the current match weight in response to the current match weight being greater than the first previous match weight or the second previous match weight; and
    - setting the weight to the greater of the first previous match weight or the second previous match weight in response to either the first previous match weight or the second previous match weight being greater than the current match weight; and
  - link the first data record and the second data record based on the normalized weight between the two attributes.
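The parse and link steps that bracket claim 24 can be illustrated with a short sketch. It assumes tokens are separated by whitespace or punctuation after Unicode NFKC normalization and case folding, and that two records are linked when their normalized weight meets a threshold. The claims only say that linking is "based on the normalized weight", so the threshold rule, `parse_attribute`, and `link_records` are hypothetical.

```python
import unicodedata
from typing import Iterable, List, Tuple


def parse_attribute(value: str) -> List[str]:
    """Split an attribute into tokens without assuming a Latin alphabet."""
    # NFKC folds compatibility forms (for example full-width letters and
    # ideographic spaces); casefold() removes case distinctions where they exist.
    normalized = unicodedata.normalize("NFKC", value).casefold()
    # Treat every non-alphanumeric character as a separator.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in normalized)
    return cleaned.split()


def link_records(scored_pairs: Iterable[Tuple[str, str, float]],
                 threshold: float) -> List[Tuple[str, str]]:
    """Return the record-id pairs whose normalized weight meets the assumed threshold."""
    return [(a, b) for a, b, weight in scored_pairs if weight >= threshold]
```

A call such as `parse_attribute("Международные Бизнес-Машины")` yields three lowercase Cyrillic tokens. Scripts written without spaces, or with combining marks that are not alphanumeric, would need a language-specific segmenter in place of this simple split.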

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Adams, Norman S.; Schumacher, Scott
Primary Examiner(s): Ruiz, Angelica
Application Number: US11/967,588
Publication Number: US 20080243832A1
Time in Patent Office: 1,793 Days
Field of Search: None
US Class Current: 707/705
CPC Class Codes:
G06F 16/334   Query execution G06F16/335 ...
G06F 40/129   Handling non-Latin characte...
G06F 40/163   Handling of whitespace
