Robust Matching for Identity Screening

US 20170374093A1
Filed: 06/28/2016
Published: 12/28/2017
Est. Priority Date: 06/28/2016
Status: Active Grant



Abstract

The techniques described herein are directed to robust matching for identity screening. In some examples, the techniques generate a similarity score by comparing received identity information against a reference record. The techniques can use a region associated with the received identity information to weight the tokens that compose the identity information or the reference record, and thereby adjust the similarity score. The techniques can also include multiple tokenizers, transformation providers, and token weight providers, and can select among them based at least in part on a region to improve the accuracy of the similarity score. Based at least in part on the similarity score, the techniques can determine whether to flag or otherwise affirm the identity of an individual or entity associated with the identity information.
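
The region-based selection among multiple tokenizers can be illustrated with a small registry. This is a minimal sketch only: the region key `region-a`, the two tokenizer functions, and the "family name first" behavior are hypothetical and are not taken from the patent.

```python
import re
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Lowercase the text and split it on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def family_name_first_tokenizer(text: str) -> List[str]:
    """Hypothetical regional variant: the first token is assumed to be the
    family name and is moved to the end of the token list."""
    tokens = whitespace_tokenizer(text)
    return tokens[1:] + tokens[:1] if len(tokens) > 1 else tokens


# Hypothetical registry mapping a region identifier to a tokenizer.
TOKENIZERS: Dict[str, Tokenizer] = {
    "default": whitespace_tokenizer,
    "region-a": family_name_first_tokenizer,
}


def tokenize_for_region(query: str, region: str) -> List[str]:
    """Select a tokenizer by region, falling back to the default one."""
    tokenizer = TOKENIZERS.get(region, TOKENIZERS["default"])
    return tokenizer(query)
```

Under these assumptions, `tokenize_for_region("Zhang Wei", "region-a")` reorders the tokens, while an unlisted region falls back to plain splitting.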

110 Citations

20 Claims

1. A system comprising:
- one or more processors;
  
  multiple tokenizers configured to tokenize, by the one or more processors and based at least in part on an identified region, a query string to receive query tokens;
  
  a transformation provider configured to:
  
  generate, by the one or more processors, transformation rules based at least in part on the identified region and the query tokens;
  
  select, by the one or more processors, one or more of the transformation rules; and
  
  transform, by the one or more processors and based at least in part on the one or more transformation rules, the query tokens to obtain a query record including transformed tokens that account for regional token variations, the regional token variations being associated with the identified region; and
  
  a token weight provider configured to assign, by the one or more processors, token weights for the transformed tokens of the query record based at least in part on the identified region; and
  
  a comparer configured to determine, by the one or more processors and based at least in part on the token weights, similarity values between the transformed tokens of the query record and a reference record.
- - 2. A system as recited in claim 1, further comprising:
    - a signature generator configured to:
      
      generate, by the one or more processors, query signatures corresponding to the query record;
      
      index, by the one or more processors, the query signatures;
      
      generate, by the one or more processors, reference signatures corresponding to a reference record; and
      
      index, by the one or more processors, the reference signatures; and
      
      the comparer configured to:
      
      identify, by the one or more processors, candidate records from the index of signatures corresponding to the reference record, the candidate records corresponding to one or more of the reference signatures that have signatures within an edit distance of the query signatures; and
      
      determine, by the one or more processors, similarity values between the transformed tokens of the query record and the candidate records.
  - 3. A system as recited in claim 1, wherein a tokenizer of the multiple tokenizers is configured to:
    - tokenize query strings in a first language, tokenize query strings in a second language, or tokenize query strings in the first language for a different dialect or cultural context.
  - 4. A system as recited in claim 1, wherein one or more of the transformation rules are configured to transliterate, by the one or more processors and based at least in part on the identified region or an identified language of the query string, the query string to obtain transliterated query tokens.
  - 5. A system as recited in claim 1, wherein the transformation rules include a list of synonym pairs corresponding to one or more of regions or languages; and
    - wherein transforming the query tokens into the query record comprises populating, by the one or more processors, the query record with synonyms corresponding to a subset of the query tokens based at least in part on one or more of the identified region or an identified language.
  - 6. A system as recited in claim 5, wherein the synonym pairs further include synonym costs associated with the synonym pairs; and
    - wherein the transformation provider is further configured to calculate, by the one or more processors, transformation costs based on one or more of synonym costs of synonym pairs used to transform the query tokens or edit distances between the query tokens and corresponding transformed query tokens, the edit distances including a quantification of how dissimilar.
  - 7. A system as recited in claim 6, the comparer further configured to:
    - based at least in part on the transformation costs, modify, by the one or more processors, the similarity values to obtain modified similarity values; and
      
      average, by the one or more processors, the modified similarity values to receive a similarity score.
  - 8. A system as recited in claim 1, wherein the transformation provider selects the one or more transformation rules by:
    - ranking, by the one or more processors, the transformation rules based at least in part on the identified region and the query tokens; and
      
      selecting, by the one or more processors, a number of the transformation rules to receive the one or more transformation rules, where the number selected corresponds to a tolerated risk value.
  - 9. A system as recited in claim 1, the token weight provider assigning the token weights based at least in part on calculating an inverse document frequency of tokens included in the reference record.
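
The interplay of synonym pairs, synonym costs, and averaged similarity values in claims 5 through 7 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `SYNONYMS` table, the region key `region-a`, the cost values, the normalized edit similarity, and the `best * (1 - cost)` discount are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym rules per region: token -> (synonym, cost in [0, 1]).
SYNONYMS: Dict[str, Dict[str, Tuple[str, float]]] = {
    "region-a": {"bill": ("william", 0.1), "bob": ("robert", 0.1)},
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def transform(tokens: List[str], region: str) -> List[Tuple[str, float]]:
    """Return (transformed token, transformation cost) pairs for a region."""
    rules = SYNONYMS.get(region, {})
    return [rules.get(tok, (tok, 0.0)) for tok in tokens]


def token_similarity(a: str, b: str) -> float:
    """Edit similarity normalized to [0, 1] (an assumed similarity measure)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def similarity_score(query_tokens: List[str],
                     reference_tokens: List[str],
                     region: str) -> float:
    """Take the best reference match for each transformed query token,
    discount it by the transformation cost, and average the results."""
    record = transform(query_tokens, region)
    if not record or not reference_tokens:
        return 0.0
    modified = []
    for tok, cost in record:
        best = max(token_similarity(tok, ref) for ref in reference_tokens)
        modified.append(best * (1.0 - cost))
    return sum(modified) / len(modified)
```

With these assumed rules, the query tokens `["bill", "gates"]` score higher against `["william", "gates"]` for `region-a` than for a region without the synonym pair, because the synonym substitution costs less than the raw edit distance.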

10. A method comprising:
- identifying a region associated with a query string;
  
  based at least in part on the identified region and a language associated with the identified region:
  
  selecting, from multiple tokenizers, a tokenizer associated with the identified region or the language; and
  
  generating transformation rules based at least in part on the identified region and the language;
  
  tokenizing the query string by the tokenizer to receive query tokens;
  
  transforming, by one or more of the transformation rules, the query tokens to form a query record; and
  
  weighting tokens of the query record and weighting tokens of a reference record based at least in part on frequencies with which the tokens of the query record appear in the reference record and frequencies with which the tokens of the reference record appear in the reference record, respectively.
- - 11. A method as claim 10 recites, wherein the identified region is a first region and the tokenizer is a first tokenizer, the method further comprising:
    - identifying a second region associated with the query string;
      
      based at least in part on the second region and a language associated with the second region:
      
      selecting a second tokenizer associated with the second region and the language of the second region; and
      
      generating a second set of transformation rules based on the second region and the language of the second region;
      
      tokenizing the query string by the tokenizer to receive additional query tokens;
      
      transforming, by at least one of the transformation rules of the second set of transformation rules, the additional query tokens to form an addendum to the query record; and
      
      weighting tokens of the addendum based at least in part on frequencies with which the tokens of the addendum appear in the reference record.
  - 12. A method as claim 11 recites, wherein a language associated with the first region and a language associated with the second region are different.
  - 13. A method as claim 11 recites, further comprising:
    - retrieving a first reference record corresponding to the first region; and
      
      retrieving a second reference record corresponding to the second region.
  - 14. A method as claim 13 recites, wherein the transformation rules include rules to transliterate or translate one or more of the query tokens, the additional query tokens, the first reference record, or the second reference record.
  - 15. A method as claim 10 recites, further comprising:
    - determining a similarity score between a token of the query record and a token of the reference record by fuzzy matching tokens of the query record with tokens of the reference record, the fuzzy matching based at least in part on weights of the tokens of one or more of the query record or the reference record and transformation costs associated with the one or more transformation rules.
  - 16. A method as claim 15 recites, wherein a weight of a token of the query record or the reference record includes an inverse document frequency of appearance of the token in the reference record.
  - 17. A method as claim 10 recites, wherein the query string includes an identification of an entity and the reference record includes multiple entity identifications.
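
The frequency-based weighting of claims 10 and 16 can be sketched with an inverse document frequency computed over the entity entries of a reference record. This is a sketch under assumptions: the smoothed IDF formula, the `default_weight` for unseen tokens, and the weighted-overlap measure are illustrative choices, and exact token overlap stands in for the fuzzy matching of claim 15 to keep the example short.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_weights(reference_entries: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency of each token, where each
    reference entry (one tokenized entity) counts as one document."""
    n = len(reference_entries)
    doc_freq = Counter(tok for entry in reference_entries for tok in set(entry))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_overlap(query: List[str],
                     entry: List[str],
                     weights: Dict[str, float],
                     default_weight: float = 1.0) -> float:
    """Fraction of the combined token weight carried by shared tokens.
    Tokens missing from the weight table get the assumed default weight."""
    q, e = set(query), set(entry)
    total = sum(weights.get(t, default_weight) for t in q | e)
    shared = sum(weights.get(t, default_weight) for t in q & e)
    return shared / total if total else 0.0


# Hypothetical reference record with four tokenized entity entries.
reference = [
    ["john", "smith"],
    ["john", "doe"],
    ["ivan", "smith"],
    ["john", "kowalczyk"],
]
weights = idf_weights(reference)
```

Because `john` appears in three of the four entries, it receives a lower weight than `kowalczyk`, so agreeing on the rare token contributes more to the score than agreeing on the common one.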

18. A method comprising:
- receiving an identification;
  
  receiving a similarity score corresponding to a similarity between the received identification and an entity in a reference record, the similarity score:
  
  being weighted based at least in part on a region associated with the reference record and regional lingual traits associated with the region; and
  
  exceeding a score threshold; and
  
  affirming that the received identification corresponds to an entity associated with the identification in the reference record based at least in part on the similarity score; and
  
  flagging the received identification based at least in part on the affirmation.
- - 19. A method as recited in claim 18, wherein the score threshold corresponds to a threshold probability that the affirmation is a true positive and wherein flagging the received identification is an indication to either deny or grant a request by an entity supplying the received identification.
  - 20. A method as recited in claim 18, wherein the region associated with the reference record is different than a region associated with the received identification and wherein the regional lingual traits associated with the region include at least one of:
    - synonyms of the received identification related to the region;
      
      transliterations of the received identification related to the region;
      
      translations of the received identification related to the region;
      
      other variations of the received identification related to the region;
      
      or a frequency with which the received identification appears in the reference record or a corpus of documents.
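
The threshold, affirmation, and flagging steps of claims 18 and 19 can be sketched as a small decision function. This is illustrative only: the `ScreeningResult` type, the default threshold of 0.85, and the `on_flag` parameter are hypothetical, and the similarity score is assumed to have been computed elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScreeningResult:
    identification: str
    matched_entity: str
    score: float
    affirmed: bool          # score exceeded the threshold
    flagged: bool           # flag raised based on the affirmation
    action: Optional[str]   # what the flag indicates, e.g. "deny" or "grant"


def screen(identification: str,
           matched_entity: str,
           similarity_score: float,
           score_threshold: float = 0.85,
           on_flag: str = "deny") -> ScreeningResult:
    """Affirm the match only if the score strictly exceeds the threshold,
    then flag the identification based on that affirmation."""
    affirmed = similarity_score > score_threshold
    return ScreeningResult(
        identification=identification,
        matched_entity=matched_entity,
        score=similarity_score,
        affirmed=affirmed,
        flagged=affirmed,
        action=on_flag if affirmed else None,
    )
```

A caller screening against a watch list might pass `on_flag="deny"`, while a caller verifying a known customer might pass `on_flag="grant"`; in both cases a score at or below the threshold leaves the identification unflagged.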


Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors
Dhar, Surajit; Dhingra, Disha; Ganjam, Kris K.

Granted Patent

US 10,200,397 B2
CPC Class Codes

G06F 16/332   Query formulation

G06F 16/3334   Selection or weighting of t...

G06F 21/30   Authentication, i.e. establ...

G06F 40/284   Lexical analysis, e.g. toke...

G06Q 50/265   Personal security, identity...

H04L 63/1433   Vulnerability analysis
