Methods and systems for implementing approximate string matching within a database

US 7,925,652 B2
Filed: 12/31/2007
Issued: 04/12/2011
Est. Priority Date: 12/31/2007
Status: Active Grant

Abstract

A computer-based method for character string matching of a candidate character string with a plurality of character string records stored in a database is provided. The method includes identifying a set of reference character strings in the database, wherein the reference character strings are identified utilizing an optimization search for a set of dissimilar character strings, and generating an n-gram representation for one of the reference character strings in the set of reference character strings. The method also includes generating an n-gram representation for the candidate character string, determining a similarity between the n-gram representations, and indexing the candidate character string within the database based on the determined similarities between the n-gram representation of the candidate character string and the reference character strings in the identified set.
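
As an illustration of the n-gram representation and similarity described in the abstract, the following minimal Python sketch counts overlapping n-grams and compares two count vectors with the dot product divided by the product of the magnitudes, as recited in the claims. The function names, the choice of n = 2, and the upper-casing step are assumptions made for the example; the patent text shown here does not specify them.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count every overlapping n-gram in an upper-cased string (n=2 is assumed)."""
    s = text.upper()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Return dot product / (magnitude A x magnitude B) for two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    mag_a = sqrt(sum(c * c for c in a.values()))
    mag_b = sqrt(sum(c * c for c in b.values()))
    if mag_a == 0 or mag_b == 0:
        return 0.0  # strings shorter than n have no n-grams
    return dot / (mag_a * mag_b)
```

Two strings that share most of their n-grams score close to 1, and strings with no n-gram in common score 0.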

15 Claims

1. A computer-based method for character string matching of a candidate character string with a plurality of character string records stored within a database, said method comprising:
- identifying a set of dissimilar character strings in the plurality of character string records stored in the database utilizing an optimization search to generate a set of dissimilar reference character strings;
  
  computing a two-dimensional vector containing a frequency of occurrence of all unique n-grams in the candidate character string and a frequency of occurrence of all unique n-grams in the reference character string;
  
  computing a similarity metric for the candidate character string, with respect to the reference character string, based on the two-dimensional vector;
  
  determining a magnitude of the vector associated with the candidate character string as magnitude A;
  
  determining a magnitude of the vector associated with the reference character string as magnitude B;
  
  computing a dot product between the two vectors;
  
  computing the similarity metric according to (dot product/(magnitude A × magnitude B));
  
  generating a binary index for each character string record stored within the database based on a comparison of an n-gram representation of a selected one of the character strings in the character string record and an n-gram representation of each of the set of dissimilar reference character strings, wherein an i-th bit of the binary index represents a degree of matching of the candidate string with the i-th reference character string;
  
  generating a binary index for a respective one of a candidate character string in a candidate character string record;
  
  for only each character string record stored within the database whose binary index exactly matches the binary index of the candidate character string, locating each character string record whose selected character string matches the respective character string of the candidate string record; and
  
  indexing the candidate character string record within the database based on the matching.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8)
  - 2. A computer-based method according to claim 1 wherein computing a similarity metric for the candidate character string comprises using a structured query language calculation to compare contents of the two-dimensional vector.
  - 3. A computer-based method according to claim 1 wherein computing a similarity metric comprises implementing an n-gram frequency similarity calculation in ASCII structured query language.
  - 4. A computer-based method according to claim 3 further comprising using the n-gram frequency similarity computation to form a binary key that indicates a similarity between the candidate character string and each of the identified reference character strings.
  - 5. A computer-based method according to claim 1 wherein indexing the candidate character string within the database comprises:
    - implementing an n-gram frequency similarity calculation;
      
      using the calculation to form binary keys that indicates a similarity between a record associated with the candidate character string and records associated with each of the identified reference character strings;
      
      joining records that share the same binary key value; and
      
      sorting the joined records by relevance by summing the products of the frequency weights of all matching n-grams.
  - 6. A computer-based method according to claim 1 wherein indexing the candidate character string comprises generating a matrix of similarity metrics for the candidate character string as compared to the set of reference character strings.
  - 7. A computer-based method according to claim 1 wherein indexing the candidate character string comprises:
    - assigning a binary key corresponding to the reference character string a value of 1 if the similarity metric is above a predefined threshold; and
      
      assigning a binary key corresponding to the reference character string a value of 0 if the similarity metric is below the predefined threshold.
  - 8. A computer-based method according to claim 1 wherein identifying a set of dissimilar character strings in the plurality of character string records stored in the database comprises using a principal components factor analysis to identify a set of dissimilar character string records.
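
The binary index of claim 1, with the thresholding of claim 7, might be sketched as follows. This is an illustrative reading rather than the patented implementation: the reference strings, the threshold of 0.5, the bigram size, and the function names are all hypothetical, and the claims do not say how a similarity exactly equal to the threshold is treated.

```python
from collections import Counter
from math import sqrt


def similarity(x: str, y: str, n: int = 2) -> float:
    """Cosine similarity of the n-gram frequency vectors of x and y."""
    a = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    b = Counter(y[i:i + n] for i in range(len(y) - n + 1))
    dot = sum(c * b[g] for g, c in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def binary_index(text: str, references: list[str], threshold: float = 0.5) -> str:
    """Bit i is '1' when text scores above the threshold against the i-th reference."""
    return "".join("1" if similarity(text, ref) > threshold else "0" for ref in references)


# Hypothetical reference set; the claims obtain it from an optimization search.
references = ["STARBUCKS", "WALMART", "SHELL OIL"]
print(binary_index("STARBUCKS COFFEE", references))  # prints 100
```

Records and candidates that produce the same bit string fall into the same group, so the detailed comparison only has to run within that group.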

9. A computer programmed to:
- identify a set of dissimilar reference character strings in a database utilizing an optimization search;
  
  compute a two-dimensional vector containing a frequency of occurrence of all unique n-grams in the candidate character string and all unique n-grams in one of the reference character strings for each of the reference character string;
  
  compute a similarity metric for the candidate character string, with respect to the reference character string, based on the two-dimensional vectors;
  
  determine a magnitude of the vector associated with the candidate character string as magnitude A;
  
  determine a magnitude of the vector associated with the reference character string as magnitude B;
  
  compute a dot product between the two vectors;
  
  compute the similarity metric according to (dot product/(magnitude A × magnitude B));
  
  generate a binary index for each character string record stored within the database based on a comparison of an n-gram representation of a selected one of the character strings in the character string record and an n-gram representation of each of the set of dissimilar reference character strings, wherein an i-th bit of the binary index represents a degree of matching of the candidate string with the i-th reference character string;
  
  generate a binary index for a respective one of a candidate character string in a candidate character string record;
  
  for only each character string record stored within the database whose binary index exactly matches the binary index of the candidate character string, locate each character string record whose selected character string matches the respective character string of the candidate string record; and
  
  index the candidate character string record within the database based on the matching.
- Dependent Claims (10, 11, 12)
  - 10. A computer according to claim 9 wherein to compute the similarity metric, said computer is programmed to utilize a structured query language calculation to compare contents of the two-dimensional vectors.
  - 11. A computer according to claim 9 wherein said computer is programmed to:
    - assign a binary key corresponding to the reference character string a value of 1 if the determined similarity is above a predefined threshold; and
      
      assign a binary key corresponding to the reference character string a value of 0 if the determined similarity is below the predefined threshold.
  - 12. A computer according to claim 9 wherein said computer is programmed to use a principal components factor analysis to identify a set of dissimilar reference character strings in the database.
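
The claims name an optimization search, and in claims 8 and 12 a principal components factor analysis, for choosing dissimilar reference strings, but the text shown here gives no procedure. The sketch below substitutes a simple greedy heuristic purely for illustration: starting from the first string, it repeatedly adds the string whose highest similarity to the already chosen strings is lowest. It is not the factor analysis of the claims, and every name and parameter in it is an assumption.

```python
from collections import Counter
from math import sqrt


def similarity(x: str, y: str, n: int = 2) -> float:
    """Cosine similarity of the n-gram frequency vectors of x and y."""
    a = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    b = Counter(y[i:i + n] for i in range(len(y) - n + 1))
    dot = sum(c * b[g] for g, c in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def pick_references(strings: list[str], k: int) -> list[str]:
    """Greedy stand-in for the optimization search: pick up to k mutually dissimilar strings."""
    pool = list(dict.fromkeys(strings))  # drop duplicates, keep original order
    if not pool or k <= 0:
        return []
    chosen = [pool[0]]
    while len(chosen) < min(k, len(pool)):
        rest = [s for s in pool if s not in chosen]
        # Add the string whose closest already-chosen reference is farthest away.
        chosen.append(min(rest, key=lambda s: max(similarity(s, c) for c in chosen)))
    return chosen
```

With the hypothetical input `["STARBUCKS", "STARBUCKS COFFEE", "WALMART", "SHELL OIL"]` and `k=3`, the near-duplicate "STARBUCKS COFFEE" is the string left out.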

13. A computer-based method for approximate matching of a candidate character string to a set of reference character strings stored within a database, said method comprising:
- identifying a set of dissimilar character strings in a plurality of character string records stored in the database utilizing an optimization search to generate a set of dissimilar reference character strings;
  
  generating a binary index for each character string record stored within the database based on a comparison of an n-gram representation of a selected one of the character strings in the character string record and an n-gram representation of each of the set of dissimilar reference character strings, wherein an i-th bit of the binary index represents a degree of matching of the candidate string with the i-th reference character string;
  
  generating a binary index for a respective one of a candidate character string in a candidate character string record;
  
  individually comparing the binary index of the candidate character string to the binary index for each reference character string in the set of reference character strings using a structured query language n-gram frequency similarity calculation to compare the n-gram representations by;
  
  a) determining a magnitude (A) of a vector associated with the n-gram representation of the candidate character string;
  
  b) determining a magnitude (B) of a vector associated with the n-gram representation of one of the reference character strings as magnitude B;
  
  c) computing a dot product between the two vectors; and
  
  d) computing the similarity metric for the candidate character string with respect to the reference character string according to (dot product/(magnitude A × magnitude B)); and
  
  repeating steps b), c), and d) for each reference character string wherein the number of records containing a reference character string is less than a number of the plurality of character string records.
- Dependent Claims (14, 15)
  - 14. A computer-based method according to claim 13 wherein the candidate character string is one of a merchant name and a merchant address and the set of reference character strings within the database are, respectively, a larger set of merchant names and merchant addresses.
  - 15. A computer-based method according to claim 13 further comprising:
    - joining records that share the same binary index value; and
      
      sorting the joined records by relevance by summing the products of the frequency weights of all matching n-gram representations.
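
The join and relevance sort of claims 5 and 15 can be pictured with the sketch below. It assumes the binary index of each record has already been computed, treats raw n-gram counts as the "frequency weights", and uses hypothetical names and sample records throughout; the claims implement the calculation in structured query language rather than Python.

```python
from collections import Counter


def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count every overlapping n-gram in the string (n=2 is assumed)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relevance(a: str, b: str) -> int:
    """Sum of products of the counts of n-grams that occur in both strings."""
    ca, cb = ngram_counts(a), ngram_counts(b)
    return sum(count * cb[gram] for gram, count in ca.items())


def ranked_matches(candidate: str, candidate_key: str, keyed_records: dict[str, str]) -> list[str]:
    """Keep records whose binary index equals the candidate's, most relevant first."""
    bucket = [name for name, key in keyed_records.items() if key == candidate_key]
    return sorted(bucket, key=lambda name: relevance(candidate, name), reverse=True)


# Hypothetical records mapped to precomputed binary index values.
records = {"STARBUCKS COFFEE": "100", "STARBUCKS #1234": "100", "WALMART": "010"}
print(ranked_matches("STARBUCKS COFFE", "100", records))
```

Only records sharing the candidate's binary index are scored, which is what keeps the number of detailed comparisons below the total number of stored records.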

Current Assignee: Mastercard International Incorporated (MasterCard Incorporated)
Original Assignee: Mastercard International Incorporated (MasterCard Incorporated)
Inventors: McGeehan, Thomas; Merz, Christopher J.
Primary Examiner(s): Vital, Pierre M
Assistant Examiner(s): Vo, Truong V
Application Number: US11/967,494
Publication Number: US 20090171955A1
Time in Patent Office: 1,198 Days
Field of Search: 707/727
US Class Current: 707/727
CPC Class Codes:
G06F 16/24558 Binary matching operations
G06F 16/90344 by using string matching te...
