×

Methods and systems for implementing approximate string matching within a database

  • US 7,925,652 B2
  • Filed: 12/31/2007
  • Issued: 04/12/2011
  • Est. Priority Date: 12/31/2007
  • Status: Active Grant
First Claim
Patent Images

1. A computer-based method for character string matching of a candidate character string with a plurality of character string records stored within a database, said method comprising:

  • identifying a set of dissimilar character strings in the plurality of character string records stored in the database utilizing an optimization search to generate a set of dissimilar reference character strings;

    computing a two-dimensional vector containing a frequency of occurrence of all unique n-grams in the candidate character string and a frequency of occurrence of all unique n-grams in the reference character string;

    computing a similarity metric for the candidate character string, with respect to the reference character string, based on the two-dimensional vector;

    determining a magnitude of the vector associated with the candidate character string as magnitude A;

    determining a magnitude of the vector associated with the reference character string as magnitude B;

    computing a dot product between the two vectors;

    computing the similarity metric according to (dot product/(magnitude A×

    magnitude B));

    generating a binary index for each character string record stored within the database based on a comparison of an n-gram representation of a selected one of the character strings in the character string record and an n-gram representation of each of the set of dissimilar reference character strings, wherein an i-th bit of the binary index represents a degree of matching of the candidate string with the i-th reference character string;

    generating a binary index for a respective one of a candidate character string in a candidate character string record;

    for only each character string record stored within the database whose binary index exactly matches the binary index of the candidate character string, locating each character string record whose selected character string matches the respective character string of the candidate string record; and

    indexing the candidate character string record within the database based on the matching.

View all claims
  • 1 Assignment
Timeline View
Assignment View
    ×
    ×