Methods and systems for implementing approximate string matching within a database
First Claim
1. A computer-based method for character string matching of a candidate character string with a plurality of character string records stored within a database, said method comprising:
- identifying a set of dissimilar character strings in the plurality of character string records stored in the database utilizing an optimization search to generate a set of dissimilar reference character strings;
computing a two-dimensional vector containing a frequency of occurrence of all unique n-grams in the candidate character string and a frequency of occurrence of all unique n-grams in the reference character string;
computing a similarity metric for the candidate character string, with respect to the reference character string, based on the two-dimensional vector;
determining a magnitude of the vector associated with the candidate character string as magnitude A;
determining a magnitude of the vector associated with the reference character string as magnitude B;
computing a dot product between the two vectors;
computing the similarity metric according to (dot product/(magnitude A×
magnitude B));
generating a binary index for each character string record stored within the database based on a comparison of an n-gram representation of a selected one of the character strings in the character string record and an n-gram representation of each of the set of dissimilar reference character strings, wherein an i-th bit of the binary index represents a degree of matching of the candidate string with the i-th reference character string;
generating a binary index for a respective one of a candidate character string in a candidate character string record;
for only each character string record stored within the database whose binary index exactly matches the binary index of the candidate character string, locating each character string record whose selected character string matches the respective character string of the candidate string record; and
indexing the candidate character string record within the database based on the matching.
1 Assignment
0 Petitions
Accused Products
Abstract
A computer-based method for character string matching of a candidate character string with a plurality of character string records stored in a database is provided. The method includes identifying a set of reference character strings in the database wherein the reference character strings are identified utilizing an optimization search for a set of dissimilar character strings and generating an n-gram representation for one of the reference character strings in the set of reference character strings. The method also includes generating an n-gram representation for the candidate character string determining a similarity between the n-gram representations, and indexing the candidate character string within the database based on the determined similarities between the n-gram representation of the candidate character string and the reference character strings in the identified set.
-
Citations
15 Claims
-
1. A computer-based method for character string matching of a candidate character string with a plurality of character string records stored within a database, said method comprising:
-
identifying a set of dissimilar character strings in the plurality of character string records stored in the database utilizing an optimization search to generate a set of dissimilar reference character strings; computing a two-dimensional vector containing a frequency of occurrence of all unique n-grams in the candidate character string and a frequency of occurrence of all unique n-grams in the reference character string; computing a similarity metric for the candidate character string, with respect to the reference character string, based on the two-dimensional vector; determining a magnitude of the vector associated with the candidate character string as magnitude A; determining a magnitude of the vector associated with the reference character string as magnitude B; computing a dot product between the two vectors; computing the similarity metric according to (dot product/(magnitude A×
magnitude B));generating a binary index for each character string record stored within the database based on a comparison of an n-gram representation of a selected one of the character strings in the character string record and an n-gram representation of each of the set of dissimilar reference character strings, wherein an i-th bit of the binary index represents a degree of matching of the candidate string with the i-th reference character string; generating a binary index for a respective one of a candidate character string in a candidate character string record; for only each character string record stored within the database whose binary index exactly matches the binary index of the candidate character string, locating each character string record whose selected character string matches the respective character string of the candidate string record; and indexing the candidate character string record within the database based on the matching. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A computer programmed to:
-
identify a set of dissimilar reference character strings in a database utilizing an optimization search; compute a two-dimensional vector containing a frequency of occurrence of all unique n-grams in the candidate character string and all unique n-grams in one of the reference character strings for each of the reference character string; compute a similarity metric for the candidate character string, with respect to the reference character string, based on the two-dimensional vectors; determine a magnitude of the vector associated with the candidate character string as magnitude A; determine a magnitude of the vector associated with the reference character string as magnitude B; compute a dot product between the two vectors; compute the similarity metric according to (dot product/(magnitude A×
magnitude B));generate a binary index for each character string record stored within the database based on a comparison of an n-gram representation of a selected one of the character strings in the character string record and an n-gram representation of each of the set of dissimilar reference character strings, wherein an i-th bit of the binary index represents a degree of matching of the candidate string with the i-th reference character string; generate a binary index for a respective one of a candidate character string in a candidate character string record; for only each character string record stored within the database whose binary index exactly matches the binary index of the candidate character string, locate each character string record whose selected character string matches the respective character string of the candidate string record; and index the candidate character string record within the database based on the matching. - View Dependent Claims (10, 11, 12)
-
-
13. A computer-based method for approximate matching of a candidate character string to a set of reference character strings stored within a database, said method comprising:
-
identifying a set of dissimilar character strings in a plurality of character string records stored in the database utilizing an optimization search to generate a set of dissimilar reference character strings; generating a binary index for each character string record stored within the database based on a comparison of an n-gram representation of a selected one of the character strings in the character string record and an n-gram representation of each of the set of dissimilar reference character strings, wherein an i-th bit of the binary index represents a degree of matching of the candidate string with the i-th reference character string; generating a binary index for a respective one of a candidate character string in a candidate character string record; individually comparing the binary index of the candidate character string to the binary index for each reference character string in the set of reference character strings using a structured query language n-gram frequency similarity calculation to compare the n-gram representations by; a) determining a magnitude (A) of a vector associated with the n-gram representation of the candidate character string; b) determining a magnitude (B) of a vector associated with the n-gram representation of one of the reference character strings as magnitude B; c) computing a dot product between the two vectors; and d) computing the similarity metric for the candidate character string with respect to the reference character string according to (dot product/(magnitude A×
magnitude B)); andrepeating steps b), c), and d) for each reference character string wherein the number of records containing a reference character string is less than a number of the plurality of character string records. - View Dependent Claims (14, 15)
-
Specification