Methods and systems for implementing approximate string matching within a database
First Claim
1. A computer-based method for character string matching of a candidate character string with a plurality of character string records stored within a database, said method comprising:
- selecting one or more of the character strings in the plurality of character string records to form a set of reference character strings using a principal components factor analysis (PCFA) to identify a set of dissimilar reference character strings in the plurality of character string records;
generating a binary index key for each of the character strings in the set of character strings, the binary index key comprising a plurality of bits of binary information, each bit indicating a degree of matching of the character string to the set of reference character strings;
generating a binary index key for the candidate character string;
determining a set of character string records stored within the database that include a binary index key that exactly matches the binary index key of the candidate character string;
from the determined set of character string records stored within the database that include a binary index key that exactly matches the binary index key of the candidate character string, locating each character string record whose selected character string matches the respective character string of the candidate string record; and
indexing the candidate character string record within the database based on the matching.
1 Assignment
0 Petitions
Accused Products
Abstract
A computer-based method for character string matching of a candidate character string with a plurality of character string records stored in a database is described. The method includes a) identifying a set of reference character strings in the database, the reference character strings identified utilizing an optimization search for a set of dissimilar character strings, b) generating an n-gram representation for one of the reference character strings in the set of reference character strings, c) generating an n-gram representation for the candidate character string, d) determining a similarity between the n-gram representations, e) repeating steps b) and d) for the remaining reference character strings in the set of identified reference character strings, and f) indexing the candidate character string within the database based on the determined similarities between the n-gram representation of the candidate character string and the reference character strings in the identified set.
-
Citations
18 Claims
-
1. A computer-based method for character string matching of a candidate character string with a plurality of character string records stored within a database, said method comprising:
-
selecting one or more of the character strings in the plurality of character string records to form a set of reference character strings using a principal components factor analysis (PCFA) to identify a set of dissimilar reference character strings in the plurality of character string records; generating a binary index key for each of the character strings in the set of character strings, the binary index key comprising a plurality of bits of binary information, each bit indicating a degree of matching of the character string to the set of reference character strings; generating a binary index key for the candidate character string; determining a set of character string records stored within the database that include a binary index key that exactly matches the binary index key of the candidate character string; from the determined set of character string records stored within the database that include a binary index key that exactly matches the binary index key of the candidate character string, locating each character string record whose selected character string matches the respective character string of the candidate string record; and indexing the candidate character string record within the database based on the matching. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
-
-
13. A computer database system for matching of a candidate character string with a plurality of character string records stored within a database managed by a relational database management system (RDBMS), said method comprising:
-
a memory including; a first data structure including character strings of identification data for a plurality of merchants; a second data structure including a binary index key representing a similarity of selected character strings to a set of predetermined dissimilar reference strings; a processor programmed to; select one or more of the character strings in the first data structure to form the set of predetermined dissimilar reference strings; generate the binary index key for each of the selected character strings, the binary index key comprising a plurality of bits of binary information, each bit indicating a degree of matching of the character string to the set of reference character strings, the binary index key formed using a process implemented in the RDBMS to use the n-gram frequency similarity calculation that indicates the similarity of a given record to each of the reference strings identified in a principal components factor analysis (PCFA); generate a binary index key for the candidate character string; determine a set of character string records stored within the first data structure associated with a binary index key that exactly matches the binary index key of the candidate character string; from the determined set of character string records stored within the first data structure that include a binary index key that exactly matches the binary index key of the candidate character string, locating each character string record whose selected character string matches the respective character string of the candidate string record; and index the candidate character string record within the database based on the matching. - View Dependent Claims (14, 15, 16, 17, 18)
-
Specification