Method and system for comparing strings with entries of a lexicon
First Claim
1. A method of comparing strings with entries of a lexicon, comprising the steps of:
- organizing entries of the lexicon by, for each entry;
(a) forming an n-gram vector representing a lexicon entry;
(b) folding said n-gram vector into a signature vector by combining multiple n-grams into bits;
(c) creating a list of bits having the same n-grams as the bits of the signature vector, beginning with the bit occurring most frequently in the lexicon and continuing in descending order;
(d) creating a partitioned vector whose element values are binary numbers whose digits represent the bits of the list of bits, partitioned into groups of digits forming the binary numbers, the digits being set or not set depending on whether the bit of the signature vector having the same n-grams is set;
(e) establishing a plurality of buckets having addresses corresponding to all possible element values of the partitioned vector;
(f) indexing said lexicon entry to the one or more of said buckets having an address corresponding to an element value of said lexicon entry'"'"'s partitioned vector;
reducing the number of lexicon entries to be compared to a particular unverified string by;
creating a partitioned vector for said unverified string according to steps (a)-(d);
indexing said unverified string to those buckets whose addresses correspond to an element value of said unverified string'"'"'s partitioned vector; and
comparing a representation of said unverified string with representations of only those lexicon entries that are indexed to the bucket addresses to which said unverified string is indexed.
1 Assignment
0 Petitions
Accused Products
Abstract
A system and method for more efficiently comparing an unverified string to a lexicon, which filters the lexicon through multiple steps to reduce the number of entries to be directly compared with the unverified string. The method begins by preparing the lexicon with an n-gram encoding, partitioning and hashing process, which can be accomplished in advance of any processing of unverified strings. The unknown is compared first by partitioning and hashing it in the same way to reduce the lexicon in a computationally inexpensive manner. This is followed by an encoded vector comparison step, and finally by a direct string comparison step, which is the most computationally expensive. The reduction of the lexicon is accomplished without arbitrarily eliminating any large portions of the lexicon that might contain relevant candidates. At the same time, the method avoids the need to compare the unverified string directly or indirectly with all the entries in the lexicon. The final candidate list includes only highly possible and ranked candidates for the unverified string, and the size of the final list is adjustable.
-
Citations
17 Claims
-
1. A method of comparing strings with entries of a lexicon, comprising the steps of:
-
organizing entries of the lexicon by, for each entry; (a) forming an n-gram vector representing a lexicon entry; (b) folding said n-gram vector into a signature vector by combining multiple n-grams into bits; (c) creating a list of bits having the same n-grams as the bits of the signature vector, beginning with the bit occurring most frequently in the lexicon and continuing in descending order; (d) creating a partitioned vector whose element values are binary numbers whose digits represent the bits of the list of bits, partitioned into groups of digits forming the binary numbers, the digits being set or not set depending on whether the bit of the signature vector having the same n-grams is set; (e) establishing a plurality of buckets having addresses corresponding to all possible element values of the partitioned vector; (f) indexing said lexicon entry to the one or more of said buckets having an address corresponding to an element value of said lexicon entry'"'"'s partitioned vector; reducing the number of lexicon entries to be compared to a particular unverified string by; creating a partitioned vector for said unverified string according to steps (a)-(d); indexing said unverified string to those buckets whose addresses correspond to an element value of said unverified string'"'"'s partitioned vector; and comparing a representation of said unverified string with representations of only those lexicon entries that are indexed to the bucket addresses to which said unverified string is indexed. - View Dependent Claims (2, 3, 4, 5, 8, 9, 17)
-
- 6. The method of claim 38, further comprising the step of comparing the unverified string in its original form and the further reduced entries of the lexicon in their original form.
-
10. A system for linking entries of a lexicon to an unverified string in an image, comprising:
-
an imager operative to acquire and store an image of a string of characters; and a processor configured to; organize entries of the lexicon by, for each entry; (a) form an n-gram vector representing a lexicon entry; (b) fold said n-gram vector into a signature vector by combining multiple ngrams into bits; (c) determine a list of bits having the same n-grams as the bits of the signature vector, beginning with the bit occurring most frequently in the lexicon and continuing in descending order; (d) create a partitioned vector whose element values are binary numbers whose digits represent the bits of the list of bits, partitioned into groups of digits forming the binary numbers, the digits being set or not set depending on whether the bit of the signature vector having the same n-grams is set; (e) establish a plurality of buckets having addresses corresponding to all possible element values of the partitioned vector; (f) index said lexicon entry to the one or more of said buckets having an address corresponding to an element value of said lexicon entry'"'"'s partitioned vector; reduce the number of lexicon entries to be compared to a particular unverified string by; creating a partitioned vector for said unverified string according to steps (a)-(d); indexing said unverified string to those buckets whose addresses correspond to an element value of said unverified string'"'"'s partitioned vector; and compare a representation of said unverified string with representations of only those lexicon entries that are indexed to the bucket addresses to which said unverified string is indexed.
-
-
11. A method of comparing an unverified string with entries of a lexicon, comprising the steps of:
-
for each entry of the lexicon, forming an n-gram vector representing the entry; folding said n-gram vector into a signature vector of reduced bit length; and partitioning the bits of each signature vector of each entry of said lexicon into groups each having a predetermined number of bits arranged in descending order of frequency of appearance of each bit in the lexicon; forming buckets of the entries of the lexicon indexed to bits of said signature vectors by indexing an entry to a bucket identified by numerical values associated with one or more of the groups, the numerical values each formed by all the bits of a group; comparing a representation of the unverified string with representations of only those lexicon entries mapped to the buckets of a subset of said buckets, by; forming an n-gram vector representing the unverified string; folding said n-gram vector of the unverified string into a signature vector of reduced bit length; and partitioning the bits of the signature vector of the unverified string into groups each having a predetermined number of bits arranged in descending order of frequency of appearance of each bit in the lexicon, indexing the unverified string to said buckets based on the numerical values formed by the bits of one or more of the groups of the unverified string'"'"'s partitioned signature vector, and comparing a representation of the unverified string with representations of only those lexicon entries indexed at least one of the same buckets as the unverified string. - View Dependent Claims (12, 13, 14, 15, 16)
-
Specification