Suffix array candidate selection and index data structure
First Claim
1. A method for identifying a candidate subset of a data set, the data set comprising a plurality of records structured with a data field, each record'"'"'s data field comprising a data field value, the data field value comprising a sequence of one or more unigrams, the method comprising:
- recognizing a query field value, the query field value comprising a sequence of N unigrams beginning with U1 and ending with UN, wherein U symbolizes a unigram and N symbolizes a non-negative integer value; and
performing a first step, a second step, a third step, and a fourth step of a candidate generation iterative loop, whereinthe first step comprises identifying a query field value suffix comprising a sequence of N−
J unigrams beginning with U1+J and ending with UN, wherein J symbolizes a non-negative integer value less than N,the second step comprises identifying a qualifying subset of the data set, wherein each record in the qualifying subset satisfies a similarity criterion when the record'"'"'s data field value is compared to the query field value suffix,the third step comprises including, in the candidate subset, the identified qualifying subset records, andthe fourth step comprises if the number of records in the candidate subset is less than a satisfactory number of candidates, and if N−
J is greater than a minimum suffix length, incrementing J and performing the first step, the second step, the third step, and the fourth step of the candidate generation iterative loop.
14 Assignments
0 Petitions
Accused Products
Abstract
A method and system for identifying a candidate subset of a data set comprises comparing suffixes of query field values to data field values of records in the data set. Sufficiently similar records are included in the candidate subset. Query field value suffixes may range in length from the query field value itself down to a minimum suffix length. The longest suffix may be processed first, and then successively shorter suffixes may be processed until a satisfactory number of candidates are identified. Entries in an index data structure derived from the data set may associate various suffixes found in the data set with individual records. The data structure entries may include record keys identifying records with data field values identical to the suffix and may also include suffix pointers identifying related data structure entries with suffixes similar to the entry'"'"'s suffix.
-
Citations
48 Claims
-
1. A method for identifying a candidate subset of a data set, the data set comprising a plurality of records structured with a data field, each record'"'"'s data field comprising a data field value, the data field value comprising a sequence of one or more unigrams, the method comprising:
-
recognizing a query field value, the query field value comprising a sequence of N unigrams beginning with U1 and ending with UN, wherein U symbolizes a unigram and N symbolizes a non-negative integer value; and performing a first step, a second step, a third step, and a fourth step of a candidate generation iterative loop, wherein the first step comprises identifying a query field value suffix comprising a sequence of N−
J unigrams beginning with U1+J and ending with UN, wherein J symbolizes a non-negative integer value less than N,the second step comprises identifying a qualifying subset of the data set, wherein each record in the qualifying subset satisfies a similarity criterion when the record'"'"'s data field value is compared to the query field value suffix, the third step comprises including, in the candidate subset, the identified qualifying subset records, and the fourth step comprises if the number of records in the candidate subset is less than a satisfactory number of candidates, and if N−
J is greater than a minimum suffix length, incrementing J and performing the first step, the second step, the third step, and the fourth step of the candidate generation iterative loop. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
-
-
18. A method for identifying a candidate subset of a data set, the data set comprising a plurality of records structured with at least M data fields, M symbolizing a non-negative integer value, each of each record'"'"'s at least M data fields comprising a data field value, the data field value comprising a sequence of one or more unigrams, the method comprising:
-
recognizing M query field values, each of the M query field values associated with one of the at least M data fields, each of the M query field values comprising a sequence of N unigrams beginning with U1 and ending with UN, wherein U symbolizes a unigram and N symbolizes a non-negative integer value; and performing a first step, a second step, a third step, a fourth step, a fifth step, and a sixth step of a candidate generation iterative loop, wherein the first step comprises identifying, for each of the M query field values wherein N−
J is greater than a minimum suffix length, a query field value suffix comprising a sequence of N−
J unigrams beginning with U1+J and ending with UN, wherein J symbolizes a non-negative integer value less than N,the second step comprises identifying a qualifying subset of the data set, wherein each record in the qualifying subset satisfies a similarity criterion when at least one of the identified query field value suffixes is compared to its associated data field value, the third step comprises determining a similarity score for each record in the qualifying subset, the fourth step comprises identifying a threshold subset of the qualifying subset, wherein the similarity score for each record in the threshold subset satisfies a threshold similarity score, the fifth step comprises including, in the candidate subset, each record in the threshold subset, and the sixth step comprises if the number of records in the candidate subset is less than a satisfactory number of candidates, incrementing J and performing the first step, the second step, the third step, the fourth step, the fifth step, and the sixth step of the candidate generation iterative loop. - View Dependent Claims (19, 20, 21, 22, 23, 24, 25, 26)
-
-
27. An index data structure derived from a data set, the index data structure for use in identifying a candidate subset of the data set, the data set comprising a plurality of records structured with a data field, each record'"'"'s data field comprising a data field value, the data field value comprising a sequence of N unigrams beginning with U1 and ending with UN, wherein U symbolizes a unigram and N symbolizes a non-negative integer value, the index data structure comprising:
an index data structure entry for each data field value suffix of each record'"'"'s data field value, a data field value suffix comprising a sequence of N−
J unigrams beginning with U1+J and ending with UN, wherein J symbolizes each non-negative integer value less than N wherein N−
J is greater than or equal to a minimum suffix length, wherein each index data structure entry comprises;an index unigram sequence identical to the data field value suffix; and record association data associating the index unigram sequence with at least one qualifying data set record, wherein the at least one qualifying data set record'"'"'s data field value contains the index unigram sequence. - View Dependent Claims (28, 29, 30, 31)
-
32. A system for identifying a candidate subset of a data set, the data set comprising a plurality of records structured with a data field, each record'"'"'s data field comprising a data field value, the data field value comprising a sequence of one or more unigrams, the system comprising:
-
an index data structure stored on one or more memory elements, the index data structure derived from the data set, the index data structure comprising a plurality of entries, each of the plurality of entries comprising an index unigram sequence and associating the index unigram sequence with one or more of the data set records; and a candidate generator implemented on one or more processors, the candidate generator for recognizing a query field value, the query field value comprising a sequence of N unigrams beginning with U1 and ending with UN, wherein U symbolizes a unigram and N symbolizes a non-negative integer value, the candidate generator also for performing a first step, a second step, a third step, and a fourth step of a candidate generation iterative loop, wherein the first step comprises identifying a query field value suffix comprising a sequence of N−
J unigrams beginning with U1+J and ending with UN, wherein J symbolizes a non-negative integer value less than N,the second step comprises identifying a qualifying subset of the data set, wherein each record in the qualifying subset satisfies a similarity criterion when the record'"'"'s data field value is compared to the query field value suffix, and wherein identifying the qualifying subset comprises accessing the index data structure, the third step comprises including, in the candidate subset, the identified qualifying subset records, and the fourth step comprises if the number of records in the candidate subset is less than a satisfactory number of candidates, and if N−
J is greater than a minimum suffix length, incrementing J and performing the first step, the second step, the third step, and the fourth step of the candidate generation iterative loop. - View Dependent Claims (33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48)
-
Specification