Suffix array candidate selection and index data structure

US 20120117076A1
Filed: 06/30/2011
Published: 05/10/2012
Est. Priority Date: 11/09/2010
Status: Active Grant


Abstract

A method and system for identifying a candidate subset of a data set comprises comparing suffixes of query field values to data field values of records in the data set. Sufficiently similar records are included in the candidate subset. Query field value suffixes may range in length from the query field value itself down to a minimum suffix length. The longest suffix may be processed first, and then successively shorter suffixes may be processed until a satisfactory number of candidates are identified. Entries in an index data structure derived from the data set may associate various suffixes found in the data set with individual records. The data structure entries may include record keys identifying records with data field values identical to the suffix and may also include suffix pointers identifying related data structure entries with suffixes similar to the entry's suffix.

48 Claims

1. A method for identifying a candidate subset of a data set, the data set comprising a plurality of records structured with a data field, each record's data field comprising a data field value, the data field value comprising a sequence of one or more unigrams, the method comprising:
- recognizing a query field value, the query field value comprising a sequence of N unigrams beginning with U₁ and ending with U_N, wherein U symbolizes a unigram and N symbolizes a non-negative integer value; and
- performing a first step, a second step, a third step, and a fourth step of a candidate generation iterative loop, wherein
  - the first step comprises identifying a query field value suffix comprising a sequence of N−J unigrams beginning with U_{1+J} and ending with U_N, wherein J symbolizes a non-negative integer value less than N,
  - the second step comprises identifying a qualifying subset of the data set, wherein each record in the qualifying subset satisfies a similarity criterion when the record's data field value is compared to the query field value suffix,
  - the third step comprises including, in the candidate subset, the identified qualifying subset records, and
  - the fourth step comprises if the number of records in the candidate subset is less than a satisfactory number of candidates, and if N−J is greater than a minimum suffix length, incrementing J and performing the first step, the second step, the third step, and the fourth step of the candidate generation iterative loop.
- - 2. The method of claim 1, wherein prior to completing a first iteration of the candidate generation iterative loop, J is equal to zero and the query field value suffix is identical to the query field value.
  - 3. The method of claim 1, wherein the data field value and the query field value each comprise a sequence of one or more graphemes.
  - 4. The method of claim 3, wherein the sequence of one or more graphemes is selected from the group consisting of:
    - a sequence of one or more alphabetic letters;
    - a sequence of one or more alphanumeric characters;
    - a sequence of one or more numerals; and
    - a sequence of one or more Chinese characters.
  - 5. The method of claim 1, wherein identifying the query field value suffix comprises generating the query field value suffix.
  - 6. The method of claim 1, wherein identifying the query field value suffix comprises selecting the query field value suffix from a set of eligible query field value suffixes.
  - 7. The method of claim 6, further comprising generating, prior to performing the first step of a first iteration of the candidate generation iterative loop, the set of eligible query field value suffixes.
  - 8. The method of claim 7, wherein generating the set of eligible query field value suffixes comprises performing a first step, a second step, and a third step of a suffix generation iterative loop, wherein
    - the first step comprises generating a suffix comprising a sequence of N−K unigrams beginning with U_{1+K} and ending with U_N, wherein K symbolizes a non-negative integer value less than N,
    - the second step comprises including, in the set of eligible query field value suffixes, the generated suffix, and
    - the third step comprises if N−K is greater than the minimum suffix length, incrementing K and performing the first step, the second step, and the third step of the suffix generation iterative loop.
  - 9. The method of claim 8, wherein prior to completing a first iteration of the suffix generation iterative loop, K is equal to zero and the suffix is identical to the query field value.
  - 10. The method of claim 1, wherein the data field value of each qualifying subset record comprises a sequence of unigrams identical to the query field value suffix.
  - 11. The method of claim 10, wherein the data field value of a first qualifying subset record comprises a sequence of greater than N−J unigrams, and wherein the last N−J unigrams in the sequence are identical to the query field value suffix.
  - 12. The method of claim 10, wherein the data field value of a first qualifying subset record comprises a sequence of greater than N−J unigrams, and wherein the first N−J unigrams in the sequence are identical to the query field value suffix.
  - 13. The method of claim 1, wherein the data field value of each qualifying subset record is associated with a similarity score when compared to the query field value suffix, and wherein the similarity score satisfies a minimum similarity score criterion.
  - 14. The method of claim 1, wherein identifying the qualifying subset of the data set comprises accessing an index data structure derived from the data set, the index data structure comprising a plurality of entries, each of the plurality of entries comprising a unigram sequence and associating the unigram sequence with one or more of the data set records.
  - 15. The method of claim 14, wherein identifying the qualifying subset of the data set further comprises:
    - identifying a matching entry in the index data structure, wherein the unigram sequence of the matching entry satisfies an index entry similarity criterion when compared to the query field value suffix; and
    - including, in the qualifying subset of the data set, the one or more data set records associated with the unigram sequence of the matching entry.
  - 16. The method of claim 15, wherein the unigram sequence of the matching entry is identical to the query field value suffix.
  - 17. The method of claim 1, further comprising delivering the candidate subset to a filter process, wherein the filter process identifies a filtered subset of the candidate subset.
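
The loops recited in claims 1 and 8 can be sketched in Python as follows. This is an illustrative reading, not the patentee's implementation: the function names are hypothetical, unigrams are assumed to be single characters, and the similarity criterion is assumed to be the exact "ends with the suffix" match of claims 10 and 11. The sketch also assumes a minimum suffix length of at least 1 so that J and K stay below N.

```python
from typing import Dict, List, Sequence, Tuple

Unigrams = Tuple[str, ...]


def eligible_suffixes(query: Sequence[str], min_len: int) -> List[Unigrams]:
    """Suffix generation loop (claim 8): K = 0, 1, ... while N-K > min_len."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1 (assumption of this sketch)")
    n, k, out = len(query), 0, []
    while True:
        out.append(tuple(query[k:]))      # steps 1-2: suffix U_{1+K} .. U_N
        if n - k > min_len:               # step 3: shorten and repeat
            k += 1
        else:
            return out


def ends_with(value: Sequence[str], suffix: Unigrams) -> bool:
    """Assumed similarity criterion: the data field value ends with the suffix."""
    return len(value) >= len(suffix) and tuple(value[len(value) - len(suffix):]) == suffix


def generate_candidates(query: Sequence[str], records: Dict[int, str],
                        satisfactory: int, min_len: int) -> List[int]:
    """Candidate generation loop (claim 1); returns record keys in discovery order."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1 (assumption of this sketch)")
    n, j = len(query), 0
    candidates: List[int] = []
    while True:
        suffix = tuple(query[j:])                                   # step 1
        qualifying = [key for key, value in records.items()
                      if ends_with(value, suffix)]                  # step 2
        for key in qualifying:                                      # step 3
            if key not in candidates:
                candidates.append(key)
        if len(candidates) < satisfactory and n - j > min_len:      # step 4
            j += 1
        else:
            return candidates


if __name__ == "__main__":
    records = {1: "anderson", 2: "son", 3: "johnson", 4: "smith"}
    print(generate_candidates("sanderson", records, satisfactory=2, min_len=3))
    # prints [1, 2, 3]
```

In the usage example, the longer suffixes match only record 1, so the loop keeps shortening the suffix until "son" brings in records 2 and 3 and the satisfactory count is reached.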

18. A method for identifying a candidate subset of a data set, the data set comprising a plurality of records structured with at least M data fields, M symbolizing a non-negative integer value, each of each record's at least M data fields comprising a data field value, the data field value comprising a sequence of one or more unigrams, the method comprising:
- recognizing M query field values, each of the M query field values associated with one of the at least M data fields, each of the M query field values comprising a sequence of N unigrams beginning with U₁ and ending with U_N, wherein U symbolizes a unigram and N symbolizes a non-negative integer value; and
- performing a first step, a second step, a third step, a fourth step, a fifth step, and a sixth step of a candidate generation iterative loop, wherein
  - the first step comprises identifying, for each of the M query field values wherein N−J is greater than a minimum suffix length, a query field value suffix comprising a sequence of N−J unigrams beginning with U_{1+J} and ending with U_N, wherein J symbolizes a non-negative integer value less than N,
  - the second step comprises identifying a qualifying subset of the data set, wherein each record in the qualifying subset satisfies a similarity criterion when at least one of the identified query field value suffixes is compared to its associated data field value,
  - the third step comprises determining a similarity score for each record in the qualifying subset,
  - the fourth step comprises identifying a threshold subset of the qualifying subset, wherein the similarity score for each record in the threshold subset satisfies a threshold similarity score,
  - the fifth step comprises including, in the candidate subset, each record in the threshold subset, and
  - the sixth step comprises if the number of records in the candidate subset is less than a satisfactory number of candidates, incrementing J and performing the first step, the second step, the third step, the fourth step, the fifth step, and the sixth step of the candidate generation iterative loop.
- - 19. The method of claim 18, wherein prior to completing a first iteration of the candidate generation iterative loop, J is equal to zero and for each of the M query field values, the M query field value suffix is identical to the M query field value.
  - 20. The method of claim 18, wherein N is the same for each of the M query field values.
  - 21. The method of claim 18, wherein identifying the query field value suffix comprises selecting the query field value suffix from a set of eligible query field value suffixes.
  - 22. The method of claim 21, further comprising generating, prior to performing the first step of a first iteration of the candidate generation iterative loop, the set of eligible query field value suffixes for each of the M query field values.
  - 23. The method of claim 18, wherein at least one data field value of each qualifying subset record comprises a sequence of unigrams identical to its associated query field value suffix.
  - 24. The method of claim 18, wherein identifying the qualifying subset of the data set comprises accessing an index data structure, the index data structure comprising a plurality of entries, each of the plurality of entries comprising a unigram sequence and associating the unigram sequence with one or more data set records.
  - 25. The method of claim 24, wherein identifying the qualifying subset of the data set further comprises:
    - identifying a matching entry in the index data structure, wherein the unigram sequence of the matching entry satisfies an index entry similarity criterion when compared to the query field value suffix; and
    - including, in the qualifying subset of the data set, the one or more data set records associated with the unigram sequence of the matching entry.
  - 26. The method of claim 25, wherein the unigram sequence of the matching entry is identical to the query field value suffix.
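
The six-step, multi-field loop of claim 18 might look like the sketch below. Several choices here are assumptions rather than anything the claims specify: unigrams are characters, the similarity criterion is an exact suffix match on at least one field, the similarity score is the mean `difflib.SequenceMatcher` ratio across fields, and the loop stops once no query field has an eligible suffix left. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Sequence, Tuple


def similarity_score(query_values: Sequence[str], record_values: Sequence[str]) -> float:
    """Assumed scoring rule: mean per-field SequenceMatcher ratio."""
    ratios = [SequenceMatcher(None, q, r).ratio()
              for q, r in zip(query_values, record_values)]
    return sum(ratios) / len(ratios)


def generate_candidates_multi(query_values: Sequence[str],
                              records: Dict[int, Tuple[str, ...]],
                              satisfactory: int, min_len: int,
                              threshold: float) -> Dict[int, float]:
    """Returns {record key: similarity score} for the candidate subset."""
    candidates: Dict[int, float] = {}
    j = 0
    while True:
        # Step 1: one suffix per field whose remaining length N-J exceeds min_len.
        suffixes = {m: q[j:] for m, q in enumerate(query_values)
                    if len(q) - j > min_len}
        if not suffixes:          # assumption: nothing left to try, so stop
            return candidates
        # Step 2: records where at least one field ends with its suffix.
        qualifying = [(key, vals) for key, vals in records.items()
                      if any(vals[m].endswith(s) for m, s in suffixes.items())]
        # Steps 3-5: score, apply the threshold, add to the candidate subset.
        for key, vals in qualifying:
            score = similarity_score(query_values, vals)
            if score >= threshold:
                candidates[key] = score
        # Step 6: shorten the suffixes if more candidates are still needed.
        if len(candidates) < satisfactory:
            j += 1
        else:
            return candidates


if __name__ == "__main__":
    records = {
        1: ("john", "smith"),
        2: ("jon", "smyth"),
        3: ("mary", "jones"),
        4: ("joan", "blacksmith"),
    }
    found = generate_candidates_multi(("john", "smith"), records,
                                      satisfactory=3, min_len=1, threshold=0.5)
    print(sorted(found))
```

With these sample records, records 1 and 4 qualify on the full-length suffixes, and record 2 only qualifies once the last-name suffix has shrunk to "th".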

27. An index data structure derived from a data set, the index data structure for use in identifying a candidate subset of the data set, the data set comprising a plurality of records structured with a data field, each record's data field comprising a data field value, the data field value comprising a sequence of N unigrams beginning with U₁ and ending with U_N, wherein U symbolizes a unigram and N symbolizes a non-negative integer value, the index data structure comprising:
- an index data structure entry for each data field value suffix of each record's data field value, a data field value suffix comprising a sequence of N−J unigrams beginning with U_{1+J} and ending with U_N, wherein J symbolizes each non-negative integer value less than N wherein N−J is greater than or equal to a minimum suffix length, wherein each index data structure entry comprises:
  - an index unigram sequence identical to the data field value suffix; and
  - record association data associating the index unigram sequence with at least one qualifying data set record, wherein the at least one qualifying data set record's data field value contains the index unigram sequence.
- - 28. The index data structure of claim 27, wherein the index data structure comprises an entry for each unique data field value suffix in the data set.
  - 29. The index data structure of claim 27, wherein the index data structure is sorted by index unigram sequence.
  - 30. The index data structure of claim 27, wherein the record association data comprises:
    - zero or more record keys, each record key identifying one of the qualifying data set records, wherein the one qualifying data set record's data field value is identical to the index unigram sequence; and
    - zero or more suffix pointers, each suffix pointer identifying a related entry in the index data structure, the related entry's unigram sequence comprising N−J+1 unigrams, wherein the index unigram sequence is identical to the last N−J unigrams of the related entry's index unigram sequence.
  - 31. The index data structure of claim 30, further comprising:
    - a first index data structure entry comprising a first-entry index unigram sequence, zero first-entry record keys, a first first-entry suffix pointer, and a second first-entry suffix pointer, the first first-entry suffix pointer identifying a second index data structure entry, and the second first-entry suffix pointer identifying a third index data structure entry;
    - the second index data structure entry comprising a second-entry index unigram sequence, a first second-entry record key, a second second-entry record key, and a second-entry suffix pointer, the last unigrams of the second-entry index unigram sequence identical to the first-entry index unigram sequence, the first second-entry record key identifying a first record in the data set, the first record's data field value identical to the second-entry index unigram sequence, the second second-entry record key identifying a fourth record in the data set, the fourth record's data field value identical to the second-entry index unigram sequence, and the second-entry suffix pointer identifying a fourth index data structure entry;
    - the third index data structure entry comprising a third-entry index unigram sequence, a third-entry record key, and zero third-entry suffix pointers, the last unigrams of the third-entry index unigram sequence identical to the first-entry index unigram sequence, and the third-entry record key identifying a second record in the data set, the second record's data field value identical to the third-entry index unigram sequence;
    - the fourth index data structure entry comprising a fourth-entry index unigram sequence, zero fourth-entry record keys, and a fourth-entry suffix pointer, the last unigrams of the fourth-entry index unigram sequence identical to the second-entry index unigram sequence, and the fourth-entry suffix pointer identifying a fifth index data structure entry; and
    - the fifth index data structure entry comprising a fifth-entry index unigram sequence, a fifth-entry record key, and zero fifth-entry suffix pointers, the last unigrams of the fifth-entry index unigram sequence identical to the fourth-entry index unigram sequence, and the fifth-entry record key identifying a third record in the data set, the third record's data field value identical to the fifth-entry index unigram sequence.
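
The index data structure of claims 27 to 31 can be illustrated with a small sketch. The class and function names, the use of a Python dict keyed by the suffix string, and representing a suffix pointer as the key of the related entry are all assumptions made for illustration. The sample records are chosen so that the resulting five entries have the shape recited in claim 31.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    record_keys: List[int] = field(default_factory=list)      # exact-value records
    suffix_pointers: List[str] = field(default_factory=list)  # entries one unigram longer


def build_index(records: Dict[int, str], min_len: int) -> Dict[str, Entry]:
    """One entry per unique suffix of length >= min_len, sorted by sequence."""
    index: Dict[str, Entry] = {}
    for key, value in records.items():
        n = len(value)
        for j in range(0, n - min_len + 1):        # N-J >= min_len
            entry = index.setdefault(value[j:], Entry())
            if j == 0:
                entry.record_keys.append(key)      # value identical to the sequence
            else:
                longer = value[j - 1:]             # related entry with N-J+1 unigrams
                if longer not in entry.suffix_pointers:
                    entry.suffix_pointers.append(longer)
    return dict(sorted(index.items()))             # sorted as in claim 29


def lookup(index: Dict[str, Entry], suffix: str) -> List[int]:
    """Collect keys of all records whose value ends with `suffix`."""
    found: List[int] = []
    stack = [suffix]
    while stack:
        entry = index.get(stack.pop())
        if entry is None:
            continue
        found.extend(entry.record_keys)
        stack.extend(entry.suffix_pointers)        # follow pointers to longer suffixes
    return sorted(found)


if __name__ == "__main__":
    records = {1: "son", 2: "ion", 3: "anson", 4: "son"}
    index = build_index(records, min_len=2)
    for sequence, entry in index.items():
        print(sequence, entry.record_keys, entry.suffix_pointers)
    print(lookup(index, "on"))   # prints [1, 2, 3, 4]
```

Here the entry for "on" holds no record keys and two suffix pointers ("son" and "ion"), "son" holds record keys 1 and 4 plus a pointer to "nson", and "nson" holds only a pointer to "anson", which carries record key 3. Because each record key is stored once, at the entry for its full value, following suffix pointers reaches every record ending in the looked-up suffix without duplicates.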

32. A system for identifying a candidate subset of a data set, the data set comprising a plurality of records structured with a data field, each record's data field comprising a data field value, the data field value comprising a sequence of one or more unigrams, the system comprising:
- an index data structure stored on one or more memory elements, the index data structure derived from the data set, the index data structure comprising a plurality of entries, each of the plurality of entries comprising an index unigram sequence and associating the index unigram sequence with one or more of the data set records; and
- a candidate generator implemented on one or more processors, the candidate generator for recognizing a query field value, the query field value comprising a sequence of N unigrams beginning with U₁ and ending with U_N, wherein U symbolizes a unigram and N symbolizes a non-negative integer value, the candidate generator also for performing a first step, a second step, a third step, and a fourth step of a candidate generation iterative loop, wherein
  - the first step comprises identifying a query field value suffix comprising a sequence of N−J unigrams beginning with U_{1+J} and ending with U_N, wherein J symbolizes a non-negative integer value less than N,
  - the second step comprises identifying a qualifying subset of the data set, wherein each record in the qualifying subset satisfies a similarity criterion when the record's data field value is compared to the query field value suffix, and wherein identifying the qualifying subset comprises accessing the index data structure,
  - the third step comprises including, in the candidate subset, the identified qualifying subset records, and
  - the fourth step comprises if the number of records in the candidate subset is less than a satisfactory number of candidates, and if N−J is greater than a minimum suffix length, incrementing J and performing the first step, the second step, the third step, and the fourth step of the candidate generation iterative loop.
- - 33. The system of claim 32, wherein prior to completing a first iteration of the candidate generation iterative loop, J is equal to zero and the query field value suffix is identical to the query field value.
  - 34. The system of claim 32, wherein identifying the query field value suffix comprises generating the query field value suffix.
  - 35. The system of claim 32, wherein identifying the query field value suffix comprises selecting the query field value suffix from a set of eligible query field value suffixes.
  - 36. The system of claim 35, wherein the candidate generator is further for generating, prior to performing the first step of a first iteration of the candidate generation iterative loop, the set of eligible query field value suffixes.
  - 37. The system of claim 36, wherein generating the set of eligible query field value suffixes comprises performing a first step, a second step, and a third step of a suffix generation iterative loop, wherein
    - the first step comprises generating a suffix comprising a sequence of N−K unigrams beginning with U_{1+K} and ending with U_N, wherein K symbolizes a non-negative integer value less than N,
    - the second step comprises including, in the set of eligible query field value suffixes, the generated suffix, and
    - the third step comprises if N−K is greater than the minimum suffix length, incrementing K and performing the first step, the second step, and the third step of the suffix generation iterative loop.
  - 38. The system of claim 37, wherein prior to completing a first iteration of the suffix generation iterative loop, K is equal to zero and the suffix is identical to the query field value.
  - 39. The system of claim 32, wherein the data field value of each qualifying subset record comprises a sequence of unigrams identical to the query field value suffix.
  - 40. The system of claim 39, wherein the data field value of a first qualifying subset record comprises a sequence of greater than N−J unigrams, and wherein the last N−J unigrams in the sequence are identical to the query field value suffix.
  - 41. The system of claim 39, wherein the data field value of a first qualifying subset record comprises a sequence of greater than N−J unigrams, and wherein the first N−J unigrams in the sequence are identical to the query field value suffix.
  - 42. The system of claim 32, wherein the data field value of each qualifying subset record is associated with a similarity score when compared to the query field value suffix, and wherein the similarity score satisfies a minimum similarity score criterion.
  - 43. The system of claim 32, wherein identifying the qualifying subset of the data set further comprises:
    - identifying a matching entry in the index data structure, wherein the unigram sequence of the matching entry satisfies an index entry similarity criterion when compared to the query field value suffix; and
    - including, in the qualifying subset of the data set, the one or more data set records associated with the unigram sequence of the matching entry.
  - 44. The system of claim 43, wherein the unigram sequence of the matching entry is identical to the query field value suffix.
  - 45. The system of claim 32, wherein the candidate generator is further for delivering the candidate subset to a filter process, wherein the filter process identifies a filtered subset of the candidate subset.
  - 46. The system of claim 32, wherein the candidate generator is further for modifying the recognized query field value prior to performing the first iteration of the candidate generation iterative loop.
  - 47. The system of claim 46, wherein modifying the recognized query field value comprises:
    - appending a unigram equal to U₁ to the query field value;
    - removing U₁ from the query field value; and
    - shifting the sequence of N unigrams to the left such that the unigram formerly in position U_i is in position U_{i-1}.
  - 48. The system of claim 32, further comprising:
    - an index data structure generator implemented on the one or more processors, the index data structure generator for generating the index data structure.
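
One way to read the query modification of claim 47 is as a left rotation by one position: a copy of U₁ is appended, the original U₁ is removed, and every remaining unigram moves one position to the left. The sketch below follows that reading; the function name is hypothetical and unigrams are assumed to be characters.

```python
from typing import List, Sequence


def rotate_query_left(query: Sequence[str]) -> List[str]:
    """Apply the three modifications of claim 47 to a sequence of unigrams."""
    unigrams = list(query)
    if not unigrams:
        return unigrams
    unigrams.append(unigrams[0])  # append a unigram equal to U_1
    del unigrams[0]               # remove U_1; U_i now sits in position i-1
    return unigrams


if __name__ == "__main__":
    print("".join(rotate_query_left("abcd")))  # prints bcda
```

The modified value still has N unigrams, so the candidate generation loop of claim 32 can then run on it unchanged.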

Current Assignee: Cloud Software Group
Original Assignee: TIBCO Software Incorporated (Cloud Software Group)
Inventors: Austermann, Patrick
Granted Patent: US 8,745,061 B2
US Class Current: 707/741
CPC Class Codes:
G06F 16/2228   Indexing structures
G06F 16/2468   Fuzzy queries
G06F 16/3338   Query expansion
