Method and apparatus for improving performance of approximate string queries using variable length high-quality grams

US 7,996,369 B2
Filed: 12/14/2008
Issued: 08/09/2011
Est. Priority Date: 11/14/2008
Status: Active Grant

Abstract

A computer process, called VGRAM, improves the performance of approximate string search algorithms in computers by using a carefully chosen dictionary of variable-length grams based on their frequencies in the string collection. A dynamic programming algorithm is disclosed for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance. A method is also disclosed for automatically computing a dictionary of high-quality grams for a workload of queries. These techniques improve query performance through a cost-based quantitative approach to deciding good grams for approximate string queries. Two approaches are disclosed for answering approximate queries efficiently: one based on discarding gram lists, and another based on combining correlated lists. An indexing structure is reduced to a given amount of space, while retaining efficient query processing, by using algorithms in a computer based on discarding gram lists and combining correlated lists.
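
For illustration, the decomposition of a string into variable-length positional grams can be sketched in Python by following the steps recited in claim 1. This is a minimal sketch, not the patented implementation. The function name `vgen`, the 1-based position convention, the hand-picked dictionary, and the reading of "already found" as "the substring's span lies inside the span of a positional gram produced earlier" are all assumptions made for the example.

```python
def vgen(s, dictionary, q_min, q_max):
    """Return a list of (position, gram) pairs for string s.

    Positions are 1-based. `dictionary` is any set-like collection of grams
    whose lengths lie between q_min and q_max (hypothetical input).
    """
    grams = []
    p = 1
    # Stop once the current position exceeds |s| - q_min + 1.
    while p <= len(s) - q_min + 1:
        start = p - 1  # 0-based index into s
        gram = None
        # Search for the longest substring starting at p that is in the dictionary.
        longest = min(q_max, len(s) - start)
        for length in range(longest, q_min - 1, -1):
            candidate = s[start:start + length]
            if candidate in dictionary:
                gram = candidate
                break
        # No dictionary gram starts here: materialize a substring of length q_min.
        if gram is None:
            gram = s[start:start + q_min]
        end = p + len(gram) - 1
        # Assumption: skip the gram if an earlier positional gram covers its span.
        covered = any(p0 <= p and end <= p0 + len(g0) - 1 for p0, g0 in grams)
        if not covered:
            grams.append((p, gram))
        p += 1  # move the current position one to the right
    return grams


# Hypothetical dictionary and query string, chosen only for the example.
example = vgen("universal", {"ni", "sal", "uni", "vers"}, q_min=2, q_max=4)
print(example)  # [(1, 'uni'), (3, 'iv'), (4, 'vers'), (7, 'sal')]
```

In this run the gram "ni" at position 2 is found in the dictionary but is not produced, because its span falls inside the gram "uni" already produced at position 1. The gram "iv" at position 3 is materialized at length q_min because no dictionary gram starts there.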

10 Claims

1. An improvement in an indexing method for efficient approximate string search of a query string s against a collection of data strings S corresponding to a gram dictionary D in a computer system comprising:
- preprocessing the dictionary D into a plurality of grams of varying length between q_min and q_max;
  
  starting from a current position in the query string s, searching for the longest substring that matches a gram in the dictionary D, and if no such gram exists in the dictionary D, then materializing a substring of length q_min starting from the current position;
  
  checking if the found or materialized substring is a positional substring already found in the query string s, and if so, then not producing a positional gram corresponding to the found or materialized substring, otherwise producing a positional gram corresponding to the found or materialized substring; and
  
  indexing the current position by one to the right in the query string and repeating the searching and checking until the current position in the query string s is greater than |s|−q_min+1, where |s| is the length of query string s, so that a gram index list for query string s having variable gram length is generated, denoted as the set of positional grams VG(s, D, q_min, q_max).
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. The improvement of claim 1 where preprocessing the dictionary D into a plurality of grams of varying length between q_min and q_max comprises:
    - generating a frequency trie of q_max-grams for the strings for the dictionary D;
      
      collecting gram frequencies by counting on a trie without generating the shorter grams for the dictionary D, except for those grams at the end of a string; and
      
      selecting high quality grams.
  - 3. The improvement of claim 1 further comprising reducing the size of such an indexing structure to a predetermined amount or less of memory space in a computer system, while retaining efficient query processing by discarding selected gram lists, and/or combining correlated gram lists which make up the indexing structure.
  - 4. A memory medium for storing a plurality of instructions for controlling a computer to perform the method of claim 1.
  - 5. The improvement of claim 2 where collecting gram frequencies for the dictionary D comprises:
    - initializing the frequency trie to be empty;
      
      for each string s, generating all its positional q_max-grams;
      
      for each q_max-gram, locating the corresponding leaf node by inserting the q_max-gram into the trie if the gram has not been previously inserted (the frequency for the corresponding leaf node being initialized to 0);
      
      for each node on that path connecting the root of the trie to the leaf node corresponding to the last inserted q_max-gram, including this leaf node, incrementing its frequency by 1 thereby assigning thereto a frequency value n.freq; and
      
      at each q-th node (q_min ≦ q ≦ q_max) on the path, creating a leaf node by appending an edge with an endmarker symbol #, if this new leaf node has not been previously inserted into the trie, signifying that the q_max-gram has a prefix gram of length q that ends at this leaf node marking the edge.
  - 6. The improvement of claim 2 where selecting high quality grams comprises:
    - if a gram g has a low frequency, eliminating from the frequency trie all the extended grams of g; and
      
      if a gram is very frequent, keeping selected ones of the corresponding extended grams in the frequency trie.
  - 7. The improvement of claim 5 where for each string s generating all its positional q_max-grams comprises:
    - processing characters at the end of each string separately, since these characters do not produce positional q_max-grams, for each position p = |s|−q_max+2, ..., |s|−q_min+1 of the string, generating a positional gram of length |s|−p+1, and for each positional gram of length |s|−p+1 locating the corresponding leaf node by inserting the positional gram of length |s|−p+1 into the trie if the gram has not been previously inserted (the frequency for the corresponding leaf node being initialized to 0), for each node on that path connecting the root of the trie to the leaf node corresponding to the last inserted positional gram of length |s|−p+1, including this leaf node, incrementing its frequency by 1 thereby assigning thereto a frequency value n.freq; and
      
      at each q-th node (q_min ≦ q ≦ q_max) on the path, creating a leaf node by appending an edge with an endmarker symbol #, if this new leaf node has not been previously inserted into the trie, signifying that the positional gram of length |s|−p+1 has a prefix gram of length q that ends at this leaf node marking the edge.
  - 8. The improvement of claim 6 where if a gram g has a low frequency, eliminating from the frequency trie all the extended grams of g comprises:
    - choosing a frequency threshold, T; and
      
      pruning the frequency trie by checking nodes from the root down to determine if a current node n has a leaf-node child marked by an edge labeled by the endmarker symbol #; if the current node n does not have any leaf-node child, then the path from the root to the current node n corresponds to a gram shorter than q_min, thus recursively pruning the frequency trie for each of the current node n's children; if the current node n has a leaf-node child L, then materializing a gram g corresponding to L with the frequency of node n, n.freq; if the frequency n.freq is not greater than T, then keeping the gram corresponding to leaf-node child L in the frequency trie, removing the children of current node n except for leaf-node child L, and assigning the frequency of n to leaf-node child L, so that after this pruning step, current node n has a single leaf-node child L.
  - 9. The improvement of claim 6 where if a gram is very frequent, keeping selected ones of the corresponding extended grams in the frequency trie comprises:
    - if n.freq > T, selecting a maximal subset of the current node n's children, excluding leaf-node child L, to remove, so that the summation of the frequencies of the maximal subset of the current node's children and the frequency of the leaf-node child, L.freq, is not greater than T; adding the summation of the frequencies of the maximal subset to the frequency of the leaf-node child; and for the remaining children of the current node n, excluding leaf-node child L, recursively pruning the subtrie.
  - 10. The improvement of claim 9 where selecting a maximal subset of the current node n's children excluding leaf-node child L comprises choosing children with the smallest frequencies to remove, choosing children with the largest frequencies to remove, or randomly selecting children to remove so that the frequency of the leaf-node child, L.freq, is not greater than T after addition of the frequencies of the selected children which have been removed into the leaf-node child L's frequency.
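
The frequency-collection steps recited in claims 5 and 7 can be sketched as follows. This is a hedged Python illustration rather than the patented implementation. The `Node` class, the function names, and the sample strings are hypothetical. The sketch assumes the endmarker `#` never occurs in the data, counts the root as part of every path, and gives endmarker leaves no counter of their own, so a gram's frequency is read from its parent node n as n.freq. The pruning steps of claims 8 to 10 are not shown.

```python
class Node:
    """Trie node holding a frequency counter and child edges keyed by character."""

    def __init__(self):
        self.freq = 0
        self.children = {}


def insert_gram(root, gram, q_min):
    """Insert one gram, incrementing every node on the root-to-leaf path."""
    node = root
    node.freq += 1  # assumption: the root counts as part of the path
    for depth, ch in enumerate(gram, start=1):
        node = node.children.setdefault(ch, Node())
        node.freq += 1
        # At each q-th node with q >= q_min, mark a prefix gram of length q
        # by appending an endmarker edge if it is not already present.
        if depth >= q_min:
            node.children.setdefault("#", Node())


def build_frequency_trie(strings, q_min, q_max):
    """Collect gram frequencies by inserting only q_max-grams and end-of-string grams."""
    root = Node()  # the trie starts empty
    for s in strings:
        n = len(s)
        # All positional q_max-grams of s.
        for start in range(0, n - q_max + 1):
            insert_gram(root, s[start:start + q_max], q_min)
        # Characters at the end of s: grams of length q_max-1 down to q_min.
        for start in range(max(0, n - q_max + 1), n - q_min + 1):
            insert_gram(root, s[start:], q_min)
    return root


def gram_frequency(root, gram):
    """Return n.freq for the node reached by gram, or 0 if no endmarker leaf exists."""
    node = root
    for ch in gram:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.freq if "#" in node.children else 0


# Hypothetical data: two short strings with q_min = 2 and q_max = 3.
trie = build_frequency_trie(["abab", "abc"], q_min=2, q_max=3)
print(gram_frequency(trie, "ab"))  # 3
print(gram_frequency(trie, "ba"))  # 1
```

Because every occurrence of a shorter gram is a prefix of exactly one inserted q_max-gram or end-of-string gram, the shorter grams never need to be generated separately. In the example, "ab" is counted once for each of "aba", the trailing "ab", and "abc".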

Current Assignee
Regents of the University of California (University of California)
Original Assignee
Regents of the University of California (University of California)
Inventors
Li, Chen; Wang, Bin; Yang, Xaochun; Ji, Shengyue; Behm, Alexander; Lu, Jiaheng
Primary Examiner(s)
Al-Hashemi; Sana

Application Number

US12/334,471
Publication Number

US 20100125594A1
Time in Patent Office

968 Days
Field of Search

707/763, 707/696, 707/688
US Class Current

707/673
CPC Class Codes

G06F 16/90344 by using string matching te...
