Method and apparatus for improving performance of approximate string queries using variable length high-quality grams
First Claim
1. An improvement in an indexing method for efficient approximate string search of a query string s against a collection of data strings S corresponding to a gram dictionary D in a computer system comprising:
- preprocessing the dictionary D into a plurality of grams of varying length between qmin and qmax;
- starting from a current position in the query string s, searching for the longest substring that matches a gram in the dictionary D, if no such gram exists in the dictionary D, then materializing a substring of length qmin starting from the current position;
- checking if the found or materialized substring is a positional substring already found in the query string s, and if so, then not producing a positional gram corresponding to the found or materialized substring, otherwise producing a positional gram corresponding to the found or materialized substring; and
- indexing the current position by one to the right in the query string and repeating the searching and checking until the current position in the query string s is greater than |s| − qmin + 1, where |s| is the length of query string s, so that a gram index list for query string s having variable gram length is generated denoted as the set of positional grams VG(s, D, qmin, qmax).
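The decomposition recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation and rests on several assumptions. The function name `vgen` is hypothetical. The dictionary is held in a plain Python set rather than a preprocessed index structure. Positions are 0-based, whereas the claim counts from 1. The check for a "positional substring already found" is interpreted as: the candidate's span lies entirely inside the span of a positional gram produced earlier. The example dictionary is purely illustrative.

```python
def vgen(s, dictionary, qmin, qmax):
    """Decompose s into variable-length positional grams (0-based positions).

    Sketch of the claimed steps; `dictionary` is assumed to be a set of
    grams whose lengths lie between qmin and qmax.
    """
    grams = []        # produced positional grams as (position, gram) pairs
    covered_end = 0   # exclusive end of the furthest-reaching gram so far

    # 0-based p = 0 .. len(s) - qmin corresponds to the claim's
    # 1-based positions 1 .. |s| - qmin + 1.
    for p in range(len(s) - qmin + 1):
        # Fallback: materialize a substring of length qmin.
        gram = s[p:p + qmin]
        # Search for the longest dictionary gram starting at p.
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            candidate = s[p:p + length]
            if candidate in dictionary:
                gram = candidate
                break

        end = p + len(gram)
        # Start positions only increase, so the candidate is contained in an
        # earlier gram exactly when it does not reach past covered_end.
        if end <= covered_end:
            continue
        grams.append((p, gram))
        covered_end = end

    return grams


# Illustrative dictionary (an assumption, not data from the patent).
example = vgen("universal", {"ni", "ivr", "sal", "uni", "vers"}, 2, 4)
print(example)  # [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

With an empty dictionary the sketch falls back to ordinary fixed-length grams of length qmin, one per position. A production index would more likely use a trie for the longest-match search, so that each position costs at most qmax character comparisons instead of repeated set lookups.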
Abstract
A computer process, called VGRAM, improves the performance of approximate string search algorithms in computers by using a carefully chosen dictionary of variable-length grams selected according to their frequencies in the string collection. A dynamic programming algorithm is disclosed for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance. A method is also disclosed for automatically computing a dictionary of high-quality grams for a workload of queries. These techniques improve query performance through a cost-based quantitative approach to deciding good grams for approximate string queries. Two approaches are disclosed for answering approximate queries efficiently: one based on discarding gram lists, and another based on combining correlated lists. Using these algorithms, an indexing structure is reduced to a given amount of space while retaining efficient query processing.
10 Claims
Claims 2 through 10 depend on claim 1.
Specification