Method of performing approximate substring indexing
First Claim
1. A method of indexing a query substring Q against a collection of data strings in a database D, the method comprising the steps of:
- a) preprocessing each string σ
in database D to generate a plurality of overlapping q-grams of a predetermined length q augmenting each q-gram with information indicating its position within string σ
to form a tuple comprising the position information and the q-gram, and creating an index of the positional q-gram tuple;
b) parsing the query substring Q into a plurality of overlapping positional q-grams of length q;
c) searching each index in database D to retrieve potential matches between the query Q substring plurality of overlapping q-grams and the preprocessed database D plurality of overlapping q-grams, a potential match defined as having a predetermined number of matching overlapping q-grams;
d) applying position-directed filtering to the potential matches retrieved in step c) to form a candidate set including only those potential matches with at least some q-grams in the same position order as substring query Qe) defining a predetermined maximum edit distance k between the query substring Q and database D;
f) after applying the position-directed filtering, calculating the edit distance between each candidate substring and the query substring; and
g) verifying the candidate set by removing from the candidate set each candidate substring having an edit distance greater than k.
2 Assignments
0 Petitions
Accused Products
Abstract
Approximate substring indexing is accomplished by decomposing each string in a database into overlapping “positional q-grams”, sequences of a predetermined length q, and containing information regarding the “position” of each q-gram within the string (i.e., 1st q-gram, 4th q-gram, etc.). An index is then formed of the tuples of the positional q-gram data (such as, for example, a B-tree index or a hash index). Each query applied to the database is similarly parsed into a plurality of positional q-grams (of the same length), and a candidate set of matches is found. Position-directed filtering is used to remove the candidates which have the q-grams in the wrong order and/or too far apart to form a “verified” output of matching candidates. If errors are permitted (defined in terms of an edit distance between each candidate and the query), an edit distance calculation can then be performed to produce the final set of matching strings.
-
Citations
10 Claims
-
1. A method of indexing a query substring Q against a collection of data strings in a database D, the method comprising the steps of:
-
a) preprocessing each string σ
in database D to generate a plurality of overlapping q-grams of a predetermined length q augmenting each q-gram with information indicating its position within string σ
to form a tuple comprising the position information and the q-gram, and creating an index of the positional q-gram tuple;b) parsing the query substring Q into a plurality of overlapping positional q-grams of length q; c) searching each index in database D to retrieve potential matches between the query Q substring plurality of overlapping q-grams and the preprocessed database D plurality of overlapping q-grams, a potential match defined as having a predetermined number of matching overlapping q-grams; d) applying position-directed filtering to the potential matches retrieved in step c) to form a candidate set including only those potential matches with at least some q-grams in the same position order as substring query Q e) defining a predetermined maximum edit distance k between the query substring Q and database D; f) after applying the position-directed filtering, calculating the edit distance between each candidate substring and the query substring; and g) verifying the candidate set by removing from the candidate set each candidate substring having an edit distance greater than k. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A database processing and searching system comprising:
-
a computing arrangement including an input device for receiving a query substring Q to be searched, a processor, an output device and a storage device; and a database D of string data stored on the storage device using an index of positional q-grams for each string, each q-gram overlapping with its neighbor and stored as a tuple containing both the q-gram and information related to its position within the string, wherein the computing arrangement is used to defined a predetermined maximum edit distance k between the query substring Q and a database D, the processor for parsing a given query substring into a plurality of overlapping q-grams and searching each index in the database to determine candidate matching strings, a potential matching string and calculate the edit distance between each candidate substring and the edit distance between each candidate substring and the query substring;
defined as having a predetermined number of matching overlapping q-grams, said processor further comprising a position-directed filtering element for eliminating candidate strings having at least one of the following characteristics;
(1) q-grams in a different position order (2) q-grams in the same position but separated by greater than a predetermined distance when compared to the query substring, and (3) candidate strings having an edit distance greater than k, to form a verified candidate string set, wherein the processor passes the verified candidate string set to the output device as the output of the database processing and searching system. - View Dependent Claims (9, 10)
-
Specification