Method of performing approximate substring indexing

US 7,010,522 B1
Filed: 06/17/2002
Issued: 03/07/2006
Est. Priority Date: 06/17/2002
Status: Expired due to Fees

Abstract

Approximate substring indexing is accomplished by decomposing each string in a database into overlapping “positional q-grams”, sequences of a predetermined length q, and containing information regarding the “position” of each q-gram within the string (i.e., 1st q-gram, 4th q-gram, etc.). An index is then formed of the tuples of the positional q-gram data (such as, for example, a B-tree index or a hash index). Each query applied to the database is similarly parsed into a plurality of positional q-grams (of the same length), and a candidate set of matches is found. Position-directed filtering is used to remove the candidates which have the q-grams in the wrong order and/or too far apart to form a “verified” output of matching candidates. If errors are permitted (defined in terms of an edit distance between each candidate and the query), an edit distance calculation can then be performed to produce the final set of matching strings.
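
The decomposition and indexing described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `positional_qgrams` and `build_index`, the 1-based positions, the absence of padding characters, and the use of a Python dict as a stand-in for a hash index are all assumptions made for the example.

```python
from collections import defaultdict


def positional_qgrams(s, q):
    """Return (position, q-gram) tuples for every overlapping window of length q.

    Positions are 1-based here; that numbering is an assumption of this sketch.
    """
    return [(i + 1, s[i:i + q]) for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each q-gram to the (string_id, position) pairs where it occurs.

    A dict stands in for the hash index mentioned in the abstract.
    """
    index = defaultdict(list)
    for string_id, s in enumerate(strings):
        for pos, gram in positional_qgrams(s, q):
            index[gram].append((string_id, pos))
    return index


database = ["substring", "string index"]  # illustrative data only
index = build_index(database, q=3)

print(positional_qgrams("string", 3))  # [(1, 'str'), (2, 'tri'), (3, 'rin'), (4, 'ing')]
print(index["str"])                    # [(0, 4), (1, 1)]
```

Storing the position next to each q-gram is what later allows candidates to be rejected when their matching q-grams appear in the wrong order or too far apart.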

10 Claims

1. A method of indexing a query substring Q against a collection of data strings in a database D, the method comprising the steps of:
- a) preprocessing each string σ in database D to generate a plurality of overlapping q-grams of a predetermined length q, augmenting each q-gram with information indicating its position within string σ to form a tuple comprising the position information and the q-gram, and creating an index of the positional q-gram tuple;
  
  b) parsing the query substring Q into a plurality of overlapping positional q-grams of length q;
  
  c) searching each index in database D to retrieve potential matches between the query Q substring plurality of overlapping q-grams and the preprocessed database D plurality of overlapping q-grams, a potential match defined as having a predetermined number of matching overlapping q-grams;
  
  d) applying position-directed filtering to the potential matches retrieved in step c) to form a candidate set including only those potential matches with at least some q-grams in the same position order as substring query Q;
  
  e) defining a predetermined maximum edit distance k between the query substring Q and database D;
  
  f) after applying the position-directed filtering, calculating the edit distance between each candidate substring and the query substring; and
  
  g) verifying the candidate set by removing from the candidate set each candidate substring having an edit distance greater than k.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The method as defined in claim 1 wherein in performing step d) the candidate set is reduced by maintaining only the potential matches with a majority of its q-grams in the same position order as substring query Q.
  - 3. The method as defined in claim 1, wherein in performing step a), a B-tree index is created for each positional q-gram tuple.
  - 4. The method as defined in claim 1, wherein in performing step a), a hash index is created for each positional q-gram tuple.
  - 5. The method as defined in claim 1 wherein in performing step d), the following steps are performed:
    - i) determining a predetermined number of q-gram matches required to define a substring match and a predetermined maximum separation distance; and
      
      ii) comparing the order of the candidate matching q-grams against the query q-grams, retaining only candidate substrings with at least the predetermined number of q-gram matches in the same order as the query substring within the predetermined maximum separation distance.
  - 6. The method as defined in claim 1 wherein the edit distance k is defined as the total number of changes, in terms of additions, deletions, and substitutions, required to transform the candidate substring into the query substring.
  - 7. The method as defined in claim 1 wherein k=0 and exact matching is required.
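
Steps c) through g) of claim 1, together with the order and separation test of claim 5 and the edit distance of claim 6, can be sketched as below. This is an illustration under stated assumptions, not the method as implemented by the inventors: all function names are hypothetical, the database is scanned string by string instead of being probed through an index, the ordering test is a simple longest-chain computation, and verification uses the standard dynamic-programming approach for matching a pattern against any substring of a text.

```python
def positional_qgrams(s, q):
    """(position, q-gram) tuples with 1-based positions (an assumption of this sketch)."""
    return [(i + 1, s[i:i + q]) for i in range(len(s) - q + 1)]


def ordered_match_count(query, text, q, max_separation):
    """Length of the longest chain of shared q-grams that occur in the same
    order in both strings, with neighbouring matches at most max_separation
    positions apart in text."""
    text_grams = positional_qgrams(text, q)
    matches = sorted((qp, tp)
                     for qp, qg in positional_qgrams(query, q)
                     for tp, tg in text_grams if qg == tg)
    best = [1] * len(matches)
    for i, (qp_i, tp_i) in enumerate(matches):
        for j in range(i):
            qp_j, tp_j = matches[j]
            if qp_j < qp_i and tp_j < tp_i and tp_i - tp_j <= max_separation:
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)


def substring_edit_distance(query, text):
    """Smallest number of insertions, deletions, and substitutions needed to
    turn some substring of text into query."""
    prev = list(range(len(query) + 1))  # no text consumed yet
    best = prev[-1]
    for ch in text:
        cur = [0]  # a match may start at any text position
        for j, qch in enumerate(query, start=1):
            cur.append(min(prev[j - 1] + (ch != qch),  # substitute or match
                           prev[j] + 1,                # extra text character
                           cur[j - 1] + 1))            # extra query character
        best = min(best, cur[-1])
        prev = cur
    return best


def approximate_search(query, database, q, k, min_matches, max_separation):
    results = []
    for text in database:
        # Steps c) and d): require enough shared q-grams in the same order.
        if ordered_match_count(query, text, q, max_separation) < min_matches:
            continue
        # Steps e) to g): keep only candidates within edit distance k.
        if substring_edit_distance(query, text) <= k:
            results.append(text)
    return results


database = ["approximate indexing", "positional filter", "ngexdein"]  # illustrative data
hits = approximate_search("indexng", database, q=2, k=1,
                          min_matches=3, max_separation=3)
print(hits)  # ['approximate indexing']
```

The string "ngexdein" shares four 2-grams with the query, so a count-only filter with `min_matches=3` would keep it, but those q-grams occur in reverse order and the position-directed test discards it before any edit distance is computed. Setting `k=0` reduces the final step to exact substring matching, as in claim 7.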

8. A database processing and searching system comprising:
- a computing arrangement including an input device for receiving a query substring Q to be searched, a processor, an output device and a storage device; and
  
  a database D of string data stored on the storage device using an index of positional q-grams for each string, each q-gram overlapping with its neighbor and stored as a tuple containing both the q-gram and information related to its position within the string, wherein the computing arrangement is used to define a predetermined maximum edit distance k between the query substring Q and a database D and to calculate the edit distance between each candidate substring and the query substring;
  
  the processor for parsing a given query substring into a plurality of overlapping q-grams and searching each index in the database to determine candidate matching strings, a potential matching string defined as having a predetermined number of matching overlapping q-grams, said processor further comprising a position-directed filtering element for eliminating candidate strings having at least one of the following characteristics:
  
  (1) q-grams in a different position order, (2) q-grams in the same position but separated by greater than a predetermined distance when compared to the query substring, and (3) candidate strings having an edit distance greater than k, to form a verified candidate string set, wherein the processor passes the verified candidate string set to the output device as the output of the database processing and searching system.
- Dependent claims (9, 10)
  - 9. The database processing and searching system as defined in claim 8 wherein B-tree indexes are used to store the positional q-gram information in the database.
  - 10. The database processing and searching system as defined in claim 8 wherein hash indexes are used to store the positional q-gram information in the database.
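
Claims 9 and 10 name two index structures for the positional q-gram tuples. The sketch below contrasts the two access patterns using only the Python standard library. It rests on assumptions: a sorted list searched with `bisect` stands in for a B-tree (Python ships no B-tree type), a dict stands in for a hash index, and the names `build_ordered_index` and `ordered_lookup` are hypothetical.

```python
import bisect
from collections import defaultdict


def positional_qgrams(s, q):
    """(position, q-gram) tuples with 1-based positions (an assumption of this sketch)."""
    return [(i + 1, s[i:i + q]) for i in range(len(s) - q + 1)]


def build_ordered_index(strings, q):
    """Sorted list of (q-gram, string_id, position) entries.

    Keeping entries in key order is the property a B-tree provides;
    the sorted list here is only a stand-in for one.
    """
    return sorted((gram, sid, pos)
                  for sid, s in enumerate(strings)
                  for pos, gram in positional_qgrams(s, q))


def ordered_lookup(entries, gram):
    """Binary-search the sorted entries, then collect every posting for gram."""
    lo = bisect.bisect_left(entries, (gram,))
    hi = lo
    while hi < len(entries) and entries[hi][0] == gram:
        hi += 1
    return [(sid, pos) for _, sid, pos in entries[lo:hi]]


strings = ["substring", "string index"]  # illustrative data only
ordered = build_ordered_index(strings, 3)

# Hash-style index over the same tuples, for comparison.
hashed = defaultdict(list)
for gram, sid, pos in ordered:
    hashed[gram].append((sid, pos))

print(ordered_lookup(ordered, "str"))  # [(0, 4), (1, 1)]
print(hashed["str"])                   # [(0, 4), (1, 1)]
```

Both structures return the same postings for an exact q-gram probe. An ordered structure additionally keeps neighbouring keys adjacent, while a hash structure offers direct lookup by key; which suits a given workload is a design choice this sketch does not settle.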

Current Assignee: AT&T Corporation (AT&T, Inc.)
Original Assignee: AT&T Corporation (AT&T, Inc.)
Inventors: Jagadish, Hosagrahar Visvesvaraya; Koudas, Nikolaos; Muthukrishnan, Shanmugavelayutham; Srivastava, Divesh
Primary Examiner(s): Le, Debbie M.
Application Number: US10/174,218
Time in Patent Office: 1,359 Days
Field of Search: 707/1-7, 707/100-102, 704/1, 704/4-9
US Class Current: 1/1
CPC Class Codes:
G06F 16/332   Query formulation
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99943   Generating database or data...
