Selection of a set of optimal n-grams for indexing string data in a DBMS system under space constraints introduced by the system
First Claim
1. A method for selecting a set of n-grams for indexing string data in a DBMS system, comprising:
- providing a set of candidate n-grams, each n-gram comprising a sequence of characters;
identifying sample queries having character strings containing the candidate n-grams; and
based on the set of candidate n-grams, the sample queries, database records, and an n-gram space constraints, automatically selecting, given the space constraint, a minimal set of n-grams from the set of candidate n-grams that minimizes the number of false hits for the set of sample queries had the sample queries been executed against the database records.
1 Assignment
0 Petitions
Accused Products
Abstract
The present invention provides a computer-readable medium and system for selecting a set of n-grams for indexing string data in a DBMS system. Aspects of the invention include providing a set of candidate in grams, each n-gram comprising a sequence of characters; identifying sample queries having character strings containing the candidate n-grams; and based on the set of candidate n-grams, the sample queries, database records, and an n-gram space constraint, automatically selecting, given the space constraint, a minimal set of an n-grams from the set of candidate n-grams that minimizes the number of false hits for the set of sample queries had the sample queries been executed against the database records.
27 Citations
30 Claims
-
1. A method for selecting a set of n-grams for indexing string data in a DBMS system, comprising:
-
providing a set of candidate n-grams, each n-gram comprising a sequence of characters;
identifying sample queries having character strings containing the candidate n-grams; and
based on the set of candidate n-grams, the sample queries, database records, and an n-gram space constraints, automatically selecting, given the space constraint, a minimal set of n-grams from the set of candidate n-grams that minimizes the number of false hits for the set of sample queries had the sample queries been executed against the database records. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A method for selecting a set of n-grams for indexing string data in a DBMS system, wherein, the DBMS system includes a set of candidate n-grams, a set of sample queries, and a set of database records, the method comprising:
-
a) determine for each sample query, which n-grams in the candidate set are substring of the query, and in response, forming a connection between the sample query and the n-gram;
b) determine for each record, which n-grams do not exist as a substring of the record, and in response forming a connection between the record and the n-gram;
c) receiving an input space constraint, k;
d) calculating the benefit of each n-gram in the candidate set, where the benefit of each n-gram is computed as a number of previously uncovered connections between the queries and the records that are made through the n-gram, wherein a connection made through an n-gram between a query and a record comprises a query-record pair;
e) identifying the n-gram with the highest computed benefit, removing the n-gram from the candidate n-gram set, and storing the n-gram in a selected n-gram set;
f) recomputing the benefits of remaining n-grams in the candidate set by performing steps d) and f) k times, thereby resulting in the selected n-gram set having k n-grams, such that the number of record-query pairs connected through the best n-grams is maximum over all possible sets of best n-gram sets. - View Dependent Claims (9, 10, 11, 12, 13, 14)
-
-
15. An optimal n-gram indexing relational database management system (DBMS) capable of being executed in a computer system, comprising:
-
a set of sample queries;
a plurality of database records;
means for selecting an optimal set of n-grams from a set of candidate n-grams, the selecting means functional for, receiving as input a space constraint, k, for limiting the optimal set of n-grams; and
automatically selecting a set of n-grams of size k from the set of candidate n-grams that minimizes the number of false hits for the set of sample queries had the sample queries been executed against the database records. - View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23)
-
-
24. A computer-readable medium containing program instructions for selecting a set of n-grams for indexing string data in a DBMS system, the program instructions for:
-
providing a set of candidate n-grams, each n-gram comprising a sequence of characters;
identifying sample queries having character strings containing the candidate n-grams; and
based on the set of candidate n-grams, the sample queries, database records, and an n-gram space constraints, automatically selecting, given the space constraint, a minimal set of n-grams from the set of candidate n-grams that minimizes the number of false hits for the set of sample queries had the sample queries been executed against the database records. - View Dependent Claims (25, 26, 27, 28, 29, 30)
-
Specification