SELECTION OF A SET OF OPTIMAL N-GRAMS FOR INDEXING STRING DATA IN A DBMS SYSTEM UNDER SPACE CONSTRAINTS INTRODUCED BY THE SYSTEM
First Claim
1. A system for selecting a set of n-grams for indexing string data in a database management system (DBMS) in relation to resources available to the DBMS, comprising:
- means for providing a set of candidate n-grams, each n-gram comprising a sequence of characters;
means for receiving an n-gram space constraint to define an amount “
k”
of the set of candidate n-grams eligible for a minimal set of n-grams, the n-gram space constraint based on resources available to the DBMS;
means for comparing each of the candidate n-grams from the provided set of candidate n-grams with sample queries and database records to determine a benefit associated with the candidate n-grams in reducing false hits;
means for selecting the minimal set of n-grams, the minimal set of n-grams having a highest total benefit and being subject to the n-gram space constraint;
means for selecting an updated minimal set of n-grams responsive to receiving an updated n-gram space constraint, the updated minimal set of n-grams having a highest total benefit and being subject to the updated n-gram space constraint, wherein the updated minimal set of n-grams consists of no more than “
k”
n-grams; and
means for generating an index, based on the minimal set of selected n-grams or the updated minimal set of n-grams, that indexes string data contained in the database records.
0 Assignments
0 Petitions
Accused Products
Abstract
The present invention provides a computer-readable medium and system for selecting a set of n-grams for indexing string data in a DBMS system. Aspects of the invention include providing a set of candidate in grams, each n-gram comprising a sequence of characters; identifying sample queries having character strings containing the candidate n-grams; and based on the set of candidate n-grams, the sample queries, database records, and an n-gram space constraint, automatically selecting, given the space constraint, a minimal set of an n-grams from the set of candidate n-grams that minimizes the number of false hits for the set of sample queries had the sample queries been executed against the database records.
-
Citations
22 Claims
-
1. A system for selecting a set of n-grams for indexing string data in a database management system (DBMS) in relation to resources available to the DBMS, comprising:
-
means for providing a set of candidate n-grams, each n-gram comprising a sequence of characters; means for receiving an n-gram space constraint to define an amount “
k”
of the set of candidate n-grams eligible for a minimal set of n-grams, the n-gram space constraint based on resources available to the DBMS;means for comparing each of the candidate n-grams from the provided set of candidate n-grams with sample queries and database records to determine a benefit associated with the candidate n-grams in reducing false hits; means for selecting the minimal set of n-grams, the minimal set of n-grams having a highest total benefit and being subject to the n-gram space constraint; means for selecting an updated minimal set of n-grams responsive to receiving an updated n-gram space constraint, the updated minimal set of n-grams having a highest total benefit and being subject to the updated n-gram space constraint, wherein the updated minimal set of n-grams consists of no more than “
k”
n-grams; andmeans for generating an index, based on the minimal set of selected n-grams or the updated minimal set of n-grams, that indexes string data contained in the database records. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A system for selecting a set of n-grams for indexing string data in a database management system (DBMS), wherein, the DBMS includes a set of candidate n-grams, a set of sample queries, a set of database records, and an input space constraint ‘
- k’
related to resources allocated to the DBMS, the system comprising;means for determining for each sample query from the set of sample queries, which n-grams in the candidate set are a substring of the sample query, and in response, forming a connection between the sample query and the n-gram; means for determining for each database record from the set of database records, which n-grams do not exist as a substring of the database record, and in response forming a connection between the database record and the n-gram; means for receiving the input space constraint ‘
k’
, the input space constraint ‘
k’
directly related to resources allocated to the DBMS;means for calculating a benefit of each n-gram in the candidate set, wherein the benefit of each n-gram is computed as a number of previously uncovered connections between the queries and the database records that are made through the n-gram, wherein a connection made through an n-gram between a query and a database record comprises a query-record pair; means for identifying the n-gram with the highest computed benefit, and storing the identified n-gram in a selected n-gram set; means for recomputing the benefits of remaining n-grams in the candidate set by performing the calculating means and the identifying means ‘
k’
times, thereby resulting in the selected n-gram set having ‘
k’
n-grams, such that the number of query-record pairs connected through best n-grams is a maximum over all possible sets of best n-gram sets;means for generating an index, based on the selected n-gram set; and means for using the generated index by the DBMS to service incoming queries, whereby the generated index created from the selected n-gram set minimizes false hits because the selected n-gram set maximizes reachability among unique query-record pairs and maximizes rejections. - View Dependent Claims (8, 9, 10, 11)
- k’
-
12. A computer-readable medium containing program instructions to be executed on a computer, the program instructions for selecting a set of n-grams for indexing string data in a database management system (DBMS) in relation to resources available to the DBMS, the program instructions comprising:
-
providing a set of candidate n-grams, each n-gram comprising a sequence of characters; receiving an n-gram space constraint to define an amount “
k”
of the set of candidate n-grams eligible for a minimal set of n-grams, the n-gram space constraint based on resources available to the DBMS;comparing each of the candidate n-grams from the provided set of candidate n-grams with sample queries and database records to determine a benefit associated with the candidate n-grams in reducing false hits; selecting the minimal set of n-grams, the minimal set of n-grams having a highest total benefit and being subject to the n-gram space constraint; selecting an updated minimal set of n-grams responsive to receiving an updated n-gram space constraint, the updated minimal set of n-grams having a highest total benefit and being subject to the updated n-gram space constraint, wherein the updated minimal set of n-grams consists of no more than “
k”
n-grams; andgenerating an index, based on the minimal set of selected n-grams or the updated minimal set of n-grams, that indexes string data contained in the database records. - View Dependent Claims (13, 14, 15, 16, 17)
-
-
18. A computer-readable medium containing program instructions to be executed on a computer, the program instructions for selecting a set of n-grams for indexing string data in a database management system (DBMS), wherein, the DBMS includes a set of candidate n-grams, a set of sample queries, a set of database records, and an input space constraint ‘
- k’
related to resources allocated to the DBMS, the program instructions comprising;determining for each sample query from the set of sample queries, which n-grams in the candidate set are a substring of the sample query, and in response, forming a connection between the sample query and the n-gram; determining for each database record from the set of database records, which n-grams do not exist as a substring of the database record, and in response forming a connection between the database record and the n-gram; receiving the input space constraint ‘
k’
, the input space constraint ‘
k’
directly related to resources allocated to the DBMS;calculating a benefit of each n-gram in the candidate set, wherein the benefit of each n-gram is computed as a number of previously uncovered connections between the queries and the database records that are made through the n-gram, wherein a connection made through an n-gram between a query and a database record comprises a query-record pair; identifying the n-gram with the highest computed benefit, and storing the identified n-gram in a selected n-gram set; recomputing the benefits of remaining n-grams in the candidate set by performing the calculating and the identifying ‘
k’
times, thereby resulting in the selected n-gram set having ‘
k’
n-grams, such that the number of query-record pairs connected through best n-grams is a maximum over all possible sets of best n-gram sets;generating an index, based on the selected n-gram set; and using the generated index by the DBMS to service incoming queries, whereby the generated index created from the selected n-gram set minimizes false hits because the selected n-gram set maximizes reachability among unique query-record pairs and maximizes rejections. - View Dependent Claims (19, 20, 21, 22)
- k’
Specification