Selection of a set of optimal n-grams for indexing string data in a DBMS system under space constraints introduced by the system
First Claim
1. A system for selecting a set of n-grams for indexing string data in a database management system (DBMS) in relation to resources available to the DBMS, comprising:
- a processor configured to;
provide a set of candidate n-grams, each n-gram comprising a sequence of characters;
receive an n-gram space constraint to define an amount “
k”
of the set of candidate n-grams eligible for a minimal set of n-grams, the n-gram space constraint based on resources available to the DBMS;
compare each of the candidate n-grams from the provided set of candidate n-grams with sample queries and database records to determine a benefit associated with the candidate n-grams in reducing false hits;
select the minimal set of n-grams, the minimal set of n-grams having a highest total benefit and being subject to the n-gram space constraint;
select an updated minimal set of n-grams responsive to receiving an updated n-gram space constraint, the updated minimal set of n-grams having a highest total benefit and being subject to the updated n-gram space constraint, wherein the updated minimal set of n-grams consists of no more than “
k”
n-grams; and
generate an index, based on the minimal set of selected n-grams or the updated minimal set of n-grams, that indexes string data contained in the database records.
0 Assignments
0 Petitions
Accused Products
Abstract
The present invention provides a computer-readable medium and system for selecting a set of n-grams for indexing string data in a DBMS system. Aspects of the invention include providing a set of candidate n-grams, each n-gram comprising a sequence of characters; identifying sample queries having character strings containing the candidate n-grams; and based on the set of candidate n-grams, the sample queries, database records, and an n-gram space constraint, automatically selecting, given the space constraint, a minimal set of an n-grams from the set of candidate n-grams that minimizes the number of false hits for the set of sample queries had the sample queries been executed against the database records.
-
Citations
22 Claims
-
1. A system for selecting a set of n-grams for indexing string data in a database management system (DBMS) in relation to resources available to the DBMS, comprising:
-
a processor configured to; provide a set of candidate n-grams, each n-gram comprising a sequence of characters; receive an n-gram space constraint to define an amount “
k”
of the set of candidate n-grams eligible for a minimal set of n-grams, the n-gram space constraint based on resources available to the DBMS;compare each of the candidate n-grams from the provided set of candidate n-grams with sample queries and database records to determine a benefit associated with the candidate n-grams in reducing false hits; select the minimal set of n-grams, the minimal set of n-grams having a highest total benefit and being subject to the n-gram space constraint; select an updated minimal set of n-grams responsive to receiving an updated n-gram space constraint, the updated minimal set of n-grams having a highest total benefit and being subject to the updated n-gram space constraint, wherein the updated minimal set of n-grams consists of no more than “
k”
n-grams; andgenerate an index, based on the minimal set of selected n-grams or the updated minimal set of n-grams, that indexes string data contained in the database records. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A system for selecting a set of n-grams for indexing string data in a database management system (DBMS), wherein the DBMS includes a set of candidate n-grams, a set of sample queries, a set of database records, and an input space constraint ‘
- k’
related to resources allocated to the DBMS, the system comprising;a processor configured to; determine for each sample query from the set of sample queries, which n-grams in the candidate set are a substring of the sample query, and in response, forming a connection between the sample query and the n-gram; determine for each database record from the set of database records, which n-grams do not exist as a substring of the database record, and in response forming a connection between the database record and the n-gram; receive the input space constraint ‘
k’
the input space constraint ‘
k’
directly related to resources allocated to the DBMS;calculate a benefit of each n-gram in the candidate set, wherein the benefit of each n-gram is computed as a number of previously uncovered connections between the queries and the database records that are made through the n-gram, wherein a connection made through an n-gram between a query and a database record comprises a query-record pair; identify the n-gram with the highest computed benefit, and storing the identified n-gram in a selected n-gram set; recompute the benefits of remaining n-grams in the candidate set by performing the calculating and the identifying ‘
k’
times, thereby resulting in the selected n-gram set having ‘
k’
n-grams, such that the number of query-record pairs connected through best n-grams is a maximum over all possible sets of best n-gram sets;generate an index, based on the selected n-gram set; and use the generated index by the DBMS to service incoming queries, whereby the generated index created from the selected n-gram set minimizes false hits because the selected n-gram set maximizes reachability among unique query-record pairs and maximizes rejections. - View Dependent Claims (8, 9, 10, 11)
- k’
-
12. A non-transitory computer-readable storage medium containing program instructions to be executed on a computer, the program instructions for selecting a set of n-grams for indexing string data in a database management system (DBMS) in relation to resources available to the DBMS, wherein the computer performs the following functions comprising:
-
providing a set of candidate n-grams, each n-gram comprising a sequence of characters; receiving an n-gram space constraint to define an amount “
k”
of the set of candidate n-grams eligible for a minimal set of n-grams, the n-gram space constraint based on resources available to the DBMS;comparing each of the candidate n-grams from the provided set of candidate n-grams with sample queries and database records to determine a benefit associated with the candidate n-grams in reducing false hits; selecting the minimal set of n-grams, the minimal set of n-grams having a highest total benefit and being subject to the n-gram space constraint; selecting an updated minimal set of n-grams responsive to receiving an updated n-gram space constraint, the updated minimal set of n-grams having a highest total benefit and being subject to the updated n-gram space constraint, wherein the updated minimal set of n-grams consists of no more than “
k”
n-grams; andgenerating an index, based on the minimal set of selected n-grams or the updated minimal set of n-grams, that indexes string data contained in the database records. - View Dependent Claims (13, 14, 15, 16, 17)
-
-
18. A non-transitory computer-readable storage medium containing program instructions to be executed on a computer, the program instructions for selecting a set of n-grams for indexing string data in a database management system (DBMS), wherein the DBMS includes a set of candidate n-grams, a set of sample queries, a set of database records, and an input space constraint ‘
- k’
related to resources allocated to the DBMS, the program instructions comprising;determining for each sample query from the set of sample queries, which n-grams in the candidate set are a substring of the sample query, and in response, forming a connection between the sample query and the n-gram; determining for each database record from the set of database records, which n-grams do not exist as a substring of the database record, and in response forming a connection between the database record and the n-gram; receiving the input space constraint ‘
k’
the input space constraint ‘
k’
directly related to resources allocated to the DBMS;calculating a benefit of each n-gram in the candidate set, wherein the benefit of each n-gram is computed as a number of previously uncovered connections between the queries and the database records that are made through the n-gram, wherein a connection made through an n-gram between a query and a database record comprises a query-record pair; identifying the n-gram with the highest computed benefit, and storing the identified n-gram in a selected n-gram set; recomputing the benefits of remaining n-grams in the candidate set by performing the calculating and the identifying ‘
k’
times, thereby resulting in the selected n-gram set having ‘
k’
n-grams, such that the number of query-record pairs connected through best n-grams is a maximum over all possible sets of best n-gram sets;generating an index, based on the selected n-gram set; and using the generated index by the DBMS to service incoming queries, whereby the generated index created from the selected n-gram set minimizes false hits because the selected n-gram set maximizes reachability among unique query-record pairs and maximizes rejections. - View Dependent Claims (19, 20, 21, 22)
- k’
Specification