Selection of a set of optimal n-grams for indexing string data in a DBMS system under space constraints introduced by the system

US 8,001,128 B2
Filed: 11/04/2008
Issued: 08/16/2011
Est. Priority Date: 11/05/2004
Status: Expired due to Fees

Abstract

The present invention provides a computer-readable medium and system for selecting a set of n-grams for indexing string data in a DBMS system. Aspects of the invention include providing a set of candidate n-grams, each n-gram comprising a sequence of characters; identifying sample queries having character strings containing the candidate n-grams; and based on the set of candidate n-grams, the sample queries, database records, and an n-gram space constraint, automatically selecting, given the space constraint, a minimal set of n-grams from the set of candidate n-grams that minimizes the number of false hits for the set of sample queries had the sample queries been executed against the database records.
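The objective named in the abstract can be illustrated with a minimal sketch. It assumes a "false hit" is a record that an n-gram filter cannot rule out for a query even though the record does not actually contain the query string; the function names `candidates` and `false_hits` and the sample data are hypothetical and do not come from the patent.

```python
def candidates(query, records, ngrams):
    """Records that the selected n-grams cannot rule out for this query."""
    # Only n-grams that occur in the query can be used to filter records.
    usable = [g for g in ngrams if g in query]
    # A record survives if it contains every usable n-gram.
    return [r for r in records if all(g in r for g in usable)]


def false_hits(queries, records, ngrams):
    """Total number of surviving records that do not really match."""
    total = 0
    for q in queries:
        total += sum(1 for r in candidates(q, records, ngrams) if q not in r)
    return total


records = ["database", "datagram", "program", "baseline"]
queries = ["data", "gram"]
for chosen in (set(), {"at"}, {"at", "gr"}):
    print(sorted(chosen), false_hits(queries, records, chosen))
```

Under these assumptions, adding n-grams that appear in the sample queries but are absent from many records lowers the count, which is the quantity the selection is meant to minimize within the space constraint.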

22 Claims

1. A system for selecting a set of n-grams for indexing string data in a database management system (DBMS) in relation to resources available to the DBMS, comprising:
- a processor configured to;
  - provide a set of candidate n-grams, each n-gram comprising a sequence of characters;
  - receive an n-gram space constraint to define an amount “k” of the set of candidate n-grams eligible for a minimal set of n-grams, the n-gram space constraint based on resources available to the DBMS;
  - compare each of the candidate n-grams from the provided set of candidate n-grams with sample queries and database records to determine a benefit associated with the candidate n-grams in reducing false hits;
  - select the minimal set of n-grams, the minimal set of n-grams having a highest total benefit and being subject to the n-gram space constraint;
  - select an updated minimal set of n-grams responsive to receiving an updated n-gram space constraint, the updated minimal set of n-grams having a highest total benefit and being subject to the updated n-gram space constraint, wherein the updated minimal set of n-grams consists of no more than “k” n-grams; and
  - generate an index, based on the minimal set of selected n-grams or the updated minimal set of n-grams, that indexes string data contained in the database records.
- 2. The system of claim 1 wherein, the processor uses:
  - the index to service incoming queries.
- 3. The system of claim 1 wherein the n-gram selection is a problem that is NP-hard, the system further including:
  - the processor configured to formulate n-gram selection as a graph model, and solving the problem using an approximation algorithm.
- 4. The system of claim 3 further including:
  - the processor configured to provide the approximation algorithm with a provable ratio bound of an optimal solution to the n-gram selection problem.
- 5. The system of claim 3 further including:
  - the processor configured to implement the approximation algorithm as a utility of the DBMS.
- 6. The system of claim 1 wherein the DBMS is a SQL-based DBMS, further including:
  - the processor configured to identify SQL “LIKE” queries containing the candidate n-grams.
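Claim 6 refers to identifying SQL LIKE queries that contain candidate n-grams. A minimal sketch of one way this could be done follows. The helper names `like_literals` and `queries_containing`, the regular expression, and the sample statement are all assumptions for illustration; the sketch treats `%` and `_` as wildcards and does not handle ESCAPE clauses.

```python
import re

# Matches LIKE '<pattern>' and allows doubled single quotes inside the pattern.
LIKE_RE = re.compile(r"\bLIKE\s+'((?:[^']|'')*)'", re.IGNORECASE)


def like_literals(sql):
    """Return the literal fragments of every LIKE pattern in a statement."""
    fragments = []
    for pattern in LIKE_RE.findall(sql):
        pattern = pattern.replace("''", "'")
        # Split on the SQL wildcards so only literal text remains.
        fragments.extend(f for f in re.split(r"[%_]", pattern) if f)
    return fragments


def queries_containing(candidate_ngrams, sql_statements):
    """Map each LIKE query to the candidate n-grams found in its literals."""
    found = {}
    for sql in sql_statements:
        grams = {g for lit in like_literals(sql)
                 for g in candidate_ngrams if g in lit}
        if grams:
            found[sql] = grams
    return found


sql = "SELECT * FROM t WHERE name LIKE '%data_base%'"
print(like_literals(sql))
print(queries_containing({"at", "as", "zz"}, [sql, "SELECT 1"]))
```

Splitting on wildcards matters because an n-gram that straddles a `%` or `_` is not guaranteed to appear contiguously in a matching record.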

7. A system for selecting a set of n-grams for indexing string data in a database management system (DBMS), wherein the DBMS includes a set of candidate n-grams, a set of sample queries, a set of database records, and an input space constraint ‘k’ related to resources allocated to the DBMS, the system comprising;
- a processor configured to;
  - determine for each sample query from the set of sample queries, which n-grams in the candidate set are a substring of the sample query, and in response, forming a connection between the sample query and the n-gram;
  - determine for each database record from the set of database records, which n-grams do not exist as a substring of the database record, and in response forming a connection between the database record and the n-gram;
  - receive the input space constraint ‘k’ the input space constraint ‘k’ directly related to resources allocated to the DBMS;
  - calculate a benefit of each n-gram in the candidate set, wherein the benefit of each n-gram is computed as a number of previously uncovered connections between the queries and the database records that are made through the n-gram, wherein a connection made through an n-gram between a query and a database record comprises a query-record pair;
  - identify the n-gram with the highest computed benefit, and storing the identified n-gram in a selected n-gram set;
  - recompute the benefits of remaining n-grams in the candidate set by performing the calculating and the identifying ‘k’ times, thereby resulting in the selected n-gram set having ‘k’ n-grams, such that the number of query-record pairs connected through best n-grams is a maximum over all possible sets of best n-gram sets;
  - generate an index, based on the selected n-gram set; and
  - use the generated index by the DBMS to service incoming queries, whereby the generated index created from the selected n-gram set minimizes false hits because the selected n-gram set maximizes reachability among unique query-record pairs and maximizes rejections.
- 8. The system of claim 7 wherein the processor is configured to:
  - implement an approximation algorithm as a utility of the DBMS.
- 9. The system of claim 8 wherein the approximation algorithm provides a near optimal solution to an NP-hard n-gram selection problem, wherein the processor is configured to:
  - provide the approximation algorithm with a definable ratio bound of the optimal solution.
- 10. The system of claim 9 wherein the definable ratio bound is:
- 11. The system of claim 8 wherein the DBMS is a relational database system.
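The steps recited in claim 7 read like a greedy coverage procedure, and a minimal sketch under that reading is given here. It assumes an n-gram "connects" a query-record pair when the n-gram is a substring of the query and is not a substring of the record (so the n-gram lets the index reject that record for that query). The function name `greedy_select`, the alphabetical tie-breaking, the early stop when no n-gram adds benefit, and the sample data are all assumptions, not part of the claim text.

```python
def greedy_select(candidate_ngrams, queries, records, k):
    """Pick up to k n-grams, each time taking the one with the highest
    number of previously uncovered query-record pairs."""
    # Pairs (query index, record index) connected through each n-gram.
    covers = {}
    for g in candidate_ngrams:
        q_ids = [i for i, q in enumerate(queries) if g in q]
        r_ids = [j for j, r in enumerate(records) if g not in r]
        covers[g] = {(i, j) for i in q_ids for j in r_ids}

    covered = set()
    selected = []
    remaining = set(candidate_ngrams)
    for _ in range(min(k, len(remaining))):
        # Benefit is recomputed against what is already covered.
        best = max(sorted(remaining), key=lambda g: len(covers[g] - covered))
        if not covers[best] - covered:
            break  # assumption: stop early once nothing adds benefit
        selected.append(best)
        covered |= covers[best]
        remaining.discard(best)
    return selected, covered


queries = ["data", "gram"]
records = ["database", "datagram", "program", "baseline"]
chosen, pairs = greedy_select(["at", "gr", "am", "da"], queries, records, k=2)
print(chosen, len(pairs))
```

In this toy data, "am" and "gr" cover the same pairs, as do "at" and "da", so after one n-gram from each group is chosen the others have zero remaining benefit. That is why the benefit has to be recomputed after every pick rather than ranked once up front.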

12. A non-transitory computer-readable storage medium containing program instructions to be executed on a computer, the program instructions for selecting a set of n-grams for indexing string data in a database management system (DBMS) in relation to resources available to the DBMS, wherein the computer performs the following functions comprising:
- providing a set of candidate n-grams, each n-gram comprising a sequence of characters;
- receiving an n-gram space constraint to define an amount “k” of the set of candidate n-grams eligible for a minimal set of n-grams, the n-gram space constraint based on resources available to the DBMS;
- comparing each of the candidate n-grams from the provided set of candidate n-grams with sample queries and database records to determine a benefit associated with the candidate n-grams in reducing false hits;
- selecting the minimal set of n-grams, the minimal set of n-grams having a highest total benefit and being subject to the n-gram space constraint;
- selecting an updated minimal set of n-grams responsive to receiving an updated n-gram space constraint, the updated minimal set of n-grams having a highest total benefit and being subject to the updated n-gram space constraint, wherein the updated minimal set of n-grams consists of no more than “k” n-grams; and
- generating an index, based on the minimal set of selected n-grams or the updated minimal set of n-grams, that indexes string data contained in the database records.
- 13. The non-transitory computer-readable storage medium of claim 12 further including:
  - using the index to service incoming queries.
- 14. The non-transitory computer-readable storage medium of claim 12 wherein the n-gram selection is a problem that is NP-hard, the non-transitory computer-readable storage medium further including:
  - formulating n-gram selection as a graph model, and solving the problem using an approximation algorithm.
- 15. The non-transitory computer-readable storage medium of claim 14 further including:
  - providing the approximation algorithm with a provable ratio bound of an optimal solution to the n-gram selection problem.
- 16. The non-transitory computer-readable storage medium of claim 14 further including:
  - implementing the approximation algorithm as a utility of the DBMS.
- 17. The non-transitory computer-readable storage medium of claim 12 wherein the DBMS is a SQL-based DBMS, further including:
  - identifying SQL “LIKE” queries containing the candidate n-grams.
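Claim 12 recites regenerating the index when an updated space constraint arrives. One way this could be wired together is sketched here, assuming the index is a simple mapping from each selected n-gram to the set of record positions that contain it. The class `NGramIndexManager`, the function `build_index`, and the stand-in selection callable are hypothetical names chosen for illustration; the selection routine itself is passed in and is not implemented here.

```python
from typing import Callable, Dict, List, Sequence, Set


def build_index(records: Sequence[str], ngrams: Sequence[str]) -> Dict[str, Set[int]]:
    """Posting lists: n-gram -> positions of the records that contain it."""
    return {g: {i for i, r in enumerate(records) if g in r} for g in ngrams}


class NGramIndexManager:
    """Rebuilds the index whenever the space constraint k changes."""

    def __init__(self, records: Sequence[str], select: Callable[[int], List[str]]):
        self.records = list(records)
        self.select = select  # assumed to return at most k n-grams
        self.index: Dict[str, Set[int]] = {}

    def set_space_constraint(self, k: int) -> Dict[str, Set[int]]:
        chosen = self.select(k)
        if len(chosen) > k:
            raise ValueError("selection exceeds the space constraint")
        self.index = build_index(self.records, chosen)
        return self.index


records = ["database", "datagram", "program", "baseline"]
# Stand-in selector: a fixed ranking truncated to k (illustration only).
manager = NGramIndexManager(records, lambda k: ["at", "gr", "am"][:k])
print(manager.set_space_constraint(2))
print(manager.set_space_constraint(1))  # a tighter constraint shrinks the index
```

Keeping selection behind a callable means the same rebuild path serves both the initial constraint and any later update.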

18. A non-transitory computer-readable storage medium containing program instructions to be executed on a computer, the program instructions for selecting a set of n-grams for indexing string data in a database management system (DBMS), wherein the DBMS includes a set of candidate n-grams, a set of sample queries, a set of database records, and an input space constraint ‘k’ related to resources allocated to the DBMS, the program instructions comprising;
- determining for each sample query from the set of sample queries, which n-grams in the candidate set are a substring of the sample query, and in response, forming a connection between the sample query and the n-gram;
- determining for each database record from the set of database records, which n-grams do not exist as a substring of the database record, and in response forming a connection between the database record and the n-gram;
- receiving the input space constraint ‘k’ the input space constraint ‘k’ directly related to resources allocated to the DBMS;
- calculating a benefit of each n-gram in the candidate set, wherein the benefit of each n-gram is computed as a number of previously uncovered connections between the queries and the database records that are made through the n-gram, wherein a connection made through an n-gram between a query and a database record comprises a query-record pair;
- identifying the n-gram with the highest computed benefit, and storing the identified n-gram in a selected n-gram set;
- recomputing the benefits of remaining n-grams in the candidate set by performing the calculating and the identifying ‘k’ times, thereby resulting in the selected n-gram set having ‘k’ n-grams, such that the number of query-record pairs connected through best n-grams is a maximum over all possible sets of best n-gram sets;
- generating an index, based on the selected n-gram set; and
- using the generated index by the DBMS to service incoming queries, whereby the generated index created from the selected n-gram set minimizes false hits because the selected n-gram set maximizes reachability among unique query-record pairs and maximizes rejections.
- 19. The non-transitory computer-readable storage medium of claim 18 further comprising:
  - implementing an approximation algorithm as a utility of the DBMS.
- 20. The non-transitory computer-readable storage medium of claim 19 wherein the approximation algorithm provides a near optimal solution to an NP-hard n-gram selection problem, the non-transitory computer-readable storage medium further comprising:
  - providing the approximation algorithm with a definable ratio bound of the optimal solution.
- 21. The non-transitory computer-readable storage medium of claim 20 wherein the definable ratio bound is:
- 22. The non-transitory computer-readable storage medium of claim 19 wherein the DBMS is a relational database system.
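The final step of claims 7 and 18, using the generated index to service incoming queries, might look roughly like the following sketch. It assumes the index maps each selected n-gram to a set of record positions, that candidate records are found by intersecting the posting lists of the n-grams occurring in the query, and that each candidate is then verified against the actual record. The function name `search` and the sample index are hypothetical.

```python
def search(query, records, index):
    """Return (matching record positions, number of false hits)."""
    # Selected n-grams that occur in the query can be used for filtering.
    usable = [g for g in index if g in query]
    if usable:
        candidate_ids = set.intersection(*(index[g] for g in usable))
    else:
        # No usable n-gram: every record has to be checked.
        candidate_ids = set(range(len(records)))
    # Verification step: candidates that fail here are the false hits.
    hits = sorted(i for i in candidate_ids if query in records[i])
    return hits, len(candidate_ids) - len(hits)


records = ["database", "datagram", "program", "baseline"]
index = {"at": {0, 1}, "gr": {1, 2}}
for q in ("data", "gram", "base"):
    print(q, search(q, records, index))
```

The query "base" shares no n-gram with this small index, so all four records are fetched and two of them are discarded during verification. Those discarded fetches are the cost that a better-chosen n-gram set is meant to reduce.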

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Iyer, Balakrishna Raghavendra; Mehrotra, Sharad; Hacigumus, Vahit Hakan
Primary Examiner(s): Saeed; Usmaan
Application Number: US12/264,899
Publication Number: US 20090063404A1
Time in Patent Office: 1,015 Days
Field of Search: None
US Class Current: 707/741

CPC Class Codes
- G06F 16/316   Indexing structures
- G06F 16/322   Trees
- Y10S 707/99932   Access augmentation or opti...
- Y10S 707/99933   Query processing, i.e. sear...
- Y10S 707/99934   Query formulation, input pr...
- Y10S 707/99935   Query augmenting and refini...
- Y10S 707/99936   Pattern matching access
- Y10S 707/99937   Sorting
- Y10S 707/99942   Manipulating data structure...
- Y10S 707/99943   Generating database or data...
