Automatically and adaptively determining execution plans for queries with parameter markers

US 7,958,113 B2
Filed: 05/22/2008
Issued: 06/07/2011
Est. Priority Date: 02/09/2007
Status: Expired due to Fees

Abstract

A method and system for automatically and adaptively determining query execution plans for parametric queries. A first classifier trained by an initial set of training points is generated. A query workload and/or database statistics are dynamically updated. A new set of training points is collected off-line. Using the new set of training points, the first classifier is modified into a second classifier. A database query is received at a runtime subsequent to the off-line phase. The query includes predicates having parameter markers bound to actual values. The predicates are associated with selectivities. A mapping of the selectivities into a plan determines the query execution plan. The determined query execution plan is included in an augmented set of training points, where the augmented set includes the initial set and the new set.

11 Claims

1. A computer-based method of automatically and adaptively determining query execution plans for queries having parameter markers, said method comprising:
- generating, by a computing system, a first classifier trained by an initial set of training points;
  
  dynamically updating, by a computing system at a first runtime thereof, at least one of a workload of queries processed by a database of said computing system and database statistics collected by said database for computing a plurality of selectivities;
  
  collecting, by a computing system in an off-line phase thereof, said off-line phase being subsequent to said first runtime, a new set of training points, said collecting responsive to a detection of said dynamically updating;
  
  modifying, by said computing system in said off-line phase, said first classifier into a second classifier, said modifying including utilizing said new set of training points;
  
  receiving, by said computing system at a second runtime thereof, said second runtime being subsequent to said off-line phase, a query for said database, said query including one or more predicates, each predicate including one or more parameter markers bound to one or more actual values, and said one or more predicates associated with one or more selectivities of said plurality of selectivities in a one-to-one correspondence; and
  
  automatically determining a query execution plan by said computing system, said automatically determining including mapping, by said second classifier, said one or more selectivities into said query execution plan, wherein said query execution plan is included in an augmented set of training points, said augmented set including said initial set and said new set, wherein said generating said first classifier comprises utilizing a machine learning technique, wherein said modifying said first classifier into said second classifier includes maintaining said first classifier incrementally, wherein said machine learning technique is a boosting technique, and wherein said method further comprises:
  
  determining a subset of training points of said initial set of training points, said subset of training points belonging to a plurality of classes, each class having less than a predetermined threshold coverage of said initial set of training points;
  
  assigning training points of said subset of training points to a single unclassified class;
  
  setting a number of classes to k, wherein k is one plus a number of classes having greater than said predetermined threshold coverage;
  
  generating an error-correcting output code (ECOC) table of length 2*k;
  
  training a binary classifier as said first classifier, said training including utilizing AdaBoost with confidence-rated predictions for each column in said ECOC table, said training said binary classifier including:
  
  initializing said augmented set of training points with equal weights; and
  
  performing a training procedure for T rounds, each round of said training procedure comprising:
  
  training a plurality of weak learners on said augmented set of training points,
  
  choosing a weak learner of said plurality of weak learners, said weak learner having a lowest training error of a plurality of training errors associated with said plurality of weak learners in a one-to-one correspondence,
  
  assigning a weight to said weak learner, said weight being a function of said lowest training error,
  
  assigning exponentially higher weight to any misclassified training points of said augmented set of training points, and
  
  assigning exponentially lower weight to any correctly classified training points of said augmented set of training points; and
  
  said training said binary classifier further including:
  
  outputting a model including T weak learners and T weights, said T weights associated with said T weak learners in a one-to-one correspondence, each weak learner of said T weak learners chosen by said choosing said weak learner in each round of said T rounds, and each weight of said T weights assigned by said assigning said weight in each round of said T rounds.
- - 2. The method of claim 1, further comprising:
    - evaluating a weighted vote from said T weak learners of said model, said evaluating including providing an error-correcting output code;
      
      comparing said error-correcting output code to each row of a plurality of rows of said ECOC table;
      
      computing a plurality of Hamming distances based on said comparing; and
      
      predicting a class corresponding to a row of said plurality of rows of said ECOC table, said row having a lowest Hamming distance of said plurality of Hamming distances.
  - 3. The method of claim 1, further comprising adapting to a change in said workload by modifying said first classifier to said second classifier, wherein no new query execution plans meet the predetermined threshold coverage, said adapting comprising:
    - introducing said new set of training points in batches at time t, said t being indicated by a range of integers, wherein a lowest integer of said range of integers indicates a most recent batch of said batches;
      
      weighting each training point of said augmented set by an associated α^t value of a set of α^t values, wherein α is between 0 and 1, said weighting each training point resulting in a decrease of weights of older training points of said augmented set of training points;
      
      retiring any training point of said augmented set in response to said associated α^t value differing from zero by less than a first predetermined amount;
      
      training all binary classifiers for a predefined number of additional rounds;
      
      weighting each vote of said T weak learners in said model by an associated β^t value of a set of β^t values, wherein β is between 0 and 1, said weighting each vote resulting in an emphasis of one or more votes of a subset of most recently trained weak learners, said subset of most recently trained weak learners being a subset of said T weak learners; and
      
      retiring, from said model, any weak learner of said T weak learners in response to said associated β^t value differing from zero by less than a second predetermined amount.
  - 4. The method of claim 1, further comprising adapting to a change in said workload, wherein a new query execution plan meets the predetermined threshold coverage and a query execution plan cache for storing said new query execution plan is not at capacity, said adapting comprising:
    - increasing a size of said ECOC table to accommodate a new class, said increasing resulting in an increased ECOC table;
      
      fully training additional binary classifiers for one or more new columns of said increased ECOC table, said one or more new columns not included in said ECOC table prior to said increasing; and
      
      retraining other binary classifiers for classes other than said new class, said retraining including training said other binary classifiers for a predetermined number of rounds to incorporate training data for said new class, said training for said predetermined number of rounds not including a full training of said other binary classifiers.
  - 5. The method of claim 1, further comprising adapting to a change in said workload, wherein a new query execution plan meets the predetermined threshold coverage and a query execution plan cache cannot store said new query execution plan without retiring an existing query execution plan of a plurality of existing query execution plans stored in said query plan cache, said adapting comprising:
    - selecting said existing query execution plan from said plurality of existing query execution plans;
      
      retiring one or more training points of said augmented training set, said one or more training points associated with a class of said existing query execution plan;
      
      maintaining said ECOC table with no changes;
      
      retraining one or more binary classifiers for a predefined number of rounds to incorporate said new set of training points, said retraining said one or more binary classifiers not including a full training of said one or more binary classifiers.
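
The training and prediction steps recited in claims 1 and 2 can be pictured with a minimal Python sketch. It is not the patented implementation: it substitutes plain discrete AdaBoost over fixed decision stumps for the confidence-rated variant the claim names, it assumes a class at exactly the threshold coverage is retained, and every name in it (`fold_rare_classes`, `stump`, `train_adaboost`, `predict_bit`, `decode`, the `"unclassified"` label) is hypothetical.

```python
import math
from collections import Counter

def fold_rare_classes(plans, threshold):
    """Relabel plans whose coverage is below `threshold` as 'unclassified'.
    Returns the relabeled list and k = retained classes + 1."""
    counts = Counter(plans)
    common = {p for p, c in counts.items() if c / len(plans) >= threshold}
    folded = [p if p in common else "unclassified" for p in plans]
    return folded, len(common) + 1

def stump(feature, threshold, polarity):
    """Hypothetical weak learner: thresholds one selectivity, returns +1 or -1."""
    return lambda x: polarity if x[feature] <= threshold else -polarity

def train_adaboost(points, labels, candidates, rounds):
    """Discrete AdaBoost for one ECOC column; labels are +1 or -1.
    Scoring fixed candidate stumps on the weighted data stands in for
    'training a plurality of weak learners' in each round."""
    n = len(points)
    weights = [1.0 / n] * n                      # equal initial weights
    model = []                                   # (vote weight, weak learner) pairs
    for _ in range(rounds):
        def weighted_error(h):
            return sum(w for w, x, y in zip(weights, points, labels) if h(x) != y)
        best = min(candidates, key=weighted_error)
        # Clamp the error so the logarithm below stays finite.
        err = min(max(weighted_error(best), 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight from the error
        model.append((alpha, best))
        # Misclassified points gain weight exponentially; correct ones lose it.
        weights = [w * math.exp(-alpha * y * best(x))
                   for w, x, y in zip(weights, points, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict_bit(model, x):
    """Weighted vote of the weak learners, mapped to an ECOC bit."""
    score = sum(alpha * h(x) for alpha, h in model)
    return 1 if score >= 0 else 0

def decode(code, ecoc_table):
    """Return the class whose ECOC row is closest to `code` in Hamming distance."""
    def hamming(row):
        return sum(a != b for a, b in zip(row, code))
    return min(ecoc_table, key=lambda cls: hamming(ecoc_table[cls]))
```

In use, one model would be trained per ECOC column, the per-column bits from `predict_bit` would be concatenated into a code word, and `decode` would pick the plan class. The sketch leaves out how the ECOC rows themselves are generated.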

6. A computing system comprising:
- a processor; and
  
  a computer-readable memory unit coupled to said processor, said memory unit comprising a software application and instructions that when executed by said processor implement a method of automatically and adaptively determining query execution plans for queries having parameter markers, said method comprising:
  
  generating, by a computing system, a first classifier trained by an initial set of training points;
  
  dynamically updating, by a computing system at a first runtime thereof, at least one of a workload of queries processed by a database of said computing system and database statistics collected by said database for computing a plurality of selectivities;
  
  collecting, by a computing system in an off-line phase thereof, said off-line phase being subsequent to said first runtime, a new set of training points, said collecting responsive to a detection of said dynamically updating;
  
  modifying, by said computing system in said off-line phase, said first classifier into a second classifier, said modifying including utilizing said new set of training points;
  
  receiving, by said computing system at a second runtime thereof, said second runtime being subsequent to said off-line phase, a query for said database, said query including one or more predicates, each predicate including one or more parameter markers bound to one or more actual values, and said one or more predicates associated with one or more selectivities of said plurality of selectivities in a one-to-one correspondence; and
  
  automatically determining a query execution plan by said computing system, said automatically determining including mapping, by said second classifier, said one or more selectivities into said query execution plan, wherein said query execution plan is included in an augmented set of training points, said augmented set including said initial set and said new set, wherein said generating said first classifier comprises utilizing a machine learning technique, wherein said modifying said first classifier into said second classifier includes maintaining said first classifier incrementally, wherein said machine learning technique is a boosting technique, and wherein said method further comprises:
  
  determining a subset of training points of said initial set of training points, said subset of training points belonging to a plurality of classes, each class having less than a predetermined threshold coverage of said initial set of training points;
  
  assigning training points of said subset of training points to a single unclassified class;
  
  setting a number of classes to k, wherein k is one plus a number of classes having greater than said predetermined threshold coverage;
  
  generating an error-correcting output code (ECOC) table of length 2*k;
  
  training a binary classifier as said first classifier, said training including utilizing AdaBoost with confidence-rated predictions for each column in said ECOC table, said training said binary classifier including:
  
  initializing said augmented set of training points with equal weights; and
  
  performing a training procedure for T rounds, each round of said training procedure comprising:
  
  training a plurality of weak learners on said augmented set of training points,
  
  choosing a weak learner of said plurality of weak learners, said weak learner having a lowest training error of a plurality of training errors associated with said plurality of weak learners in a one-to-one correspondence,
  
  assigning a weight to said weak learner, said weight being a function of said lowest training error,
  
  assigning exponentially higher weight to any misclassified training points of said augmented set of training points, and
  
  assigning exponentially lower weight to any correctly classified training points of said augmented set of training points; and
  
  said training said binary classifier further including:
  
  outputting a model including T weak learners and T weights, said T weights associated with said T weak learners in a one-to-one correspondence, each weak learner of said T weak learners chosen by said choosing said each weak learner in each round of said T rounds, and each weight of said T weights assigned by said assigning said weight in each round of said T rounds.
- - 7. The computing system of claim 6, wherein said method further comprises:
    - evaluating a weighted vote from said T weak learners of said model, said evaluating including providing an error-correcting output code;
      
      comparing said error-correcting output code to each row of a plurality of rows of said ECOC table;
      
      computing a plurality of Hamming distances based on said comparing; and
      
      predicting a class corresponding to a row of said plurality of rows of said ECOC table, said row having a lowest Hamming distance of said plurality of Hamming distances.
  - 8. The computing system of claim 6, wherein said method further comprises adapting to a change in said workload by modifying said first classifier to said second classifier, wherein no new query execution plans meet the predetermined threshold coverage, said adapting comprising:
    - introducing said new set of training points in batches at time t, said t being indicated by a range of integers, wherein a lowest integer of said range of integers indicates a most recent batch of said batches;
      
      weighting each training point of said augmented set by an associated α^t value of a set of α^t values, wherein α is between 0 and 1, said weighting each training point resulting in a decrease of weights of older training points of said augmented set of training points;
      
      retiring any training point of said augmented set in response to said associated α^t value differing from zero by less than a first predetermined amount;
      
      training all binary classifiers for a predefined number of additional rounds;
      
      weighting each vote of said T weak learners in said model by an associated β^t value of a set of β^t values, wherein β is between 0 and 1, said weighting each vote resulting in an emphasis of one or more votes of a subset of most recently trained weak learners, said subset of most recently trained weak learners being a subset of said T weak learners; and
      
      retiring, from said model, any weak learner of said T weak learners in response to said associated β^t value differing from zero by less than a second predetermined amount.
  - 9. The computing system of claim 6, wherein said method further comprises adapting to a change in said workload, wherein a new query execution plan meets the predetermined threshold coverage and a query execution plan cache for storing said new query execution plan is not at capacity, said adapting comprising:
    - increasing a size of said ECOC table to accommodate a new class, said increasing resulting in an increased ECOC table;
      
      fully training additional binary classifiers for one or more new columns of said increased ECOC table, said one or more new columns not included in said ECOC table prior to said increasing; and
      
      retraining other binary classifiers for classes other than said new class, said retraining including training said other binary classifiers for a predetermined number of rounds to incorporate training data for said new class, said training for said predetermined number of rounds not including a full training of said other binary classifiers.
  - 10. The computing system of claim 6, wherein said method further comprises adapting to a change in said workload, wherein a new query execution plan meets the predetermined threshold coverage and a query execution plan cache cannot store said new query execution plan without retiring an existing query execution plan of a plurality of existing query execution plans stored in said query plan cache, said adapting comprising:
    - selecting said existing query execution plan from said plurality of existing query execution plans;
      
      retiring one or more training points of said augmented training set, said one or more training points associated with a class of said existing query execution plan;
      
      maintaining said ECOC table with no changes;
      
      retraining one or more binary classifiers for a predefined number of rounds to incorporate said new set of training points, said retraining said one or more binary classifiers not including a full training of said one or more binary classifiers.
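
The decay-and-retire scheme of claims 3 and 8 can be pictured as follows. This is a hedged sketch, assuming batch ages are integers that start at 0 for the most recent batch; the function name `decay_and_retire` and the sample rate and cutoff values are hypothetical and do not come from the patent. The same helper could serve for training points (decay rate α) and for weak-learner votes (decay rate β).

```python
def decay_and_retire(items, rate, cutoff):
    """Weight each item by rate**age and drop items whose weight is below cutoff.

    items  -- list of (age, payload) pairs; age 0 is the most recent batch
    rate   -- decay base, assumed to lie strictly between 0 and 1
    cutoff -- weights closer to zero than this are retired
    Returns a list of (weight, payload) pairs for the surviving items.
    """
    kept = []
    for age, payload in items:
        weight = rate ** age          # older batches get exponentially less weight
        if weight >= cutoff:          # otherwise the item is retired
            kept.append((weight, payload))
    return kept


# Example: three batches of training points aged 0, 1 and 5.
batches = [(0, "newest"), (1, "previous"), (5, "stale")]
survivors = decay_and_retire(batches, rate=0.5, cutoff=0.1)
```

With a rate of 0.5 the batch aged 5 carries a weight of 0.5^5, which falls under the 0.1 cutoff, so only the two recent batches survive.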

11. A computer-based method of automatically and adaptively determining query execution plans for queries having parameter markers, said method comprising:
- generating, by a computing system, a first classifier trained by an initial set of training points, wherein the initial set of training points consists of selectivities;
  
  generating a set of random decision trees (RDTs), said set of RDTs having a predetermined number of RDTs, wherein said generating said set of RDTs includes defining a generation procedure for each RDT of said set of RDTs, wherein said defining said generation procedure includes:
  
  choosing a selectivity of a plurality of selectivities for a first node of an RDT of said set of RDTs, said chosen selectivity not used in a higher level node of said RDT's hierarchy;
  
  selecting a decision threshold value for said chosen selectivity, said decision threshold value separating a set of query execution plans in said first node into two disjoint subsets of said set of query execution plans; and
  
  recursively using said generation procedure to expand said RDT for each subset of said two disjoint subsets until a number of query execution plans in a subset of said two disjoint subsets is fewer than a predefined minimum query execution plan threshold, a depth of said RDT reaches a depth threshold based on predefined criteria, or all query execution plans of said subset of said two disjoint subsets belong to a single type;
  
  dynamically updating, by a computing system at a first runtime thereof, at least one of a workload of queries processed by a database of said computing system and database statistics collected by said database for computing said plurality of selectivities;
  
  collecting, by a computing system in an off-line phase thereof, said off-line phase being subsequent to said first runtime, a new set of training points, said collecting responsive to a detection of said dynamically updating;
  
  modifying, by said computing system in said off-line phase, said first classifier into a second classifier, said modifying including utilizing said new set of training points;
  
  receiving, by said computing system at a second runtime thereof, said second runtime being subsequent to said off-line phase, a query for said database, said query including one or more predicates, each predicate including one or more parameter markers bound to one or more actual values; and
  
  automatically determining a query execution plan by a processor of said computing system, wherein said automatically determining includes:
  
  mapping, by said second classifier, one or more selectivities of said one or more predicates into said query execution plan, wherein said query execution plan is included in an augmented set of training points, said augmented set including said initial set of training points and said new set of training points;
  
  traversing a plurality of decision paths in said set of RDTs, wherein a decision path of said plurality of decision paths starts at a root of said RDT and ends at a leaf node of said RDT, wherein said traversing includes obtaining, across said set of RDTs, a set of posterior probabilities that are based on said augmented set of training points and said query execution plan, and obtaining, across said set of RDTs, one or more other sets of posterior probabilities that are based on one or more other query execution plans;
  
  computing a first average of said posterior probabilities included in said set of posterior probabilities;
  
  comparing said first average of said posterior probabilities to one or more other averages of said one or more other sets of posterior probabilities, wherein said comparing includes identifying said first average of said posterior probabilities as an optimal average selected from the group consisting of said first average of said posterior probabilities and said one or more other averages of said one or more other sets of posterior probabilities, and wherein said identifying said first average of said posterior probabilities as said optimal average includes utilizing a loss function;
  
  identifying said query execution plan as having the optimum average posterior probability, wherein said identifying said query execution plan is based on selecting said optimal average in response to said identifying said first average as said optimal average; and
  
  providing said query execution plan as a prediction of an output of a query optimizer of said database without utilizing said query optimizer to provide said output.
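
The random decision tree procedure of claim 11 can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the claimed implementation: the names `build_rdt`, `posterior` and `predict_plan`, the default stopping thresholds, the uniform choice of threshold between the observed minimum and maximum, and the use of a plain highest-average rule (which corresponds to a 0-1 loss) are all choices made here for brevity. The training set is assumed to be non-empty.

```python
import random
from collections import Counter

def build_rdt(points, rng, unused, depth, min_points=2, max_depth=4):
    """Grow one random decision tree.
    points -- non-empty list of (selectivity tuple, plan label)
    unused -- set of selectivity indexes not yet tested on this path"""
    plans = Counter(plan for _, plan in points)
    if (len(points) < min_points or depth >= max_depth
            or len(plans) == 1 or not unused):
        return {"leaf": plans}
    feature = rng.choice(sorted(unused))        # random selectivity not used above
    values = [s[feature] for s, _ in points]
    threshold = rng.uniform(min(values), max(values))
    left = [p for p in points if p[0][feature] <= threshold]
    right = [p for p in points if p[0][feature] > threshold]
    if not left or not right:                   # degenerate split: stop here
        return {"leaf": plans}
    rest = unused - {feature}
    return {"feature": feature, "threshold": threshold,
            "left": build_rdt(left, rng, rest, depth + 1, min_points, max_depth),
            "right": build_rdt(right, rng, rest, depth + 1, min_points, max_depth)}

def posterior(tree, selectivities):
    """Follow one root-to-leaf path and return plan frequencies at the leaf."""
    while "leaf" not in tree:
        go_left = selectivities[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    total = sum(tree["leaf"].values())
    return {plan: n / total for plan, n in tree["leaf"].items()}

def predict_plan(forest, selectivities):
    """Average the per-tree posteriors and return the plan with the highest average."""
    averaged = Counter()
    for tree in forest:
        for plan, p in posterior(tree, selectivities).items():
            averaged[plan] += p / len(forest)
    return max(averaged, key=averaged.get), dict(averaged)
```

A forest would be built by calling `build_rdt` a fixed number of times with a shared `random.Random` instance and `unused` set to all selectivity indexes; `predict_plan` then returns a plan label without invoking a query optimizer.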

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Stoyanovich, Julia; Lohman, Guy Maring; Rao, Jun; Simmen, David Everett; Fan, Wei; Markl, Volker Gerhard; Megiddo, Nimrod
Primary Examiner(s): Mahmoudi, Tony
Assistant Examiner(s): Weinrich, Brian E
Application Number: US12/125,221
Publication Number: US 20080222093A1
Time in Patent Office: 1,111 Days
Field of Search: 707/718, 709/235
US Class Current: 707/718
CPC Class Codes: G06F 16/24545 Selectivity estimation or d...
