Database query optimization using clustering data mining

US 8,229,917 B1
Filed: 02/24/2011
Issued: 07/24/2012
Est. Priority Date: 02/24/2011
Status: Expired due to Fees

First Claim

Patent Images

1. A method of optimizing a database query, said method comprising:

a computer system receiving a database table populated with data;

said computer system scanning said database table;

said computer system determining statistics and single column histograms that describe data included in single columns of said database table;

said computer system estimating cardinality based on said statistics and said single column histograms;

said computer system determining all possible correlations among multiple columns by performing clustering data mining, wherein one or more columns of said multiple columns are included in said database table, and wherein said performing clustering data mining includes segregating said data that populates said database table into a plurality of clusters, each cluster having a corresponding rule whose conditions are matched by one or more rows of said database table, and further having a corresponding support count that indicates a number of rows of said database table that satisfy said rule;

said computer system ranking said multiple columns based on said determined correlations;

said computer system determining top ranked columns of said multiple columns based on said ranking;

said computer system determining said estimated cardinality differs from said corresponding support count by more than a threshold amount;

in response to said determining said estimated cardinality differs from said corresponding support count by more than said threshold amount, said computer system determining multiple column histograms based on said top ranked columns; and

said computer system generating an optimal query plan based on said multiple column histograms.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method and system for optimizing a database query. A database table populated with data is received and scanned. Statistics and single column histograms associated with single columns of the table are determined. Cardinality based on the statistics and histograms is estimated. All possible correlations among multiple columns are determined by performing clustering data mining that partitions data in the table into clusters. Top ranked columns based on the correlations are determined. The difference between the estimated cardinality and a support count of a cluster is determined to exceed a threshold, and in response, multiple column histograms based on the top ranked columns are determined. An optimal query plan based on the multiple column histograms is generated.

54 Citations

View as Search Results

20 Claims

1. A method of optimizing a database query, said method comprising:
- a computer system receiving a database table populated with data;
  
  said computer system scanning said database table;
  
  said computer system determining statistics and single column histograms that describe data included in single columns of said database table;
  
  said computer system estimating cardinality based on said statistics and said single column histograms;
  
  said computer system determining all possible correlations among multiple columns by performing clustering data mining, wherein one or more columns of said multiple columns are included in said database table, and wherein said performing clustering data mining includes segregating said data that populates said database table into a plurality of clusters, each cluster having a corresponding rule whose conditions are matched by one or more rows of said database table, and further having a corresponding support count that indicates a number of rows of said database table that satisfy said rule;
  
  said computer system ranking said multiple columns based on said determined correlations;
  
  said computer system determining top ranked columns of said multiple columns based on said ranking;
  
  said computer system determining said estimated cardinality differs from said corresponding support count by more than a threshold amount;
  
  in response to said determining said estimated cardinality differs from said corresponding support count by more than said threshold amount, said computer system determining multiple column histograms based on said top ranked columns; and
  
  said computer system generating an optimal query plan based on said multiple column histograms.
- View Dependent Claims (2, 3, 4, 5)
- - 2. The method of claim 1, further comprising said computer system receiving a parameter k, wherein said segregating said data into said plurality of clusters includes carrying out a k-means algorithm that partitions n cases indicated by rows of said database table into k clusters.
  - 3. The method of claim 1, wherein said determining all possible correlations among multiple columns includes:
    - determining said multiple columns are within said database table; and
      
      determining said possible correlations during a regularly scheduled occurrence of said determining said statistics based on index key definitions or determining said possible correlations on demand based on access and filter predicates.
  - 4. The method of claim 1, wherein said determining all possible correlations among multiple columns includes:
    - determining said multiple columns are in different database tables; and
      
      determining said possible correlations during a regularly scheduled occurrence of said determining said statistics based on foreign key definitions or determining said possible correlations on demand based on a join of access and filter predicates of said different database tables.
  - 5. The method of claim 1, wherein said determining all possible correlations among multiple columns by said performing clustering data mining includes automatically determining said possible correlations without a human manually analyzing data included in said multiple columns.

6. A computer system comprising:
- a central processing unit (CPU);
  
  a computer-readable memory coupled to said CPU; and
  
  a computer-readable tangible storage device coupled to said CPU, said storage device including program code configured to be carried out by said CPU via said memory to implement a method of optimizing a database query, said method comprising;
  
  receiving a database table populated with data;
  
  scanning said database table;
  
  determining statistics and single column histograms that describe data included in single columns of said database table;
  
  estimating cardinality based on said statistics and said single column histograms;
  
  determining all possible correlations among multiple columns by performing clustering data mining, wherein one or more columns of said multiple columns are included in said database table, and wherein said performing clustering data mining includes segregating said data that populates said database table into a plurality of clusters, each cluster having a corresponding rule whose conditions are matched by one or more rows of said database table, and further having a corresponding support count that indicates a number of rows of said database table that satisfy said rule;
  
  ranking said multiple columns based on said determined correlations;
  
  determining top ranked columns of said multiple columns based on said ranking;
  
  determining said estimated cardinality differs from said corresponding support count by more than a threshold amount;
  
  in response to said determining said estimated cardinality differs from said corresponding support count by more than said threshold amount, determining multiple column histograms based on said top ranked columns; and
  
  generating an optimal query plan based on said multiple column histograms.
- View Dependent Claims (7, 8, 9, 10)
- - 7. The computer system of claim 6, wherein said method further comprises receiving a parameter k, wherein said segregating said data into said plurality of clusters includes carrying out a k-means algorithm that partitions n cases indicated by rows of said database table into k clusters.
  - 8. The computer system of claim 6, wherein said determining all possible correlations among multiple columns includes:
    - determining said multiple columns are within said database table; and
      
      determining said possible correlations during a regularly scheduled occurrence of said determining said statistics based on index key definitions or determining said possible correlations on demand based on access and filter predicates.
  - 9. The computer system of claim 6, wherein said determining all possible correlations among multiple columns includes:
    - determining said multiple columns are in different database tables; and
      
      determining said possible correlations during a regularly scheduled occurrence of said determining said statistics based on foreign key definitions or determining said possible correlations on demand based on a join of access and filter predicates of said different database tables.
  - 10. The computer system of claim 6, wherein said determining all possible correlations among multiple columns by said performing clustering data mining includes automatically determining said possible correlations without a human manually analyzing data included in said multiple columns.

11. A computer program product comprising a computer-readable, tangible storage device coupled to a processor of a computer system, said storage device having computer-readable program code stored therein, said computer-readable program code containing instructions that are carried out by said processor to implement a method of optimizing a database query, said method comprising:
- receiving a database table populated with data;
  
  scanning said database table;
  
  determining statistics and single column histograms that describe data included in single columns of said database table;
  
  estimating cardinality based on said statistics and said single column histograms;
  
  determining all possible correlations among multiple columns by performing clustering data mining, wherein one or more columns of said multiple columns are included in said database table, and wherein said performing clustering data mining includes segregating said data that populates said database table into a plurality of clusters, each cluster having a corresponding rule whose conditions are matched by one or more rows of said database table, and further having a corresponding support count that indicates a number of rows of said database table that satisfy said rule;
  
  ranking said multiple columns based on said determined correlations;
  
  determining top ranked columns of said multiple columns based on said ranking;
  
  determining said estimated cardinality differs from said corresponding support count by more than a threshold amount;
  
  in response to said determining said estimated cardinality differs from said corresponding support count by more than said threshold amount, determining multiple column histograms based on said top ranked columns; and
  
  generating an optimal query plan based on said multiple column histograms.
- View Dependent Claims (12, 13, 14, 15)
- - 12. The program product of claim 11, wherein said method further comprises receiving a parameter k, wherein said segregating said data into said plurality of clusters includes carrying out a k-means algorithm that partitions n cases indicated by rows of said database table into k clusters.
  - 13. The program product of claim 11, wherein said determining all possible correlations among multiple columns includes:
    - determining said multiple columns are within said database table; and
      
      determining said possible correlations during a regularly scheduled occurrence of said determining said statistics based on index key definitions or determining said possible correlations on demand based on access and filter predicates.
  - 14. The program product of claim 11, wherein said determining all possible correlations among multiple columns includes:
    - determining said multiple columns are in different database tables; and
      
      determining said possible correlations during a regularly scheduled occurrence of said determining said statistics based on foreign key definitions or determining said possible correlations on demand based on a join of access and filter predicates of said different database tables.
  - 15. The program product of claim 11, wherein said determining all possible correlations among multiple columns by said performing clustering data mining includes automatically determining said possible correlations without a human manually analyzing data included in said multiple columns.

16. A process for supporting computing infrastructure, said process comprising providing at least one support service for at least one of creating, integrating, hosting, maintaining, and deploying computer-readable code in a computer system comprising a processor, wherein said processor carries out instructions contained in said code causing said computer system to perform a method of optimizing a database query, said method comprising:
- said computer system receiving a database table populated with data;
  
  said computer system scanning said database table;
  
  said computer system determining statistics and single column histograms that describe data included in single columns of said database table;
  
  said computer system estimating cardinality based on said statistics and said single column histograms;
  
  said computer system determining all possible correlations among multiple columns by performing clustering data mining, wherein one or more columns of said multiple columns are included in said database table, and wherein said performing clustering data mining includes segregating said data that populates said database table into a plurality of clusters, each cluster having a corresponding rule whose conditions are matched by one or more rows of said database table, and further having a corresponding support count that indicates a number of rows of said database table that satisfy said rule;
  
  said computer system ranking said multiple columns based on said determined correlations;
  
  said computer system determining top ranked columns of said multiple columns based on said ranking;
  
  said computer system determining said estimated cardinality differs from said corresponding support count by more than a threshold amount;
  
  in response to said determining said estimated cardinality differs from said corresponding support count by more than said threshold amount, said computer system determining multiple column histograms based on said top ranked columns; and
  
  said computer system generating an optimal query plan based on said multiple column histograms.
- View Dependent Claims (17, 18, 19, 20)
- - 17. The process of claim 16, wherein said method further comprises said computer system receiving a parameter k, wherein said segregating said data into said plurality of clusters includes carrying out a k-means algorithm that partitions n cases indicated by rows of said database table into k clusters.
  - 18. The process of claim 16, wherein said determining all possible correlations among multiple columns includes:
    - determining said multiple columns are within said database table; and
      
      determining said possible correlations during a regularly scheduled occurrence of said determining said statistics based on index key definitions or determining said possible correlations on demand based on access and filter predicates.
  - 19. The process of claim 16, wherein said determining all possible correlations among multiple columns includes:
    - determining said multiple columns are in different database tables; and
      
      determining said possible correlations during a regularly scheduled occurrence of said determining said statistics based on foreign key definitions or determining said possible correlations on demand based on a join of access and filter predicates of said different database tables.
  - 20. The process of claim 16, wherein said determining all possible correlations among multiple columns by said performing clustering data mining includes automatically determining said possible correlations without a human manually analyzing data included in said multiple columns.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Aneas, Fabio F. M., Katahira, Reinaldo T., Mariano, Alessandro B. A., da Rocha, Pedro H. V.
Primary Examiner(s)
Vy, Hung T

Application Number

US13/033,803
Time in Patent Office

516 Days
Field of Search

707/713, 707/714, 707/718, 707/999.002
US Class Current

707/713
CPC Class Codes

G06F 16/24545 Selectivity estimation or d...

Database query optimization using clustering data mining

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

54 Citations

20 Claims

Specification

Solutions

Use Cases

Quick Links

Database query optimization using clustering data mining

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

54 Citations

20 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links