Database query optimization using clustering data mining
First Claim
1. A method of optimizing a database query, said method comprising:
- a computer system receiving a database table populated with data;
said computer system scanning said database table;
said computer system determining statistics and single column histograms that describe data included in single columns of said database table;
said computer system estimating cardinality based on said statistics and said single column histograms;
said computer system determining all possible correlations among multiple columns by performing clustering data mining, wherein one or more columns of said multiple columns are included in said database table, and wherein said performing clustering data mining includes segregating said data that populates said database table into a plurality of clusters, each cluster having a corresponding rule whose conditions are matched by one or more rows of said database table, and further having a corresponding support count that indicates a number of rows of said database table that satisfy said rule;
said computer system ranking said multiple columns based on said determined correlations;
said computer system determining top ranked columns of said multiple columns based on said ranking;
said computer system determining said estimated cardinality differs from said corresponding support count by more than a threshold amount;
in response to said determining said estimated cardinality differs from said corresponding support count by more than said threshold amount, said computer system determining multiple column histograms based on said top ranked columns; and
said computer system generating an optimal query plan based on said multiple column histograms.
1 Assignment
0 Petitions
Accused Products
Abstract
A method and system for optimizing a database query. A database table populated with data is received and scanned. Statistics and single column histograms associated with single columns of the table are determined. Cardinality based on the statistics and histograms is estimated. All possible correlations among multiple columns are determined by performing clustering data mining that partitions data in the table into clusters. Top ranked columns based on the correlations are determined. The difference between the estimated cardinality and a support count of a cluster is determined to exceed a threshold, and in response, multiple column histograms based on the top ranked columns are determined. An optimal query plan based on the multiple column histograms is generated.
54 Citations
20 Claims
-
1. A method of optimizing a database query, said method comprising:
-
a computer system receiving a database table populated with data; said computer system scanning said database table; said computer system determining statistics and single column histograms that describe data included in single columns of said database table; said computer system estimating cardinality based on said statistics and said single column histograms; said computer system determining all possible correlations among multiple columns by performing clustering data mining, wherein one or more columns of said multiple columns are included in said database table, and wherein said performing clustering data mining includes segregating said data that populates said database table into a plurality of clusters, each cluster having a corresponding rule whose conditions are matched by one or more rows of said database table, and further having a corresponding support count that indicates a number of rows of said database table that satisfy said rule; said computer system ranking said multiple columns based on said determined correlations; said computer system determining top ranked columns of said multiple columns based on said ranking; said computer system determining said estimated cardinality differs from said corresponding support count by more than a threshold amount; in response to said determining said estimated cardinality differs from said corresponding support count by more than said threshold amount, said computer system determining multiple column histograms based on said top ranked columns; and said computer system generating an optimal query plan based on said multiple column histograms. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A computer system comprising:
-
a central processing unit (CPU); a computer-readable memory coupled to said CPU; and a computer-readable tangible storage device coupled to said CPU, said storage device including program code configured to be carried out by said CPU via said memory to implement a method of optimizing a database query, said method comprising; receiving a database table populated with data; scanning said database table; determining statistics and single column histograms that describe data included in single columns of said database table; estimating cardinality based on said statistics and said single column histograms; determining all possible correlations among multiple columns by performing clustering data mining, wherein one or more columns of said multiple columns are included in said database table, and wherein said performing clustering data mining includes segregating said data that populates said database table into a plurality of clusters, each cluster having a corresponding rule whose conditions are matched by one or more rows of said database table, and further having a corresponding support count that indicates a number of rows of said database table that satisfy said rule; ranking said multiple columns based on said determined correlations; determining top ranked columns of said multiple columns based on said ranking; determining said estimated cardinality differs from said corresponding support count by more than a threshold amount; in response to said determining said estimated cardinality differs from said corresponding support count by more than said threshold amount, determining multiple column histograms based on said top ranked columns; and generating an optimal query plan based on said multiple column histograms. - View Dependent Claims (7, 8, 9, 10)
-
-
11. A computer program product comprising a computer-readable, tangible storage device coupled to a processor of a computer system, said storage device having computer-readable program code stored therein, said computer-readable program code containing instructions that are carried out by said processor to implement a method of optimizing a database query, said method comprising:
-
receiving a database table populated with data; scanning said database table; determining statistics and single column histograms that describe data included in single columns of said database table; estimating cardinality based on said statistics and said single column histograms; determining all possible correlations among multiple columns by performing clustering data mining, wherein one or more columns of said multiple columns are included in said database table, and wherein said performing clustering data mining includes segregating said data that populates said database table into a plurality of clusters, each cluster having a corresponding rule whose conditions are matched by one or more rows of said database table, and further having a corresponding support count that indicates a number of rows of said database table that satisfy said rule; ranking said multiple columns based on said determined correlations; determining top ranked columns of said multiple columns based on said ranking; determining said estimated cardinality differs from said corresponding support count by more than a threshold amount; in response to said determining said estimated cardinality differs from said corresponding support count by more than said threshold amount, determining multiple column histograms based on said top ranked columns; and generating an optimal query plan based on said multiple column histograms. - View Dependent Claims (12, 13, 14, 15)
-
-
16. A process for supporting computing infrastructure, said process comprising providing at least one support service for at least one of creating, integrating, hosting, maintaining, and deploying computer-readable code in a computer system comprising a processor, wherein said processor carries out instructions contained in said code causing said computer system to perform a method of optimizing a database query, said method comprising:
-
said computer system receiving a database table populated with data; said computer system scanning said database table; said computer system determining statistics and single column histograms that describe data included in single columns of said database table; said computer system estimating cardinality based on said statistics and said single column histograms; said computer system determining all possible correlations among multiple columns by performing clustering data mining, wherein one or more columns of said multiple columns are included in said database table, and wherein said performing clustering data mining includes segregating said data that populates said database table into a plurality of clusters, each cluster having a corresponding rule whose conditions are matched by one or more rows of said database table, and further having a corresponding support count that indicates a number of rows of said database table that satisfy said rule; said computer system ranking said multiple columns based on said determined correlations; said computer system determining top ranked columns of said multiple columns based on said ranking; said computer system determining said estimated cardinality differs from said corresponding support count by more than a threshold amount; in response to said determining said estimated cardinality differs from said corresponding support count by more than said threshold amount, said computer system determining multiple column histograms based on said top ranked columns; and said computer system generating an optimal query plan based on said multiple column histograms. - View Dependent Claims (17, 18, 19, 20)
-
Specification