Multi-dimensional database record compression utilizing optimized cluster models
First Claim
1. A method of querying a database containing data records stored in the database comprising the steps of:
- a) clustering data from data records having multiple dimensions that are stored on a database to provide an initial cluster model having an initial probability distribution describing data records in the database;
b) comparing the initial probability distribution with a representative sample of records in the database to determine a sufficiency of said initial probability distribution;
c) modifying the cluster model to provide an adjusted cluster model that characterizes the data in the database, said modifying step performed by finding a region within an attribute space of the data records of high discrepancy between the initial cluster model and a data sample gathered from the database and increasing the cluster number of the cluster model and reclustering at least a portion of the data in the database to produce the adjusted cluster model to reduce discrepancies between the initial probability distribution and data sample from the database; and
d) determining a sum or a count of data records from the database falling within specified ranges of the multiple dimensions by integrating a functional representation based on the probablity distribution of the adjusted cluster model over the ranges.
2 Assignments
0 Petitions
Accused Products
Abstract
Apparatus and method for use in querying a database containing data records. The database is characterized by a compression scheme to provide data clustering information. In accordance with a exemplary embodiment of the invention a functional representation of data clustering is a Gaussian and the queries are performing by integrating the Gaussian corresponding to each of the data clusters over the ranges to determine the sum or the count of data records from the database that fall within the selected ranges. The process chooses a value for the cluster number K. The cluster model is next broken up into areas (tiles) based on user defined parameters. Data from the database is then classified based on the tiling information. A sorted version of the classified data, ordered by cluster number and then by the tile number within the cluster is generated. This data is then evaluated to test the sufficiency of the model created during the clustering.
-
Citations
39 Claims
-
1. A method of querying a database containing data records stored in the database comprising the steps of:
-
a) clustering data from data records having multiple dimensions that are stored on a database to provide an initial cluster model having an initial probability distribution describing data records in the database;
b) comparing the initial probability distribution with a representative sample of records in the database to determine a sufficiency of said initial probability distribution;
c) modifying the cluster model to provide an adjusted cluster model that characterizes the data in the database, said modifying step performed by finding a region within an attribute space of the data records of high discrepancy between the initial cluster model and a data sample gathered from the database and increasing the cluster number of the cluster model and reclustering at least a portion of the data in the database to produce the adjusted cluster model to reduce discrepancies between the initial probability distribution and data sample from the database; and
d) determining a sum or a count of data records from the database falling within specified ranges of the multiple dimensions by integrating a functional representation based on the probablity distribution of the adjusted cluster model over the ranges. - View Dependent Claims (2, 3, 4, 5)
-
-
6. Apparatus for querying a database containing data records comprising:
-
a) means for providing a functional representation of data clustering for multiple data records stored on a database;
b) means for adjusting a fit between the functional representation and the data in the database to provide a more accurate functional representation of the data clustering by segmenting clusters into cluster segments and comparing an expected data density in the cluster segments based on the initial cluster model with an observed data density of data gathered from the database;
c) means for selecting ranges over dimensions of the data for determining a sum or a count of data records from the database falling within the ranges; and
d) means for determining the sum or the count of data records from the database that fall within the selected ranges by integrating the functional representation from each cluster of a multiple number of clusters over the ranges. - View Dependent Claims (7, 8)
-
-
9. Data mining apparatus for querying a database containing data records comprising:
-
a ) a memory device for storing a database comprising multiple data records organized into data fields having a dimension n for storing data record items;
b) a computer having one or more processing units for executing a stored computer program, said computer including a rapid access memory store; and
c) an interface for coupling the memory device for storing the database to the computer to allow records to be retrieved from the database;
whereind) said computer executing a stored program having software components including i) a component for providing a first data clustering model for multiple number of data records stored on a database, sampling data from the database, and producing an updated cluster model based on a comparison between the first data clustering model and the sampled data;
ii) a component for selecting ranges over dimensions of the data for determining a sum or a count of data records from the database falling within the ranges; and
iii) a component for integrating a functional representation based on the updated cluster model from each cluster over the ranges to determine the sum or the count of data records from the database that fall within the selected ranges;
wherein an amount of the computer'"'"'s rapid access memory store is allocated for storing outlying data records determined not to fit within any of the K clusters which are counted separately when determining the sum or the count. - View Dependent Claims (10)
-
-
11. A method of querying data records containing continuous attributes that are stored in the database;
- said method comprising the steps of;
a) defining a data cube over which to query the database by selecting ranges over one or more of the continuous attributes of the data records in a database;
b) providing a functional representation of data clustering from data records stored in the database;
said functional representation identifying a distribution probability of said data records by the steps of;
i) performing an intial clustering of data to provide an initial cluster model;
ii) determining regions of an attribute data space where the initial cluster model does not accurately characterize data in the database; and
iii) reclustering the data to produce a second cluster model which more accurately characterizes the data of the regions of the attribute data space found to not accurately charactaerize data in the database; and
c) determining a count of data records from the database that fall within the data 16 cube by integrating the functional representation of the second cluster model over the selected rangs. - View Dependent Claims (12, 13, 14, 15, 16)
having mixture weights W1 which represent a fraction of the database represented by a data cluster and wherein the number k is the number of such database clusters.
- said method comprising the steps of;
-
17. A computer-readable medium having computer executable instructions for performing steps comprising:
-
a) clustering data from data records having multiple dimensions that are stored on a database to provide an initial cluster model having an initial probability distribution describing data records in the database;
b) comparing the initial probability distribution with a representative sample of records in the database to determine a sufficiency of said initial probability distribution;
c) modifying the cluster model to provide an adjusted cluster model that better characterizes the data in the database, said modifying step performed by finding a region within an attribute space of the data records of high discrepancy between the initial cluster model and a data sample gathered from the database and increasing the cluster number of the cluster model and reclustering at least a portion of the data in the database to produce the adjusted cluster model to reduce discrepancies between the initial probability distribution and data sample from the database; and
d) determining a sum or a count of data records from the database falling within specified ranges of the multiple dimensions by integrating a functional representation based on the probability distribution of the adjusted cluster model over the ranges. - View Dependent Claims (18, 19, 20, 21, 22)
-
-
23. For use with a data mining system for evaluating a database made up of data records having multiple attributes;
- a clustering process comprising the steps of;
a) choosing a first cluster number to use in clustering data contained in a database;
b) clustering the data records from the database using the first cluster number to produce a database clustering model;
c) evaluating the database clustering model by identifying regions wherein a density of data records in the database differs from data distributions predicted by the database clustering model by;
i) dividing an attribute space of the database into multiple partitions;
ii) assigning a specified number of partitions to each of the multiple data record attributes based upon the database clustering model; and
iii) determining a density of data records within the multiple partitions to evaluate the sufficiency of the database clustering model; and
d) based on the identification of said regions, adjusting the cluster number and again clustering the data within the database to produce a subsequent database clustering model having a different cluster number. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33)
- a clustering process comprising the steps of;
-
34. Apparatus for use with a data mining system for evaluating a database made up of data records having multiple attributes comprising:
-
a) means for choosing a first cluster number to use in clustering data contained in a database;
b) means for clustering the data records from the database using the first cluster number to produce a database clustering model;
c) means for evaluating the database clustering model by identifying regions wherein a density of data records in the database differs from data distributions predicted by the database clustering model by;
i) dividing an attribute space of the database into multiple partitions;
ii) assigning a specified number of partitions to each of the multiple data record attributes based upon the database clustering model; and
iii) determining a density of data records within the multiple partitions to evaluate the sufficiency of the database clustering model; and
d) means for adjusting the cluster number based on the identification of said regions and again clustering the data within the database to produce a subsequent database clustering model having a different cluster number.
-
-
35. A method of querying a database containing data records stored in the database comprising the steps of:
-
a) clustering data from data records having multiple attributes including one or more attributes containing continuous, nondiscrete data that are stored on a database to provide an initial cluster model having an initial probability distribution describing data records in the database and wherein each cluster attribute is characterized by a Gaussian functional depiction of the cluster;
b) comparing the initial probability distribution with a representative sample of records in the database to determine a sufficiency of said initial probability distribution;
c) modifying the cluster model to provide an adjusted cluster model that characterizes the data in the database, said modifying step performed by adjusting the cluster model to reduce discrepancies between the initial probability distribution and data sampled from the database; and
d) determining a sum or a count of data records from the database falling within specified ranges of the multiple dimensions by integrating a functional representation based on the probability distribution of the adjusted cluster model over the ranges. - View Dependent Claims (36)
-
-
37. A method of querying a database containing data records stored in the database comprising the steps of:
-
a) clustering data from data records having multiple attributes that are stored on a database to provide an initial cluster model having an initial probability distribution describing data records in the database derived from a covariance matrix based on the data in the database;
b) comparing the initial probability distribution with a representative sample of records in the database to determine a sufficiency of said initial probability distribution;
c) modifying the cluster model to provide an adjusted cluster model that characterizes the data in the database, said modifying step performed by adjusting the cluster model to reduce discrepancies between the initial probability distribution and data sampled from the database; and
d) determining a sum or a count of data records from the database falling within specified ranges of the multiple dimensions by integrating a functional representation based on the probability distribution of the adjusted cluster model over the ranges.
-
-
38. A computer-readable medium having computer executable instructions for performing steps comprising:
-
a) clustering data from data records having multiple attributes including one or more attributes containing continuous, nondiscrete data that are stored on a database to provide an initial cluster model having an initial probability distribution describing data records in the database and wherein each cluster attribute is characterized by a Gaussian functional depiction of the cluster;
b) comparing the initial probability distribution with a representative sample of records in the database to determine a sufficiency of said initial probability distribution;
c) modifying the cluster model to provide an adjusted cluster model that characterizes the data in the database, said modifying step performed by adjusting the cluster model to reduce discrepancies between the initial probability distribution and data sampled from the database; and
d) determining a sum or a count of data records from the database falling within specified ranges of the multiple dimensions by integrating a functional representation based on the probability distribution of the adjusted cluster model over the ranges.
-
-
39. A computer-readable medium having computer executable instructions for performing steps comprising:
-
a) clustering data from data records having multiple attributes that are stored on a database to provide an initial cluster model having an initial probability distribution describing data records in the database derived from a covariance matrix based on the data in the database;
b) comparing the initial probability distribution with a representative sample of records in the database to determine a sufficiency of said initial probability distribution;
c) modifying the cluster model to provide an adjusted cluster model that characterizes the data in the database, said modifying step performed by adjusting the cluster model to reduce discrepancies between the initial probability distribution and data sampled from the database; and
d) determining a sum or a count of data records from the database falling within specified ranges of the multiple dimensions by integrating a functional representation based on the probability distribution of the adjusted cluster model over the ranges.
-
Specification