Scalable system for clustering of large databases having mixed data attributes
First Claim
1. In a computer data processing system, a method for clustering data in a database comprising the steps of:
- a) reading data records having both discrete and ordered attributes from a database storage medium and bringing a portion of the data records into a rapid access memory;
b) initializing a cluster model that characterizes the data within the database wherein the cluster model includes a table of probabilities for the enumerated or discrete data attributes of the data records for each cluster of a multiple number of clusters that make up the cluster model and wherein the cluster model for data attributes that are ordered comprises a mean and covariance for each cluster;
c) updating the cluster model from the database records brought into the rapid access memory;
d) summarizing at least some of the database records in the rapid access memory and storing a summarization within the rapid access memory;
e) evaluating a criteria to determine if further data should be accessed from the database to further cluster data records in the database; and
f) based on the evaluating step reading an additional number of records from the database storage medium and bringing said additional number of records into the rapid access memory for further updating of the cluster model.
2 Assignments
0 Petitions
Accused Products
Abstract
One exemplary embodiment of a scalable clustering algorithm accesses a database of records having attributes or data fields of both enumerated discrete and ordered values and brings a portion of the data records into a rapid access memory. A cluster model for the data includes a table of probabilities for the enumerated, discrete data fields of the data records. The cluster model for data fields that are ordered comprises a mean and spread of the cluster. The cluster model is updated from the database records brought into the rapid access memory. At least some of the database records in the rapid access memory are summarized and stored within the rapid access memory. A criteria is then evaluated to determine if further data should be accessed from the database to further cluster data records in the database. Based on the evaluating step, additional database records in the database are accessed and brought into the rapid access memory for further updating of the cluster model.
-
Citations
34 Claims
-
1. In a computer data processing system, a method for clustering data in a database comprising the steps of:
-
a) reading data records having both discrete and ordered attributes from a database storage medium and bringing a portion of the data records into a rapid access memory;
b) initializing a cluster model that characterizes the data within the database wherein the cluster model includes a table of probabilities for the enumerated or discrete data attributes of the data records for each cluster of a multiple number of clusters that make up the cluster model and wherein the cluster model for data attributes that are ordered comprises a mean and covariance for each cluster;
c) updating the cluster model from the database records brought into the rapid access memory;
d) summarizing at least some of the database records in the rapid access memory and storing a summarization within the rapid access memory;
e) evaluating a criteria to determine if further data should be accessed from the database to further cluster data records in the database; and
f) based on the evaluating step reading an additional number of records from the database storage medium and bringing said additional number of records into the rapid access memory for further updating of the cluster model. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
a) resetting a new model data structure to zero;
b) determining a weighted contribution to the new model for each unsummarized data point; and
c) determining a weighted contribution of the summarized data points to the new model.
-
-
11. The method of claim 10 further comprising the steps of:
-
a) providing an old model based upon a past modeling iteration;
b) comparing the old model to the new model; and
c) terminating the modeling process when the old model and new model are sufficiently similar.
-
-
12. The method of claim 1 wherein a probability that a data record belongs in a cluster for data records extracted from the database is calculated using a covariance matrix for the continuous attributes of the record.
-
13. The method of claim 1 wherein each cluster in the cluster model is characterized by a datapoint number for the cluster, a mean for each ordered data attribute, a covariance for each ordered data attribute and a probability table for each discrete data attribute and further wherein each data record read from the database storage medium contributes to an updating of the cluster model for at least one cluster.
-
14. The method of claim 13 wherein there are K clusters in the cluster model and wherein each data record contributes to a cluster model for each of the K clusters.
-
15. The method of claim 1 wherein the step of accessing database records is performed using a sequential scan of the database.
-
16. The method of claim 1 wherein the step of accessing database records is performed using a random index generator that does not repeat.
-
17. In a computer data mining system, apparatus for evaluating data in a database comprising:
-
a) one or more data storage devices for storing a database of data records on a storage medium, said data records including attributes of both discrete or enumerated data and ordered data;
b) a computer having a rapid access memory and an interface to the storage devices for reading data from the storage medium and bringing the data into said rapid access memory for subsequent evaluation; and
c) said computer comprising a processing unit for evaluating at least some of the data records in the database and for characterizing the data records into multiple numbers of data clusters, said processing unit programmed to retrieve a subset of data from the database into the rapid access memory, evaluate the subset of data to further characterize the database clustering using a clustering criteria, and produce a summarization of at least some of the retrieved data records before retrieving additional data records from the database;
said computer producing a cluster model that includes cluster probabilities for the discrete attributes and cluster means and covariance information for the ordered data in the rapid access memory during data clustering.- View Dependent Claims (18, 19, 20)
-
-
21. A computer readable medium having stored thereon a data structure, comprising:
-
a) a first data portion containing a model representation of data records stored in a database, wherein at least some of the database records include mixed data that includes both discrete data fields and continuous data fields;
b) a second data portion containing sufficient statistics of a portion of the data records in the database; and
c) a third data portion containing individual data records obtained from the database for use with the sufficient statistics to determine said model representation contained in the first data portion. - View Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29)
-
-
30. A computer-readable medium having computer-executable components comprising:
-
a) a database component for interfacing with a database that stores data records containing both enumerated or discrete and ordered values;
b) a rapid access memory component for storing at least of subset of said data records gathered from the database for processing;
c) a modeling component for constructing and storing a model of said database by determining if a data record is sufficiently described by any of several clusters using cluster criteria and for updating said cluster model based on said data records and for evaluating whether further of said data records should be moved from said database into said rapid access memory for modeling. - View Dependent Claims (31, 32, 33, 34)
-
Specification