Scalable system for K-means clustering of large databases
First Claim
1. In a computer data processing system, a method for clustering data in a database comprising the steps of:
- a) choosing a cluster number K for use in categorizing the data in the database into K different clusters;
b) accessing data records from a database and bringing a data portion into a rapid access memory;
c) assigning data records from the data portion to one of the K different clusters and determining a mean of the data records assigned to a given cluster;
d) summarizing at least some of the data assigned to the clusters, storing a summarization of the data within the rapid access memory;
e) accessing an additional portion of the data records in the database and bringing said additional portion into the rapid access memory;
f) again assigning data from the database to a cluster and determining an updated mean from the summarized data and the additional portion of data records; and
g) evaluating a criteria to determine if further data should be accessed from the database to continue clustering of data from the database.
3 Assignments
0 Petitions
Accused Products
Abstract
In one exemplary embodiment the invention provides a data mining system for use in evaluating data in a database. Before the data evaulation begins a choice is made of a cluster number K for use in categorizing the data in the database into K different clusters and initial guesses at the means, or centriods, of each cluster are provided. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory. Data contained in the data portion is used to update the original guesses at the centroids of each of the K clusters. Some of the data belonging to a cluster is summarized or compressed and stored as a summarization of the data. More data is accessed from the database and assigned to a cluster. An updated mean for the clusters is determined from the summarized data and the newly acquired data. A stopping criteria is evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criteria has been met.
-
Citations
32 Claims
-
1. In a computer data processing system, a method for clustering data in a database comprising the steps of:
-
a) choosing a cluster number K for use in categorizing the data in the database into K different clusters; b) accessing data records from a database and bringing a data portion into a rapid access memory; c) assigning data records from the data portion to one of the K different clusters and determining a mean of the data records assigned to a given cluster; d) summarizing at least some of the data assigned to the clusters, storing a summarization of the data within the rapid access memory; e) accessing an additional portion of the data records in the database and bringing said additional portion into the rapid access memory; f) again assigning data from the database to a cluster and determining an updated mean from the summarized data and the additional portion of data records; and g) evaluating a criteria to determine if further data should be accessed from the database to continue clustering of data from the database. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
-
-
17. A computer readable medium having stored thereon a data structure, comprising:
-
a) a first data portion containing a model representation of data stored on a database wherein the model includes K data clusters and wherein the model includes a mean for each cluster and a number of data point assigned to each cluster; b) a second data portion containing sufficient statistics of a portion of the data in the database; and c) a third data portion containing individual data records obtained from the database for use with the sufficient statistics to determine said model representation contained in the first data portion. - View Dependent Claims (18, 19)
-
-
20. In a computer data mining system, apparatus for evaluating data in a database comprising:
-
a) one or more data storage devices for storing a database of records on a storage medium; b) a computer having an interface to the storage devices for reading data from the storage medium and store the data into a rapid access memory for subsequent evaluation; and c) said computer comprising a processing unit for evaluating at least some of the data in the database and for characterizing the data into multiple numbers of data clusters;
said processing unit programmed to retrieve a subset of data from the database into the rapid access memory, evaluate the subset of data by assigning data records to one of the multiple number of data clusters, and produce a summarization of at least some of the retrieved data before retrieving additional data from the database. - View Dependent Claims (21, 22, 23, 24, 25)
-
-
26. In a computer data mining system, a method for use in evaluating data in a database comprising the steps of:
-
a) choosing a cluster number K for use in categorizing the data in the database into K different data clusters; b) choosing an initial centroid for each of the K data clusters; c) sampling a portion of the data in the database from a storage medium to bring said portion of data within a rapid access memory; d) assigning individual data records contained within the portion of data to a cluster based on nearness to a cluster centroid and updating the centroids for the clusters based on the data from the database to define a clustering model; e) compressing some of the data into data records of sufficient statistics for evaluating a clustering model; and f) continuing to sample data from the database, assigning data from the database to a cluster and determining an updated centroid from the data and the sufficient statistics until an evaluating criteria has been satisfied indicating the centroids have been adequately determined from a sampling of a subset of data within the database or until all data within the database has been evaluated. - View Dependent Claims (27)
-
-
28. In a computer data mining system, a method for evaluating data in a database that is stored on a storage medium comprising the steps of:
-
a) initializing multiple storage areas for storing multiple cluster models of the data in the database; b) obtaining a portion of the data in the database from a storage medium and assigning data records to the clusters of the multiple cluster models; c) using a clustering criteria to characterize a clustering of data from the portion of data obtained from the database for each model; d) summarizing at least some of the data contained within the portion of data based upon a compression criteria to produce sufficient statistics for the data satisfying the compression criteria; and e) continuing to obtain portions of data from the database and characterizing the clustering of data in the database from newly sampled data and the sufficient statistics for each of the multiple cluster models until a specified criteria has been satisfied. - View Dependent Claims (29, 30, 31)
-
-
32. A computer readable medium for storing program instructions for performing the steps of:
-
a) choosing a cluster number K for use in categorizing the data in the database into K different data clusters; b) choosing an initial centroid for each of the K data clusters; c) sampling a portion of the data in the database from a storage medium to bring said portion of data within a rapid access memory; d) assigning individual data records contained within the portion of data to a cluster based on nearness to a cluster centroid and updating the centroids for the clusters based on the data from the database to define a clustering model; e) compressing some of the data into data records of sufficient statistics for evaluating a clustering model; and f) continuing to sample data from the database, assigning data from the database to a cluster and determining an updated centroid from the data and the sufficient statistics until an evaluating criteria has been satisfied indicating the centroids have been adequately determined from a sampling of a subset of data within the database or until all data within the database has been evaluated.
-
Specification