Scalable system for clustering of large databases
First Claim
1. A method for clustering data in a database that is stored on a storage medium comprising the steps of:
- a) obtaining a portion of the data in the database from a storage medium;
b) clustering data from the portion of data obtained from the database based upon a clustering criteria to produce a clustering model;
c) compressing at least some of the data contained within the portion of data by evaluating a data compression criteria based on the clustering model and producing sufficient statistics for the data satisfying the compression criteria;
d) storing the sufficient statistics for the data satisfying the compression criteria separate from the clustering model for use in subsequent refinement of said clustering model;
e) continuing to obtain portions of data from the database and refining the clustering model that characterizes data in the database from newly sampled data and the stored sufficient statistics for the data satisfying the compression criteria until a specified stopping criteria has been satisfied; and
f) displaying progress of the characterization of the clustering of the database on a user interface and providing a user controller input for stopping or suspending further building of a database clustering model.
3 Assignments
0 Petitions
Accused Products
Abstract
A data mining system for use in finding clusters of data items in a database or any other data storage medium. The clusters are used in categorizing the data in the database into K different clusters within each of M models. An initial set of estimates (or guesses) of the parameters of each model to be explored (e.g. centriods in K-means), of each cluster are provided from some source. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original guesses at the parameters of the model in each of the K clusters over all M models. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criteria has been met.
220 Citations
20 Claims
-
1. A method for clustering data in a database that is stored on a storage medium comprising the steps of:
-
a) obtaining a portion of the data in the database from a storage medium;
b) clustering data from the portion of data obtained from the database based upon a clustering criteria to produce a clustering model;
c) compressing at least some of the data contained within the portion of data by evaluating a data compression criteria based on the clustering model and producing sufficient statistics for the data satisfying the compression criteria;
d) storing the sufficient statistics for the data satisfying the compression criteria separate from the clustering model for use in subsequent refinement of said clustering model;
e) continuing to obtain portions of data from the database and refining the clustering model that characterizes data in the database from newly sampled data and the stored sufficient statistics for the data satisfying the compression criteria until a specified stopping criteria has been satisfied; and
f) displaying progress of the characterization of the clustering of the database on a user interface and providing a user controller input for stopping or suspending further building of a database clustering model. - View Dependent Claims (2, 3, 4)
-
-
5. A computer readable medium having stored thereon a data structure, comprising:
-
a) a first data portion containing a model representation of data stored on a database wherein the model representation includes multiple clusters defined by a vector containing a summation of data records from the database having vector components corresponding to attributes of said data records;
b) a second data portion containing sufficient statistics of a portion of the data in the database which summarizes multiple clusters defined by a vector containing a summation of summarized data records from the database having vector components which are attributes of said data records;
c) a third data portion containing individual data records obtained from the database for use with the sufficient statistics to determine said model representation contained in the first data portion; and
d) wherein said sufficient statistics from the second data portion and said individual data records from the third data portion each contribute to said model representation contained in said first data portion. - View Dependent Claims (6, 7)
-
-
8. In a computer data mining system, apparatus for evaluating data in a database comprising:
-
a) one or more data storage devices for storing a database of records on a storage medium;
b) a computer having an interface to the data storage devices for reading data from the storage medium and a computer rapid access memory for storing data during subsequent data clustering evaluation; and
c) said computer comprising a processing unit for evaluating at least some of the data in the database and for characterizing the data into multiple numbers of data clusters;
said processing unit programmed to retrieve a subset of data from the database into the rapid access memory, evaluate the subset of data to produce a clustering model that characterizes the database using a clustering criteria, compressing at least some of the retrieved data by summarizing and storing the data in a compressed form separate from the clustering model for use in subsequent refinement of the clustering model and retrieve additional data from the database to update the clustering model based on the summarization of the previously retrieved data and the newly retrieved additional data.- View Dependent Claims (9, 10, 11, 12, 13)
-
-
14. In a computer data mining system, a method for evaluating data in a database that is stored on a storage medium to produce multiple clustering models in one scan of the database or less comprising the steps of:
-
a) initializing multiple storage areas for storing multiple cluster models of the data in the database;
b) obtaining a portion of the data in the database from a storage medium;
c) using a clustering criteria to characterize a clustering of data from the portion of data obtained from the database for each model;
d) compressing at least some of the data contained within the portion of data based upon a compression criteria to produce for storage separate from the clustering models sufficient statistics for the data satisfying the compression criteria for use in refining the clustering models; and
e) continuing to obtain portions of data from the database and characterizing the clustering of data in the database from newly sampled data and the stored sufficient statistics for each of the multiple cluster models until a specified clustering criteria has been satisfied for one or more of the multiple cluster models. - View Dependent Claims (15, 16, 17, 18, 19, 20)
-
Specification