Varying cluster number in a scalable clustering system for use with large databases
First Claim
1. In a computer system, a method for characterizing data into clusters comprising the steps of:
- a) providing a candidate cluster set for characterizing a database of data stored on a storage medium, wherein the candidate cluster set includes two or more clustering models having a different number of cluster in their clustering model;
b) reading a data portion from the database and determining how the data portion fits clustering model within the candidate cluster set;
c) choosing a best fit of the data portion to determine a selected clustering model from the candidate cluster set and then using the cluster number of said selected clustering model to update the selected clustering model using data portions from the database; and
d) updating the clustering model using newly sampled data from the database until a specified clustering criteria has been satisfied.
2 Assignments
0 Petitions
Accused Products
Abstract
In one exemplary embodiment the invention provides a data mining system for use in finding cluster of data items in a database or any other data storage medium. A portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original model data distributions in each of the K clusters in a clustering model. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. Each time the data is read from the database, a holdout set of data is used to evaluate the model then current as well as other possible cluster models chosen from a candidate set of cluster models. The evaluation of the holdout data set allows a cluster model with a different cluster number K′ to be chosen if that model more accurately models the data based upon the evaluation of the holdout set.
-
Citations
30 Claims
-
1. In a computer system, a method for characterizing data into clusters comprising the steps of:
-
a) providing a candidate cluster set for characterizing a database of data stored on a storage medium, wherein the candidate cluster set includes two or more clustering models having a different number of cluster in their clustering model;
b) reading a data portion from the database and determining how the data portion fits clustering model within the candidate cluster set;
c) choosing a best fit of the data portion to determine a selected clustering model from the candidate cluster set and then using the cluster number of said selected clustering model to update the selected clustering model using data portions from the database; and
d) updating the clustering model using newly sampled data from the database until a specified clustering criteria has been satisfied. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
-
-
17. A computer readable medium having stored thereon a data structure, comprising:
-
a) a first storage portion for storing a data clustering model from data gathered from a database;
said clustering model including a number of model summaries equal to a cluster number wherein a model summary for a given cluster comprises a summation of weighting factors from multiple data records;
b) a second storage portion for storing sufficient statistics of at least some of the data records obtained from the database;
c) a third storage portion containing individual data records obtained from the database for use with the sufficient statistics in deter said clustering model; and
d) said third storage portion including a holdout data portion for use in evaluating the sufficiency of the cluster model and adjusting the cluster number of said model. - View Dependent Claims (18)
-
-
19. In a computer data mining system, apparatus for evaluating data in a database comprising:
-
a) one or more data storage devices for storing a database of records on a storage medium;
b) a computer having an interface to the storage devices for reading data from the storage medium and bring the data into a rapid access memory for subsequent evaluation; and
c) said computer comprising a processing unit for evaluating at least some of the data in the database and for characterizing the data into multiple numbers of data clusters;
said processing unit programmed to retrieve data records from the database into the rapid access memory, evaluate the data records contribution to the multiple number of data clusters based upon an existing data model, and then summarize at least some of the data before retrieving additional data from the database to build a cluster model from the retrieved data,d) wherein said processing unit comprises means for maintaining a data structure that contains DS, CS, and RS data and further wherein the processing unit comprises means for choosing a cluster number K from data in the DS, CS and RS data structures and providing a cluster model based on the chosen cluster number.
-
-
20. In a computer system, apparatus for characterizing data into clusters comprising the steps of:
-
a) means for providing a candidate cluster set for characterizing a database of data stored on a storage medium, wherein the candidate cluster set includes two or more clustering models having a different number of clusters in their clustering model;
b) means for reading a data portion from the database and determining how the data portion fits clustering models within the candidate cluster set;
c) means for choosing a best fit of the data portion to determine a clustering model from the candidate cluster set and then using the cluster number of said selected clustering model to update the selected clustering model using data portions from the database; and
d) means for updating the clustering model using newly sampled data from the database until a specified clustering criteria has been satisfied. - View Dependent Claims (21, 22)
-
-
23. A computer readable medium having computer-executable instructions for performing steps comprising:
-
a) providing a candidate cluster set for characterizing a database of data stored on a storage medium, wherein the candidate cluster set includes two or more clustering models having a different number of clusters in their clustering model;
b) reading a data portion from the database and determining how the data portion fits clustering models within the candidate cluster set;
c) choosing a best fit of the data portion to determine a selected clustering model from the candidate cluster set and then using the cluster number of said selected clustering model to update the selected clustering model using data portions from the database; and
d) updating the clustering model using newly sampled data from the database until a specified clustering criteria has been satisfied. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30)
-
Specification