Varying cluster number in a scalable clustering system for use with large databases

US 6,449,612 B1
Filed: 06/30/2000
Issued: 09/10/2002
Est. Priority Date: 03/17/1998
Status: Expired due to Term

Abstract

In one exemplary embodiment the invention provides a data mining system for use in finding clusters of data items in a database or any other data storage medium. A portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original model data distributions in each of the K clusters in a clustering model. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. Each time the data is read from the database, a holdout set of data is used to evaluate the model then current as well as other possible cluster models chosen from a candidate set of cluster models. The evaluation of the holdout data set allows a cluster model with a different cluster number K′ to be chosen if that model more accurately models the data based upon the evaluation of the holdout set.
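The compression step described in the abstract can be illustrated with a minimal sketch. It assumes one-dimensional records and uses count, sum, and sum of squares as the sufficient statistics; the class name `SufficientStats` and its methods are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class SufficientStats:
    """Compressed summary of 1-D records: count, sum, and sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x: float) -> None:
        # Fold one record into the summary; the raw record can then be discarded.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other: "SufficientStats") -> "SufficientStats":
        # Summaries from separate buffer loads combine by simple addition.
        return SufficientStats(self.n + other.n,
                               self.total + other.total,
                               self.total_sq + other.total_sq)

    @property
    def mean(self) -> float:
        return self.total / self.n

    @property
    def variance(self) -> float:
        # Population variance recovered from the summary alone.
        return self.total_sq / self.n - self.mean ** 2


# First buffer load: summarize the records.
summary = SufficientStats()
for x in [1.0, 2.0, 3.0]:
    summary.add(x)

# Second buffer load: summarize new records, then merge with the earlier summary.
new_batch = SufficientStats()
for x in [4.0, 5.0]:
    new_batch.add(x)
combined = summary.merge(new_batch)
```

Because the summaries add, cluster parameters can be refreshed after each buffer load without revisiting records that were already compressed.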

30 Claims

1. In a computer system, a method for characterizing data into clusters comprising the steps of:
- a) providing a candidate cluster set for characterizing a database of data stored on a storage medium, wherein the candidate cluster set includes two or more clustering models having a different number of clusters in their clustering model;
  
  b) reading a data portion from the database and determining how the data portion fits clustering models within the candidate cluster set;
  
  c) choosing a best fit of the data portion to determine a selected clustering model from the candidate cluster set and then using the cluster number of said selected clustering model to update the selected clustering model using data portions from the database; and
  
  d) updating the clustering model using newly sampled data from the database until a specified clustering criteria has been satisfied.
  - 2. The process of claim 1 wherein if a change in cluster number is made during the step of choosing a best fit the change is to a clustering model having a larger cluster number.
  - 3. The method of claim 1 wherein the step of reading data includes the substep of maintaining a holdout data set from the data gathered from the database for use in choosing the best fit.
  - 4. The method of claim 1 wherein said updating step maintains a data structure that contains compressed data that defines multiple data subclusters different from the K clusters in a current clustering model and wherein the step of providing a candidate set of cluster models adds one or more subclusters as additional clusters to a current clustering model.
  - 5. The method of claim 4 wherein the candidate set of cluster models is chosen from a multiple number of subclusters organized according to size for determining which of said one or more subclusters are added as clusters to a current clustering model.
  - 6. The method of claim 4 wherein each of the multiple data subclusters is evaluated in sequence as a candidate additional cluster in the current clustering model.
  - 7. The method of claim 6 additionally comprising the step of maintaining a buffer of sufficient statistics and wherein each subcluster is added to a current clustering model to form a candidate clustering model and wherein the sufficient statistics are then used to update the candidate clustering model, said updated candidate clustering model then compared with the current clustering model to choose the best fit.
  - 8. The method of claim 7 wherein the step of comparing is performed by fitting a test set of data from the database that is used to score the current clustering model and the updated candidate clustering model.
  - 9. The method of claim 8 wherein the step of fitting the test data is performed by evaluating the log likelihood of the test data over a function representing the candidate cluster models.
  - 10. The method of claim 1 additionally comprising the step of maintaining a buffer of sufficient statistics representing data from the database used in creating a current clustering model and wherein clusters that make up a current clustering model are evaluated as candidate clusters for removal from the current clustering model to reduce the cluster number by removing each candidate cluster and reclustering the reduced cluster number candidate clustering model using the sufficient statistics from the buffer and comparing the current clustering model to the candidate clustering model having a reduced cluster number.
  - 11. The method of claim 1 wherein the step of determining to update the clustering number is based on evaluating a holdout set to determine if a sufficient number of records in the holdout set are accurately modeled by the current model.
  - 12. The method of claim 6 additionally comprising the step of maintaining a buffer of sufficient statistics that includes sufficient statistics for a plurality of subclusters and wherein the cluster number is increased by adding clusters corresponding to the subclusters until a sufficient percentage of the test set data points are sufficiently characterized by the model having a larger cluster number.
  - 13. The method of claim 4 wherein the candidate set of cluster models is chosen based on K cluster model functions and c additional cluster model functions.
  - 14. The method of claim 6 wherein one or more of the cluster models in the candidate set has a cluster number less than K.
  - 15. The method of claim 1 wherein the step of updating the cluster model is performed using an expectation maximization clustering process.
  - 16. The method of claim 1 wherein the step of evaluating the candidate set containing two or more clustering models is performed each time data is obtained from the database.
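Claims 5 through 9 describe trying subclusters, ordered by size, as additional clusters and keeping a candidate only if it scores better on test data by log likelihood. The following is a rough sketch of that selection loop, assuming one-dimensional Gaussian clusters. The function names (`log_likelihood`, `to_model`, `grow`) and the sample values are hypothetical, and the re-fitting of each candidate from sufficient statistics mentioned in claim 7 is omitted for brevity.

```python
import math


def log_likelihood(model, data):
    """Sum over records of log(sum_k w_k * N(x | mean_k, var_k))."""
    total = 0.0
    for x in data:
        density = sum(
            w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in model
        )
        # Floor the density so a far-away record cannot trigger log(0).
        total += math.log(max(density, 1e-300))
    return total


def to_model(components):
    """Turn (count, mean, variance) components into (weight, mean, variance)."""
    n_all = sum(n for n, _, _ in components)
    return [(n / n_all, m, v) for n, m, v in components]


def grow(components, subclusters, holdout):
    """Try subclusters largest first; keep each one only if the holdout score improves."""
    best = components
    best_score = log_likelihood(to_model(best), holdout)
    for sub in sorted(subclusters, key=lambda s: s[0], reverse=True):
        candidate = best + [sub]  # candidate model with one more cluster
        score = log_likelihood(to_model(candidate), holdout)
        if score > best_score:
            best, best_score = candidate, score
    return to_model(best), best_score


# Hypothetical example: one current cluster, two compressed subclusters.
components = [(100, 0.0, 1.0)]
subclusters = [(5, 0.2, 1.0), (50, 10.0, 1.0)]
holdout = [-0.5, 0.3, 0.1, 9.8, 10.4, 10.1]
baseline = log_likelihood(to_model(components), holdout)
model, score = grow(components, subclusters, holdout)
```

Scoring on held-out records rather than on the records used to build the model is what lets a larger cluster number be rejected when the extra cluster does not generalize.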

17. A computer readable medium having stored thereon a data structure, comprising:
- a) a first storage portion for storing a data clustering model from data gathered from a database;
  
  said clustering model including a number of model summaries equal to a cluster number wherein a model summary for a given cluster comprises a summation of weighting factors from multiple data records;
  
  b) a second storage portion for storing sufficient statistics of at least some of the data records obtained from the database;
  
  c) a third storage portion containing individual data records obtained from the database for use with the sufficient statistics in determining said clustering model; and
  
  d) said third storage portion including a holdout data portion for use in evaluating the sufficiency of the cluster model and adjusting the cluster number of said model.
  - 18. The computer readable medium of claim 17 additionally comprising an additional storage medium for storing data records for access by a computer processing unit, which allows the data records in the additional storage medium to be brought into the third storage portion and then rewritten to the additional storage medium for sequential access.
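The three storage portions of claim 17 can be pictured as a small in-memory layout. This is only an illustrative sketch with one-dimensional records; every class and field name here (`ModelSummary`, `CompressedGroup`, `RecordStore`, `ClusteringState`, `absorb`) is an assumption made for the example and does not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelSummary:
    """One cluster's summary, accumulated from weighted record contributions."""
    weight_sum: float = 0.0      # summation of weighting factors
    weighted_total: float = 0.0  # sum of weight * value (1-D values for brevity)

    @property
    def mean(self) -> float:
        return self.weighted_total / self.weight_sum


@dataclass
class CompressedGroup:
    """Sufficient statistics standing in for records that were summarized."""
    count: int
    total: float
    total_sq: float


@dataclass
class RecordStore:
    """Individual records kept in memory, including a holdout portion."""
    working: List[float] = field(default_factory=list)
    holdout: List[float] = field(default_factory=list)


@dataclass
class ClusteringState:
    model: List[ModelSummary]                                         # a) first portion
    compressed: List[CompressedGroup] = field(default_factory=list)  # b) second portion
    records: RecordStore = field(default_factory=RecordStore)        # c), d) third portion

    @property
    def cluster_number(self) -> int:
        # The cluster number equals the number of model summaries.
        return len(self.model)

    def absorb(self, x: float, weights: List[float]) -> None:
        """Add one record's per-cluster weighting factors to the model summaries."""
        for summary, w in zip(self.model, weights):
            summary.weight_sum += w
            summary.weighted_total += w * x


state = ClusteringState(model=[ModelSummary(), ModelSummary()])
state.records.holdout.extend([0.2, 9.7])
state.absorb(1.0, [0.9, 0.1])
state.absorb(3.0, [0.5, 0.5])
```

Keeping the holdout records inside the same store as the other individual records mirrors element d), where the holdout portion is part of the third storage portion.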

19. In a computer data mining system, apparatus for evaluating data in a database comprising:
- a) one or more data storage devices for storing a database of records on a storage medium;
  
  b) a computer having an interface to the storage devices for reading data from the storage medium and bringing the data into a rapid access memory for subsequent evaluation; and
  
  c) said computer comprising a processing unit for evaluating at least some of the data in the database and for characterizing the data into multiple numbers of data clusters;
  
  said processing unit programmed to retrieve data records from the database into the rapid access memory, evaluate the data records contribution to the multiple number of data clusters based upon an existing data model, and then summarize at least some of the data before retrieving additional data from the database to build a cluster model from the retrieved data,
  
  d) wherein said processing unit comprises means for maintaining a data structure that contains DS, CS, and RS data and further wherein the processing unit comprises means for choosing a cluster number K from data in the DS, CS and RS data structures and providing a cluster model based on the chosen cluster number.

20. In a computer system, apparatus for characterizing data into clusters comprising the steps of:
- a) means for providing a candidate cluster set for characterizing a database of data stored on a storage medium, wherein the candidate cluster set includes two or more clustering models having a different number of clusters in their clustering model;
  
  b) means for reading a data portion from the database and determining how the data portion fits clustering models within the candidate cluster set;
  
  c) means for choosing a best fit of the data portion to determine a clustering model from the candidate cluster set and then using the cluster number of said selected clustering model to update the selected clustering model using data portions from the database; and
  
  d) means for updating the clustering model using newly sampled data from the database until a specified clustering criteria has been satisfied.
  - 21. The apparatus of claim 20 additionally comprising means for maintaining a buffer of sufficient statistics representing data from the database used in creating a current clustering model and wherein clusters that make up a current clustering model are evaluated as candidate clusters for removal from the current clustering model to reduce the cluster number by removing each candidate cluster and reclustering the reduced cluster number candidate clustering model using the sufficient statistics from the buffer and comparing the current clustering model to the candidate clustering model having a reduced cluster number.
  - 22. The apparatus of claim 20 additionally comprising means for maintaining a buffer of sufficient statistics representing data from the database used in creating a current clustering model including a plurality of subclusters not included in the clusters of an existing clustering model and wherein said plurality of subclusters are evaluated as candidate clusters for addition to the current clustering model to increase the cluster number by adding candidate subclusters and reclustering the increased cluster number candidate clustering model using the sufficient statistics from the buffer and comparing the current clustering model to the candidate clustering model having an increased cluster number.
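Claims 21 and 22 both rely on reclustering a candidate model from buffered sufficient statistics rather than from raw records. One way this might look for the removal case is sketched below, assuming one-dimensional Gaussian clusters and an expectation maximization update in which every record inside a summary shares the responsibilities computed at the summary's centroid. That shared-responsibility shortcut, the function names `em_step` and `remove_and_recluster`, and the sample numbers are all assumptions for illustration, not the patent's stated procedure.

```python
import math


def em_step(model, summaries, min_var=1e-6):
    """One EM pass over (count, sum, sum_sq) summaries; model is [(weight, mean, var)]."""
    k = len(model)
    n_k, s_k, q_k = [0.0] * k, [0.0] * k, [0.0] * k
    for n, s, q in summaries:
        centroid = s / n
        # E-step in log space to avoid underflow for distant clusters.
        logs = [
            math.log(w) - (centroid - m) ** 2 / (2 * v) - 0.5 * math.log(2 * math.pi * v)
            for w, m, v in model
        ]
        top = max(logs)
        expd = [math.exp(value - top) for value in logs]
        z = sum(expd)
        for j in range(k):
            r = expd[j] / z  # responsibility of cluster j for this summary
            n_k[j] += r * n
            s_k[j] += r * s
            q_k[j] += r * q
    n_all = sum(n_k)
    updated = []
    for j in range(k):
        if n_k[j] <= 0.0:
            continue  # cluster received no mass; leave it out
        mean = s_k[j] / n_k[j]
        var = max(q_k[j] / n_k[j] - mean ** 2, min_var)
        updated.append((n_k[j] / n_all, mean, var))
    return updated


def remove_and_recluster(model, drop, summaries, steps=5):
    """Drop one cluster, renormalize weights, then re-fit on the summaries."""
    reduced = [c for i, c in enumerate(model) if i != drop]
    total_w = sum(w for w, _, _ in reduced)
    reduced = [(w / total_w, m, v) for w, m, v in reduced]
    for _ in range(steps):
        reduced = em_step(reduced, summaries)
    return reduced


# Hypothetical example: two well-separated groups, three clusters, one redundant.
model = [(0.4, 0.0, 1.0), (0.4, 10.0, 1.0), (0.2, 10.5, 1.0)]
summaries = [(50, 0.0, 50.0), (50, 500.0, 5050.0)]
reduced = remove_and_recluster(model, drop=2, summaries=summaries)
```

The reduced model returned here would then be compared against the current model, for example on a holdout set, before the smaller cluster number is accepted.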

23. A computer readable medium having computer-executable instructions for performing steps comprising:
- a) providing a candidate cluster set for characterizing a database of data stored on a storage medium, wherein the candidate cluster set includes two or more clustering models having a different number of clusters in their clustering model;
  
  b) reading a data portion from the database and determining how the data portion fits clustering models within the candidate cluster set;
  
  c) choosing a best fit of the data portion to determine a selected clustering model from the candidate cluster set and then using the cluster number of said selected clustering model to update the selected clustering model using data portions from the database; and
  
  d) updating the clustering model using newly sampled data from the database until a specified clustering criteria has been satisfied.
  - 24. The computer readable medium of claim 23 wherein the step of reading data includes the substep of maintaining a holdout data set from the data gathered from the database for use in choosing the best fit.
  - 25. The computer readable medium of claim 23 wherein said updating step maintains a data structure that contains compressed data that defines multiple data subclusters different from the K clusters in a current clustering model and wherein the step of providing a candidate set of cluster models adds one or more subclusters as additional clusters to a current clustering model.
  - 26. The computer readable medium of claim 25 wherein the candidate set of cluster models is chosen from a multiple number of subclusters organized according to size for determining which of said one or more subclusters are added as clusters to a current clustering model.
  - 27. The computer readable medium of claim 26 additionally comprising the step of maintaining a buffer of sufficient statistics and wherein each subcluster is added to a current clustering model to form a candidate clustering model and wherein the sufficient statistics are then used to update the candidate cluster model, said updated candidate clustering model then compared with the current clustering model to choose the best fit.
  - 28. The computer readable medium of claim 27 wherein the step of comparing is performed by fitting a test set of data from the database that is used to score the current clustering model and the updated candidate clustering model.
  - 29. The computer readable medium of claim 28 wherein the step of fitting the test data is performed by evaluating the log likelihood of the test data over a function representing the candidate cluster models.
  - 30. The computer readable medium of claim 23 additionally comprising the step of maintaining a buffer of sufficient statistics representing data from the database used in creating a current clustering model and wherein clusters that make up a current clustering model are evaluated as candidate clusters for removal from the current clustering model to reduce the cluster number by removing each candidate cluster and reclustering the reduced cluster number candidate clustering model using the sufficient statistics from the buffer and comparing the current clustering model to the candidate clustering model having a reduced cluster number.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Fayyad, Usama; Bradley, Paul S.
Primary Examiner(s): Amsbury, Wayne
Assistant Examiner(s): Havan, Thu Thao
Application Number: US09/607,365
Time in Patent Office: 802 Days
Field of Search: 707/6, 707/101, 707/102, 707/104.1, 707/201, 707/7, 704/5, 704/245, 704/243, 704/244, 709/106, 382/225, 382/226, 382/227
US Class Current: 1/1

CPC Class Codes
G06F 16/30   of unstructured textual dat...
G06F 18/217   Validation; Performance eva...
G06F 18/2321   using statistics or functio...
G06F 2216/03   Data mining
Y10S 707/99936   Pattern matching access
Y10S 707/99945   Object-oriented database st...
Y10S 707/99952   Coherency, e.g. same view t...
