Scalable system for clustering of large databases

US 6,374,251 B1
Filed: 03/17/1998
Issued: 04/16/2002
Est. Priority Date: 03/17/1998
Status: Expired due to Term

First Claim

Patent Images

1. A method for clustering data in a database that is stored on a storage medium comprising the steps of:

a) obtaining a portion of the data in the database from a storage medium;

b) clustering data from the portion of data obtained from the database based upon a clustering criteria to produce a clustering model;

c) compressing at least some of the data contained within the portion of data by evaluating a data compression criteria based on the clustering model and producing sufficient statistics for the data satisfying the compression criteria;

d) storing the sufficient statistics for the data satisfying the compression criteria separate from the clustering model for use in subsequent refinement of said clustering model;

e) continuing to obtain portions of data from the database and refining the clustering model that characterizes data in the database from newly sampled data and the stored sufficient statistics for the data satisfying the compression criteria until a specified stopping criteria has been satisfied; and

f) displaying progress of the characterization of the clustering of the database on a user interface and providing a user controller input for stopping or suspending further building of a database clustering model.

View all claims

3 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A data mining system for use in finding clusters of data items in a database or any other data storage medium. The clusters are used in categorizing the data in the database into K different clusters within each of M models. An initial set of estimates (or guesses) of the parameters of each model to be explored (e.g. centriods in K-means), of each cluster are provided from some source. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original guesses at the parameters of the model in each of the K clusters over all M models. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criteria has been met.

220 Citations

20 Claims

1. A method for clustering data in a database that is stored on a storage medium comprising the steps of:
- a) obtaining a portion of the data in the database from a storage medium;
  
  b) clustering data from the portion of data obtained from the database based upon a clustering criteria to produce a clustering model;
  
  c) compressing at least some of the data contained within the portion of data by evaluating a data compression criteria based on the clustering model and producing sufficient statistics for the data satisfying the compression criteria;
  
  d) storing the sufficient statistics for the data satisfying the compression criteria separate from the clustering model for use in subsequent refinement of said clustering model;
  
  e) continuing to obtain portions of data from the database and refining the clustering model that characterizes data in the database from newly sampled data and the stored sufficient statistics for the data satisfying the compression criteria until a specified stopping criteria has been satisfied; and
  
  f) displaying progress of the characterization of the clustering of the database on a user interface and providing a user controller input for stopping or suspending further building of a database clustering model.
- View Dependent Claims (2, 3, 4)
- - 2. The method of claim 1 wherein the displaying step comprises the step of displaying a user interface for allowing a user to suspend and resume the clustering.
  - 3. The method of claim 1 additionally comprising the step of providing a user interface wherein the user can adjust parameters of the clustering analysis.
  - 4. The method of claim 3 wherein the parameters include a stopping criteria.

5. A computer readable medium having stored thereon a data structure, comprising:
- a) a first data portion containing a model representation of data stored on a database wherein the model representation includes multiple clusters defined by a vector containing a summation of data records from the database having vector components corresponding to attributes of said data records;
  
  b) a second data portion containing sufficient statistics of a portion of the data in the database which summarizes multiple clusters defined by a vector containing a summation of summarized data records from the database having vector components which are attributes of said data records;
  
  c) a third data portion containing individual data records obtained from the database for use with the sufficient statistics to determine said model representation contained in the first data portion; and
  
  d) wherein said sufficient statistics from the second data portion and said individual data records from the third data portion each contribute to said model representation contained in said first data portion.
- View Dependent Claims (6, 7)
- - 6. The data structure of claim 5 wherein the second data portion includes a first subportion that is organized in accordance with data clusters which characterize the database and a second subportion which is not organized according to cluster.
  - 7. The data structure of claim 5 wherein the data associated with a data cluster includes the number of data records from the database associated with said cluster.

8. In a computer data mining system, apparatus for evaluating data in a database comprising:
- a) one or more data storage devices for storing a database of records on a storage medium;
  
  b) a computer having an interface to the data storage devices for reading data from the storage medium and a computer rapid access memory for storing data during subsequent data clustering evaluation; and
  
  c) said computer comprising a processing unit for evaluating at least some of the data in the database and for characterizing the data into multiple numbers of data clusters;
  
  said processing unit programmed to retrieve a subset of data from the database into the rapid access memory, evaluate the subset of data to produce a clustering model that characterizes the database using a clustering criteria, compressing at least some of the retrieved data by summarizing and storing the data in a compressed form separate from the clustering model for use in subsequent refinement of the clustering model and retrieve additional data from the database to update the clustering model based on the summarization of the previously retrieved data and the newly retrieved additional data.
- View Dependent Claims (9, 10, 11, 12, 13)
- - 9. The apparatus of claim 8 wherein the processing unit further comprises means to iteratively bring data from the database and update the characterization of data into clusters until a specified criteria has been reached.
  - 10. The apparatus of claim 8 additionally comprising a video monitor to depict a user interface which updates a user regarding a status of the classification of data and including a command input for allowing a user to suspend or to stop the process.
  - 11. The apparatus of claim 10 wherein the computer presents a user interface on the video monitor for allowing a user to adjust parameters of the clustering process.
  - 12. The apparatus of claim 11 wherein the user interface allows a user to adjust a depiction or organization of data from the database based upon a cluster number to which data is assigned during a clustering process.
  - 13. The apparatus of claim 8 wherein the computer updates multiple clustering models and includes multiple processors for updating said multiple clustering models.

14. In a computer data mining system, a method for evaluating data in a database that is stored on a storage medium to produce multiple clustering models in one scan of the database or less comprising the steps of:
- a) initializing multiple storage areas for storing multiple cluster models of the data in the database;
  
  b) obtaining a portion of the data in the database from a storage medium;
  
  c) using a clustering criteria to characterize a clustering of data from the portion of data obtained from the database for each model;
  
  d) compressing at least some of the data contained within the portion of data based upon a compression criteria to produce for storage separate from the clustering models sufficient statistics for the data satisfying the compression criteria for use in refining the clustering models; and
  
  e) continuing to obtain portions of data from the database and characterizing the clustering of data in the database from newly sampled data and the stored sufficient statistics for each of the multiple cluster models until a specified clustering criteria has been satisfied for one or more of the multiple cluster models.
- View Dependent Claims (15, 16, 17, 18, 19, 20)
- - 15. The method of claim 14 wherein a portion of the sufficient statistics is unique for each of the clustering models and wherein a portion of the sufficient statistics is shared between different clustering models.
  - 16. The method of claim 14 wherein the specified criteria is reached when iterative solutions for one of the models does not vary by more that a predetermined amount.
  - 17. The method of claim 14 wherein the specified criteria is reached when iterative solutions for a specified number of the models do not vary by more that a predetermined amount.
  - 18. The method of claim 14 wherein the multiple models are generating from multiple different starting conditions.
  - 19. The method of claim 18 wherein at least some of the models have different numbers of clusters.
  - 20. The method of claim 14 wherein multiple clustering models are generated in an inner processing loop and wherein an outer processing loop iterates over different values of cluster number.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Bradley, Paul S., Reina, Cory, Fayyad, Usama
Primary Examiner(s)
Black, Thomas
Assistant Examiner(s)
LE, UYEN T

Application Number

US09/040,219
Time in Patent Office

1,491 Days
Field of Search

707/102, 707/101, 707/1, 707/5, 345/772, 345/764, 345/859, 706/11
US Class Current

1/1
CPC Class Codes

G06F 16/35   Clustering; Classification

G06F 18/2321   using statistics or functio...

G06F 2216/03   Data mining

Y10S 707/99942   Manipulating data structure...

Scalable system for clustering of large databases

First Claim

3 Assignments

0 Petitions

Accused Products

Abstract

220 Citations

20 Claims

Specification

Solutions

Use Cases

Quick Links

Scalable system for clustering of large databases

First Claim

3 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

220 Citations

20 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links