Scalable system for K-means clustering of large databases

US 6,012,058 A
Filed: 03/17/1998
Issued: 01/04/2000
Est. Priority Date: 03/17/1998
Status: Expired due to Term

Abstract

In one exemplary embodiment, the invention provides a data mining system for use in evaluating data in a database. Before the data evaluation begins, a choice is made of a cluster number K for use in categorizing the data in the database into K different clusters, and initial guesses at the means, or centroids, of each cluster are provided. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory. Data contained in the data portion is used to update the original guesses at the centroids of each of the K clusters. Some of the data belonging to a cluster is summarized or compressed and stored as a summarization of the data. More data is accessed from the database and assigned to a cluster. An updated mean for the clusters is determined from the summarized data and the newly acquired data. A stopping criterion is evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criterion has been met.
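
The loop described in the abstract can be sketched in a few lines of Python. This is only an illustrative reading, not the patented implementation: the function name `scalable_kmeans`, the fixed compression `radius`, the `tol` threshold on centroid movement, and the `inner_iters` count are all assumptions introduced here. Records close to their centroid are folded into per-cluster counts and sums (the summarization), and the rest stay in memory as raw records.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def scalable_kmeans(chunks, centroids, radius, tol, inner_iters=10):
    """Hypothetical sketch: cluster data that arrives one chunk at a time."""
    k, dim = len(centroids), len(centroids[0])
    counts = [0] * k                         # records summarized per cluster
    sums = [[0.0] * dim for _ in range(k)]   # attribute sums of summarized records
    retained = []                            # raw records still held in memory

    for chunk in chunks:
        previous = [list(c) for c in centroids]
        working = retained + list(chunk)

        # Update the means from the summaries plus the raw records in memory.
        for _ in range(inner_iters):
            n = list(counts)
            s = [list(v) for v in sums]
            for p in working:
                j = nearest(p, centroids)
                n[j] += 1
                for d in range(dim):
                    s[j][d] += p[d]
            centroids = [
                [s[j][d] / n[j] for d in range(dim)] if n[j] else list(centroids[j])
                for j in range(k)
            ]

        # Summarize records near their centroid; keep the others as raw records.
        retained = []
        for p in working:
            j = nearest(p, centroids)
            if math.dist(p, centroids[j]) <= radius:
                counts[j] += 1
                for d in range(dim):
                    sums[j][d] += p[d]
            else:
                retained.append(p)

        # Assumed stopping criterion: the centroids barely moved on this chunk.
        shift = max(math.dist(a, b) for a, b in zip(previous, centroids))
        if shift < tol:
            break

    return centroids, counts, retained
```

With two well-separated groups of records, this sketch stops as soon as a chunk leaves the centroids unchanged, so any later chunks are never read.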

32 Claims

1. In a computer data processing system, a method for clustering data in a database comprising the steps of:
- a) choosing a cluster number K for use in categorizing the data in the database into K different clusters;
  
  b) accessing data records from a database and bringing a data portion into a rapid access memory;
  
  c) assigning data records from the data portion to one of the K different clusters and determining a mean of the data records assigned to a given cluster;
  
  d) summarizing at least some of the data assigned to the clusters, storing a summarization of the data within the rapid access memory;
  
  e) accessing an additional portion of the data records in the database and bringing said additional portion into the rapid access memory;
  
  f) again assigning data from the database to a cluster and determining an updated mean from the summarized data and the additional portion of data records; and
  
  g) evaluating a criteria to determine if further data should be accessed from the database to continue clustering of data from the database.
  - 2. The method of claim 1 wherein an extended K-means evaluation of the data records and the summarization of data is used to calculate a clustering model that includes a mean for each of the K different clusters in one or less scans of a database and wherein said model is then used as a starting point for further clustering of the database by an alternate clustering process.
  - 3. The method of claim 1 wherein the step of summarizing data includes the substep of identifying data that can be summarized as contributions to a specified one of the K clusters.
  - 4. The method of claim 1 wherein the data records of the database are vectors and the step of summarizing data includes the substep of identifying a discard set of data records that can be summarized as contributions to a specified one of the K clusters and a further step of clustering of some of the data records to produce subclusters of data records for which sufficient statistics are stored.
  - 5. The method of claim 4 wherein the step of again calculating the means of the clusters is performed by adding the sufficient statistics for a subcluster to a closest one of said K clusters.
  - 6. The method of claim 5 wherein the sufficient statistics is derived from compression of data from individual data records falling within a confidence interval in a region of a mean of a cluster of said set of K clusters.
  - 7. The method of claim 5 wherein the sufficient statistics is derived by compressing data from individual data records that are within a specified distance of a mean of one of the K data clusters.
  - 8. The method of claim 5 wherein the sufficient statistics is derived by compressing data from individual data records falling below a threshold of data records within a specified range of a mean of one of the K data clusters.
  - 9. The method of claim 4 wherein the sufficient statistics is derived from either a compression of data records into a single data structure or the creation of multiple data sub-clusters from additional K-means processing of data records.
  - 10. The method of claim 1 wherein the step of summarizing the data includes the substep of classifying the data into at least two groups and wherein data in a first group is compressed and data in a second group is maintained as data records of the database.
  - 11. The method of claim 10 wherein the first group of data comprises two subgroups of data are summarized by compressing and wherein the data in one of the subgroups is classified into a group of subclusters that form a dense enough clustering of data from the database.
  - 12. The method of claim 11 wherein a model of the clustering is updated each time a portion of the data is accessed from the database and wherein at least some of the data records that are read into the rapid access memory are combined with sufficient statistics of the subclusters before using other data records read from the database to form new subclusters.
  - 13. The method of claim 11 additionally comprising the step of combining subclusters of data so long as the combination of subclusters produces a resultant subcluster that meets a denseness criteria.
  - 14. The method of claim 1 wherein the step of characterizing clustering of data within the database comprises the steps of:
    
    a) choosing an initial centroid for each of the K data clusters;
    
    b) assigning data records to a cluster based on nearness to a cluster centroid; and
    
    c) updating the centroids for clusters based on the data from the database.
  - 15. The method of claim 1 wherein the specified criteria stops further characterization based on a comparison of different database models derived from data obtained from the database and the characterization is stopped when a change in said models is less than a specified stopping criteria.
  - 16. The method of claim 1 wherein the specified criteria suspends the characterization of the clustering to allow the characterization to be resumed at a later time.
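
Claims 5 and 13 rely on sufficient statistics being additive, so subclusters can be combined without revisiting the raw records. The sketch below assumes each subcluster is a hypothetical `(n, sums, sumsq)` tuple and assumes the denseness criterion is a cap on the largest per-attribute standard deviation; the claims do not fix either choice.

```python
import math


def merge_stats(a, b):
    """Combine two subclusters given as (n, sums, sumsq) tuples."""
    n = a[0] + b[0]
    sums = [x + y for x, y in zip(a[1], b[1])]
    sumsq = [x + y for x, y in zip(a[2], b[2])]
    return n, sums, sumsq


def max_std(stats):
    """Largest per-attribute standard deviation implied by the statistics."""
    n, sums, sumsq = stats
    return max(
        math.sqrt(max(q / n - (s / n) ** 2, 0.0))  # clamp tiny negative rounding error
        for s, q in zip(sums, sumsq)
    )


def merge_if_dense(a, b, max_allowed_std):
    """Return the merged subcluster if it is still dense enough, else None.

    The threshold test is an assumed stand-in for the denseness criterion.
    """
    merged = merge_stats(a, b)
    return merged if max_std(merged) <= max_allowed_std else None
```

Two subclusters that sit close together merge into one tuple, while a merge that would spread the result too widely is rejected and both subclusters are kept.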

17. A computer readable medium having stored thereon a data structure, comprising:
- a) a first data portion containing a model representation of data stored on a database wherein the model includes K data clusters and wherein the model includes a mean for each cluster and a number of data points assigned to each cluster;
  
  b) a second data portion containing sufficient statistics of a portion of the data in the database; and
  
  c) a third data portion containing individual data records obtained from the database for use with the sufficient statistics to determine said model representation contained in the first data portion.
  - 18. The data structure of claim 17 wherein the model representation associated with a data cluster is a vector containing a summation of data records from the database wherein the vector components are attributes of said data records and further wherein the data associated with a data cluster includes the number of data records from the database associated with said cluster.
  - 19. The data structure of claim 18 wherein the model representation associated with a data cluster contains an additional vector containing a summation of a squaring of each attribute of a data records assigned to an associated data cluster.
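
One way to picture the three data portions of claims 17 to 19 is as a small container holding a model, a set of sufficient statistics, and the retained records. The class and field names below (`SufficientStats`, `ClusteringState`, `model`, `compressed`, `retained`) are hypothetical and are chosen only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class SufficientStats:
    """Count, attribute sums, and sums of squared attributes for a set of records."""
    n: int = 0
    sums: list = field(default_factory=list)    # summation vector (claim 18)
    sumsq: list = field(default_factory=list)   # sum-of-squares vector (claim 19)

    def add(self, record):
        """Fold one record into the statistics."""
        if not self.sums:
            self.sums = [0.0] * len(record)
            self.sumsq = [0.0] * len(record)
        self.n += 1
        for d, x in enumerate(record):
            self.sums[d] += x
            self.sumsq[d] += x * x

    def mean(self):
        """Cluster mean recovered from the sum vector and the count."""
        return [s / self.n for s in self.sums]


@dataclass
class ClusteringState:
    model: list        # first portion: one SufficientStats per cluster
    compressed: list   # second portion: statistics of summarized data
    retained: list     # third portion: individual records kept as-is
```

Storing sums and counts rather than means keeps every update additive: a mean is computed only when it is needed.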

20. In a computer data mining system, apparatus for evaluating data in a database comprising:
- a) one or more data storage devices for storing a database of records on a storage medium;
  
  b) a computer having an interface to the storage devices for reading data from the storage medium and store the data into a rapid access memory for subsequent evaluation; and
  
  c) said computer comprising a processing unit for evaluating at least some of the data in the database and for characterizing the data into multiple numbers of data clusters;
  
  said processing unit programmed to retrieve a subset of data from the database into the rapid access memory, evaluate the subset of data by assigning data records to one of the multiple number of data clusters, and produce a summarization of at least some of the retrieved data before retrieving additional data from the database.
  - 21. The apparatus of claim 20 wherein the processing unit comprises means for performing an extended K-means analysis on a portion of data retrieved from the database and the summarization of another portion of data retrieved from the database.
  - 22. The apparatus of claim 21 wherein the processing unit further comprises means to subclassify data retrieved from the database into the rapid access memory and wherein some of the data is compressed and by means of further K-means processing to further define the clustering of data within the database.
  - 23. The apparatus of claim 20 wherein the processing unit further comprises means to iteratively bring data from the database and update the characterization of data into clusters until a specified criteria has been reached.
  - 24. The apparatus of claim 20 additionally comprising a user interface which updates a user regarding a status of the classification of data and including input means to allow the user to suspend or to stop the process.
  - 25. The apparatus of claim 20 wherein the computer updates multiple clustering models and includes multiple processors for updating said multiple clustering models.

26. In a computer data mining system, a method for use in evaluating data in a database comprising the steps of:
- a) choosing a cluster number K for use in categorizing the data in the database into K different data clusters;
  
  b) choosing an initial centroid for each of the K data clusters;
  
  c) sampling a portion of the data in the database from a storage medium to bring said portion of data within a rapid access memory;
  
  d) assigning individual data records contained within the portion of data to a cluster based on nearness to a cluster centroid and updating the centroids for the clusters based on the data from the database to define a clustering model;
  
  e) compressing some of the data into data records of sufficient statistics for evaluating a clustering model; and
  
  f) continuing to sample data from the database, assigning data from the database to a cluster and determining an updated centroid from the data and the sufficient statistics until an evaluating criteria has been satisfied indicating the centroids have been adequately determined from a sampling of a subset of data within the database or until all data within the database has been evaluated.
  - 27. The method of claim 26 wherein a confidence interval is determined for each of the K clusters and wherein data records for each of the K clusters are compressed if perturbing the centroid of all K clusters by an amount equal to the confidence interval does not change the assignment of a record to its associated cluster.
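
The perturbation test of claim 27 can be illustrated with a conservative worst-case check. The sketch assumes each confidence interval is reduced to a single scalar radius per centroid, which is a simplification made here and is not stated in the claim. A record is treated as compressible when its own centroid, pushed away by its radius, is still closer than every other centroid pulled toward the record by that centroid's radius.

```python
import math


def is_compressible(record, centroids, radii):
    """True if no perturbation within the radii can change the record's cluster.

    `radii[j]` is an assumed scalar stand-in for the confidence interval
    of centroid j.
    """
    dists = [math.dist(record, c) for c in centroids]
    own = min(range(len(centroids)), key=dists.__getitem__)
    worst_own = dists[own] + radii[own]   # own centroid moved away from the record
    return all(
        worst_own < dists[j] - radii[j]   # rival centroid moved toward the record
        for j in range(len(centroids))
        if j != own
    )
```

A record deep inside a cluster passes the check, while a record near the boundary between two clusters fails it and would be kept as an individual record.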

28. In a computer data mining system, a method for evaluating data in a database that is stored on a storage medium comprising the steps of:
- a) initializing multiple storage areas for storing multiple cluster models of the data in the database;
  
  b) obtaining a portion of the data in the database from a storage medium and assigning data records to the clusters of the multiple cluster models;
  
  c) using a clustering criteria to characterize a clustering of data from the portion of data obtained from the database for each model;
  
  d) summarizing at least some of the data contained within the portion of data based upon a compression criteria to produce sufficient statistics for the data satisfying the compression criteria; and
  
  e) continuing to obtain portions of data from the database and characterizing the clustering of data in the database from newly sampled data and the sufficient statistics for each of the multiple cluster models until a specified criteria has been satisfied.
  - 29. The method of claim 28 wherein a portion of the sufficient statistics is unique for each of the clustering models and wherein a portion of the sufficient statistics is shared between different clustering models.
  - 30. The method of claim 28 wherein the specified criteria is reached when iterative solutions for one of the models does not vary by more than a predetermined amount.
  - 31. The method of claim 28 wherein the specified criteria is reached when iterative solutions for a specified number of the models do not vary by more than a predetermined amount.
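
The stopping rules of claims 30 and 31 can be read as counting how many models have stopped changing between iterations. In the sketch below each model is assumed to be a list of centroids, and "vary" is assumed to mean the largest distance any centroid moved; the names `model_change`, `should_stop`, `tol`, and `required` are hypothetical.

```python
import math


def model_change(old, new):
    """Largest distance moved by any centroid between two versions of a model."""
    return max(math.dist(a, b) for a, b in zip(old, new))


def should_stop(old_models, new_models, tol, required=1):
    """True once at least `required` models changed by no more than `tol`.

    required=1 corresponds to claim 30; a larger value corresponds to claim 31.
    """
    stable = sum(
        1 for old, new in zip(old_models, new_models)
        if model_change(old, new) <= tol
    )
    return stable >= required
```

With two models of which only one has settled, the check succeeds for `required=1` and fails for `required=2`.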

32. A computer readable medium for storing program instructions for performing the steps of:
- a) choosing a cluster number K for use in categorizing the data in the database into K different data clusters;
  
  b) choosing an initial centroid for each of the K data clusters;
  
  c) sampling a portion of the data in the database from a storage medium to bring said portion of data within a rapid access memory;
  
  d) assigning individual data records contained within the portion of data to a cluster based on nearness to a cluster centroid and updating the centroids for the clusters based on the data from the database to define a clustering model;
  
  e) compressing some of the data into data records of sufficient statistics for evaluating a clustering model; and
  
  f) continuing to sample data from the database, assigning data from the database to a cluster and determining an updated centroid from the data and the sufficient statistics until an evaluating criteria has been satisfied indicating the centroids have been adequately determined from a sampling of a subset of data within the database or until all data within the database has been evaluated.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Fayyad, Usama; Bradley, Paul S.; Reina, Cory
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): JUNG, DAVID YIUK
Application Number: US09/042,540
Time in Patent Office: 658 Days
Field of Search: 707/1-206
US Class Current: 1/1

CPC Class Codes

G06F 16/35   Clustering; Classification
G06F 18/23213   with fixed number of cluste...
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
