Number of clusters estimation

US 9,424,337 B2
Filed: 03/04/2014
Issued: 08/23/2016
Est. Priority Date: 07/09/2013
Status: Active Grant

1 Assignment

0 Petitions

Abstract

A method of determining a number of clusters for a dataset is provided. Centroid locations for a defined number of clusters are determined using a clustering algorithm. Boundaries for each of the defined clusters are defined. A reference distribution that includes a plurality of data points is created. The plurality of data points are within the defined boundary of at least one cluster of the defined clusters. Second centroid locations for the defined number of clusters are determined using the clustering algorithm and the reference distribution. A gap statistic for the defined number of clusters based on a comparison between a first residual sum of squares and a second residual sum of squares is computed. The processing is repeated for a next number of clusters to create. An estimated best number of clusters for the received data is determined by comparing the gap statistic computed for each iteration of the number of clusters.
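
The procedure in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names (`kmeans`, `pca_box_sample`, `gap_for_k`, `estimate_k`) are hypothetical, the clustering algorithm is a plain Lloyd's k-means, each reference cluster receives as many points as the corresponding data cluster, the reference step is averaged over `n_ref` repeats, and the best number of clusters is taken at the maximum gap value.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means (an assumed stand-in); returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(c):
        return ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old centroid
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, assign(centroids)


def rss(X, centroids, labels):
    """Residual sum of squares of points around their assigned centroids."""
    return float(((X - centroids[labels]) ** 2).sum())


def pca_box_sample(points, center, n, proportion, rng):
    """Draw n uniform points from a box centred on `center` whose axes follow
    the PCA eigenvectors of `points`; each half-length is `proportion` times
    the eigenvalue of that axis, following the wording of the claims."""
    if len(points) < 2:  # too few points for a covariance matrix
        return np.repeat(center[None, :], n, axis=0)
    cov = np.atleast_2d(np.cov(points, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    half = proportion * np.clip(eigvals, 0.0, None)
    u = rng.uniform(-1.0, 1.0, size=(n, points.shape[1]))
    return center + (u * half) @ eigvecs.T


def gap_for_k(X, k, rng, proportion=1.0, n_ref=5):
    """gap(k) = mean log(W_k*) - log(W_k), averaged over n_ref reference draws."""
    centroids, labels = kmeans(X, k, rng)
    log_w_k = np.log(rss(X, centroids, labels))
    log_w_ref = []
    for _ in range(n_ref):
        ref = np.vstack([
            pca_box_sample(X[labels == j], centroids[j],
                           int((labels == j).sum()), proportion, rng)
            for j in range(k) if np.any(labels == j)
        ])
        ref_centroids, ref_labels = kmeans(ref, k, rng)
        log_w_ref.append(np.log(rss(ref, ref_centroids, ref_labels)))
    return float(np.mean(log_w_ref) - log_w_k)


def estimate_k(X, k_min=1, k_max=6, seed=0):
    """Evaluate every k in [k_min, k_max]; return (k with the largest gap, all gaps)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    gaps = {k: gap_for_k(X, k, rng) for k in range(k_min, k_max + 1)}
    return max(gaps, key=gaps.get), gaps
```

Calling `estimate_k(X, 1, 5)` on an `(n, d)` array with `d >= 2` returns the selected number of clusters together with the gap value for each candidate. Because the box lengths scale with the eigenvalues (variances) rather than their square roots, the result depends on the units of the data in this sketch.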

37 Citations

25 Claims

1. A non-transitory computer-readable medium having stored thereon computer-readable instructions that when executed by a computing device cause the computing device to:
- receive data to cluster;
  
  define a number of clusters to create;
  
  (a) determine centroid locations for the defined number of clusters using a clustering algorithm and the received data to define clusters;
  
  (b) define boundaries for each of the defined clusters by
  
  determining an eigenvector and an eigenvalue for each dimension of each cluster of the defined clusters using principal components analysis;
  
  determining a length for each dimension of each cluster as a proportion of the determined eigenvalue for the respective dimension; and
  
  defining the boundaries for each cluster of the defined clusters as a box with a center of the box as the determined centroid location of the respective cluster, a first boundary point for each dimension defined as the center plus the determined length of the respective dimension aligned with the determined eigenvector of the respective dimension, and a second boundary point for each dimension defined as the center minus the determined length of the respective dimension aligned with the eigenvector of the respective dimension;
  
  (c) create a reference distribution that includes a plurality of data points, wherein the plurality of data points are within the defined boundary of at least one cluster of the defined clusters;
  
  (d) determine second centroid locations for the defined number of clusters using the clustering algorithm and the created reference distribution to define second clusters;
  
  (e) compute a gap statistic for the defined number of clusters based on a comparison between a first residual sum of squares computed for the defined clusters and a second residual sum of squares computed for the defined second clusters;
  
  (f) repeat (a) to (e) with a next number of clusters to create as the defined number of clusters; and
  
  (g) determine an estimated best number of clusters for the received data by comparing the gap statistic computed for each iteration of (e).
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
- - 2. The computer-readable medium of claim 1, wherein the gap statistic for the defined number of clusters is computed using gap(k) = log(W_k*) − log(W_k), where
  - 3. The computer-readable medium of claim 2, wherein (c) and (d) are repeated a plurality of times and wherein
  - 4. The computer-readable medium of claim 2, wherein n_j* is selected based on n_j for cluster j.
  - 5. The computer-readable medium of claim 1, wherein the clustering algorithm is a k-means algorithm.
  - 6. The computer-readable medium of claim 1, wherein the clustering algorithm is a Ward's minimum-variance algorithm.
  - 7. The computer-readable medium of claim 1, wherein the number of clusters to create is defined as a minimum number of clusters in a range of numbers of clusters to evaluate.
  - 8. The computer-readable medium of claim 7, wherein the next number of clusters is defined in (f) by incrementing the defined number of clusters for each iteration of (f).
  - 9. The computer-readable medium of claim 8, wherein (f) is repeated until the next number of clusters is greater than a maximum number of clusters in the range of numbers of clusters to evaluate.
  - 10. The computer-readable medium of claim 1, wherein the plurality of data points are created from a uniform distribution defined within the defined boundary of at least one cluster of the defined clusters.
  - 11. The computer-readable medium of claim 1, wherein the proportion of the determined eigenvalue for the respective dimension is between 0.75 and 1.0.
  - 12. The computer-readable medium of claim 1, wherein the estimated best number of clusters is determined as the defined number of clusters associated with a maximum value of the computed gap statistic.
  - 13. The computer-readable medium of claim 1, wherein the estimated best number of clusters is determined as the defined number of clusters associated with a first local maxima value of the computed gap statistic.
  - 14. The computer-readable medium of claim 1, wherein the computer-readable instructions further cause the computing device to:
    - after (d) and before (f), compute a standard deviation of the second residual sum of squares; and
      
      determine an error gap as a difference between the computed gap statistic and the computed standard deviation, wherein the estimated best number of clusters is determined as the defined number of clusters associated with a minimum defined number of clusters for which the computed gap statistic for that cluster is greater than the determined error gap of a subsequent cluster.
  - 15. The computer-readable medium of claim 1, wherein the computer-readable instructions further cause the computing device to:
    - after (d) and before (f), compute a standard deviation of the second residual sum of squares; and
      
      determine an error gap as a difference between the computed gap statistic and the computed standard deviation; and
      
      after (d) and before (g), determine a first number of clusters as the defined number of clusters associated with a first local maxima value of the computed gap statistic; and
      
      determine a second number of clusters as the defined number of clusters associated with a maximum value of the computed gap statistic;
      
      determine a third number of clusters as the defined number of clusters associated with a minimum defined number of clusters for which the computed gap statistic for that cluster is greater than the determined error gap of a subsequent cluster, wherein the estimated best number of clusters is determined as the determined first number of clusters unless the determined second number of clusters equals the determined third number of clusters in which case the estimated best number of clusters is determined as the determined second number of clusters.
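
The selection rules of claims 12 to 15 can be illustrated with a short sketch. The function names are hypothetical, and two details are assumptions not stated in the claims: endpoints of the evaluated range are treated as eligible for the "first local maxima" test, and the error-gap rule returns `None` when no candidate qualifies. `ks`, `gaps` and `stds` are parallel lists holding each candidate number of clusters, its gap statistic and its standard deviation.

```python
def k_at_max(ks, gaps):
    """Claim 12: the k whose gap statistic is largest (first one on ties)."""
    if not ks:
        raise ValueError("no candidate numbers of clusters")
    return ks[max(range(len(ks)), key=lambda i: gaps[i])]


def k_at_first_local_max(ks, gaps):
    """Claim 13: the first k whose gap exceeds its predecessor's and is not
    smaller than its successor's (endpoints count; an assumption)."""
    if not ks:
        raise ValueError("no candidate numbers of clusters")
    for i in range(len(ks)):
        rises = i == 0 or gaps[i] > gaps[i - 1]
        holds = i == len(ks) - 1 or gaps[i] >= gaps[i + 1]
        if rises and holds:
            return ks[i]
    return k_at_max(ks, gaps)  # not reached for non-empty input


def k_by_error_gap(ks, gaps, stds):
    """Claim 14: the smallest k with gap(k) > gap(next k) - std(next k)."""
    for i in range(len(ks) - 1):
        error_gap_next = gaps[i + 1] - stds[i + 1]
        if gaps[i] > error_gap_next:
            return ks[i]
    return None  # assumption: no candidate satisfied the rule


def combined_choice(ks, gaps, stds):
    """Claim 15: use the first local maximum unless the global maximum and
    the error-gap choice agree, in which case use the global maximum."""
    first = k_at_first_local_max(ks, gaps)
    second = k_at_max(ks, gaps)
    third = k_by_error_gap(ks, gaps, stds)
    return second if second == third else first


ks = [1, 2, 3, 4, 5]
gaps = [0.10, 0.30, 0.25, 0.60, 0.40]
stds = [0.05, 0.05, 0.05, 0.05, 0.05]
print(k_at_max(ks, gaps))               # 4
print(k_at_first_local_max(ks, gaps))   # 2
print(k_by_error_gap(ks, gaps, stds))   # 2
print(combined_choice(ks, gaps, stds))  # 2
```

In this example the global maximum (k = 4) and the error-gap choice (k = 2) disagree, so the combined rule falls back to the first local maximum.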

16. A computing device comprising:
- a processor; and
  
  a non-transitory computer-readable medium operably coupled to the processor, the computer-readable medium having computer-readable instructions stored thereon that, when executed by the processor, cause the computing device to
  
  receive data to cluster;
  
  define a number of clusters to create;
  
  (a) determine centroid locations for the defined number of clusters using a clustering algorithm and the received data to define clusters;
  
  (b) define boundaries for each of the defined clusters by
  
  determining an eigenvector and an eigenvalue for each dimension of each cluster of the defined clusters using principal components analysis;
  
  determining a length for each dimension of each cluster as a proportion of the determined eigenvalue for the respective dimension; and
  
  defining the boundaries for each cluster of the defined clusters as a box with a center of the box as the determined centroid location of the respective cluster, a first boundary point for each dimension defined as the center plus the determined length of the respective dimension aligned with the determined eigenvector of the respective dimension, and a second boundary point for each dimension defined as the center minus the determined length of the respective dimension aligned with the eigenvector of the respective dimension;
  
  (c) create a reference distribution that includes a plurality of data points, wherein the plurality of data points are within the defined boundary of at least one cluster of the defined clusters;
  
  (d) determine second centroid locations for the defined number of clusters using the clustering algorithm and the created reference distribution to define second clusters;
  
  (e) compute a gap statistic for the defined number of clusters based on a comparison between a first residual sum of squares computed for the defined clusters and a second residual sum of squares computed for the defined second clusters;
  
  (f) repeat (a) to (e) with a next number of clusters to create as the defined number of clusters; and
  
  (g) determine a cluster number for the received data by comparing the gap statistic computed for each iteration of (e).
- Dependent Claims (17, 18)
- - 17. The computing device of claim 16, wherein the plurality of data points are created from a uniform distribution defined within the defined boundary of at least one cluster of the defined clusters.
  - 18. The computing device of claim 16, wherein the proportion of the determined eigenvalue for the respective dimension is between 0.75 and 1.0.

19. A system comprising:
- a first computing device comprising
  
  a first processor; and
  
  a first computer-readable medium operably coupled to the first processor, the first computer-readable medium having first computer-readable instructions stored thereon that, when executed by the first processor, cause the first computing device to;
  
  receive data to cluster;
  
  define a number of clusters to create;
  
  (a) determine centroid locations for the defined number of clusters using a clustering algorithm and the received data to define clusters;
  
  (b) define boundaries for each of the defined clusters by
  
  determining an eigenvector and an eigenvalue for each dimension of each cluster of the defined clusters using principal components analysis;
  
  determining a length for each dimension of each cluster as a proportion of the determined eigenvalue for the respective dimension; and
  
  defining the boundaries for each cluster of the defined clusters as a box with a center of the box as the determined centroid location of the respective cluster, a first boundary point for each dimension defined as the center plus the determined length of the respective dimension aligned with the determined eigenvector of the respective dimension, and a second boundary point for each dimension defined as the center minus the determined length of the respective dimension aligned with the eigenvector of the respective dimension;
  
  (c) send a request to a second computing device to define second clusters based on the defined boundaries;
  
  (d) receive a first residual sum of squares computed for the defined second clusters;
  
  (e) compute a gap statistic for the defined number of clusters based on a comparison between a second residual sum of squares computed for the defined clusters and the first residual sum of squares computed for the defined second clusters;
  
  (f) repeat (a) to (e) with a next number of clusters to create as the defined number of clusters; and
  
  (g) determine a cluster number for the received data by comparing the gap statistic computed for each iteration of (e); and
  
  the second computing device comprising
  
  a second processor; and
  
  a second computer-readable medium operably coupled to the second processor, the second computer-readable medium having second computer-readable instructions stored thereon that, when executed by the second processor, cause the second computing device to;
  
  receive a request from the first computing device to define second clusters based on the defined boundaries;
  
  create a reference distribution that includes a plurality of data points, wherein the plurality of data points are within the defined boundary of at least one cluster of the defined clusters;
  
  determine second centroid locations for the defined number of clusters using the clustering algorithm and the created reference distribution to define second clusters;
  
  compute the first residual sum of squares for the defined second clusters; and
  
  send the computed first residual sum of squares to the first computing device.
- Dependent Claims (20, 21, 22)
- - 20. The system of claim 19, wherein the first computing device is a grid control device and the second computing device is a grid node of a plurality of grid nodes controlled by the grid control device.
  - 21. The system of claim 19, wherein the plurality of data points are created from a uniform distribution defined within the defined boundary of at least one cluster of the defined clusters.
  - 22. The system of claim 19, wherein the proportion of the determined eigenvalue for the respective dimension is between 0.75 and 1.0.
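
The division of work in claims 19 and 20 can be sketched as follows. This is only an illustration under assumptions: a thread pool stands in for the grid nodes, the names `kmeans_rss`, `node_task` and `controller_gap` are hypothetical, each node draws its own reference distribution from a distinct seed, and the control device averages the log residual sums of squares it receives.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def kmeans_rss(X, k, rng, n_iter=30):
    """Plain Lloyd's k-means; returns (centroids, labels, residual sum of squares)."""
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(c):
        return ((X[:, None] - c[None]) ** 2).sum(axis=2).argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    labels = assign(centroids)
    return centroids, labels, float(((X - centroids[labels]) ** 2).sum())


def node_task(boxes, k, seed):
    """Second device: sample uniformly inside each box, cluster the sample,
    and return the residual sum of squares of the resulting clusters."""
    rng = np.random.default_rng(seed)
    ref = np.vstack([
        center + (rng.uniform(-1.0, 1.0, size=(n, len(center))) * half) @ axes.T
        for center, axes, half, n in boxes
    ])
    return kmeans_rss(ref, k, rng)[2]


def controller_gap(X, k, n_nodes=4, proportion=1.0, seed=0):
    """First device: cluster the data, define the boxes, send them to the
    nodes, and compute the gap statistic from the returned values."""
    rng = np.random.default_rng(seed)
    centroids, labels, w_k = kmeans_rss(X, k, rng)
    boxes = []
    for j in range(k):
        pts = np.asarray(X, dtype=float)[labels == j]
        if len(pts) < 2:  # assumption: skip clusters too small for PCA
            continue
        eigvals, eigvecs = np.linalg.eigh(np.cov(pts, rowvar=False))
        half = proportion * np.clip(eigvals, 0.0, None)
        boxes.append((centroids[j], eigvecs, half, len(pts)))
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        futures = [pool.submit(node_task, boxes, k, seed + 1 + i)
                   for i in range(n_nodes)]
        w_refs = [f.result() for f in futures]
    return float(np.mean(np.log(w_refs)) - np.log(w_k))
```

Only the box definitions (centre, axes, half-lengths, point count) and the number of clusters travel to the nodes in this sketch, and only a single number comes back from each, which keeps the exchanged data small compared with shipping the reference points themselves.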

23. A method of determining a number of clusters for a dataset, the method comprising:
- receiving data to cluster;
  
  defining a number of clusters to create;
  
  (a) determining centroid locations for the defined number of clusters using a clustering algorithm and the received data to define clusters;
  
  (b) defining boundaries for each of the defined clusters by
  
  determining an eigenvector and an eigenvalue for each dimension of each cluster of the defined clusters using principal components analysis;
  
  determining a length for each dimension of each cluster as a proportion of the determined eigenvalue for the respective dimension; and
  
  defining the boundaries for each cluster of the defined clusters as a box with a center of the box as the determined centroid location of the respective cluster, a first boundary point for each dimension defined as the center plus the determined length of the respective dimension aligned with the determined eigenvector of the respective dimension, and a second boundary point for each dimension defined as the center minus the determined length of the respective dimension aligned with the eigenvector of the respective dimension;
  
  (c) creating a reference distribution that includes a plurality of data points, wherein the plurality of data points are within the defined boundary of at least one cluster of the defined clusters;
  
  (d) determining second centroid locations for the defined number of clusters using the clustering algorithm and the created reference distribution to define second clusters;
  
  (e) computing, by a computing device, a gap statistic for the defined number of clusters based on a comparison between a first residual sum of squares computed for the defined clusters and a second residual sum of squares computed for the defined second clusters;
  
  (f) repeating, by the computing device, (a) to (e) with a next number of clusters to create as the defined number of clusters; and
  
  (g) determining, by the computing device, an estimated best number of clusters for the received data by comparing the gap statistic computed for each iteration of (e).
- Dependent Claims (24, 25)
- - 24. The method of claim 23, wherein the plurality of data points are created from a uniform distribution defined within the defined boundary of at least one cluster of the defined clusters.
  - 25. The method of claim 23, wherein the proportion of the determined eigenvalue for the respective dimension is between 0.75 and 1.0.

Current Assignee: SAS Institute Incorporated
Original Assignee: SAS Institute Incorporated
Inventors: Hall, Patrick; Kaynar Kabul, Ilknur; Sarle, Warren; Silva, Jorge
Primary Examiner(s): Hwa, Shyue Jiunn
Application Number: US14/196,299
Publication Number: US 20150019554A1
Time in Patent Office: 903 Days
Field of Search
US Class Current: 1/1
CPC Class Codes:
G06F 16/285 Clustering or classification
G06F 18/23211 with adaptive number of clu...
