
Number of clusters estimation

  • US 9,424,337 B2
  • Filed: 03/04/2014
  • Issued: 08/23/2016
  • Est. Priority Date: 07/09/2013
  • Status: Active Grant
First Claim

1. A non-transitory computer-readable medium having stored thereon computer-readable instructions that when executed by a computing device cause the computing device to:

    receive data to cluster;

    define a number of clusters to create;

    (a) determine centroid locations for the defined number of clusters using a clustering algorithm and the received data to define clusters;

    (b) define boundaries for each of the defined clusters by determining an eigenvector and an eigenvalue for each dimension of each cluster of the defined clusters using principal components analysis;

    determining a length for each dimension of each cluster as a proportion of the determined eigenvalue for the respective dimension; and

    defining the boundaries for each cluster of the defined clusters as a box with a center of the box as the determined centroid location of the respective cluster, a first boundary point for each dimension defined as the center plus the determined length of the respective dimension aligned with the determined eigenvector of the respective dimension, and a second boundary point for each dimension defined as the center minus the determined length of the respective dimension aligned with the eigenvector of the respective dimension;

    (c) create a reference distribution that includes a plurality of data points, wherein the plurality of data points are within the defined boundary of at least one cluster of the defined clusters;

    (d) determine second centroid locations for the defined number of clusters using the clustering algorithm and the created reference distribution to define second clusters;

    (e) compute a gap statistic for the defined number of clusters based on a comparison between a first residual sum of squares computed for the defined clusters and a second residual sum of squares computed for the defined second clusters;

    (f) repeat (a) to (e) with a next number of clusters to create as the defined number of clusters; and

    (g) determine an estimated best number of clusters for the received data by comparing the gap statistic computed for each iteration of (e).
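The steps (a) to (g) of the claim can be sketched roughly as follows. This is an illustrative reading only, not the patented implementation. It assumes k-means (Lloyd's algorithm) as the clustering algorithm, a uniform reference distribution inside each box, the logarithmic form of the gap statistic averaged over a few reference draws, and "largest gap wins" as the comparison rule. None of these choices is specified in the claim. The names `kmeans`, `rss`, `sample_in_boxes`, `estimate_k`, and the parameters `proportion` and `n_refs` are hypothetical.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means (an assumed choice); returns (centroids, labels)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Recompute labels so they always match the returned centroids.
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dist.argmin(axis=1)


def rss(X, centroids, labels):
    """Residual sum of squares of points about their assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())


def sample_in_boxes(X, centroids, labels, rng, proportion=1.0):
    """Steps (b) and (c): PCA-aligned box per cluster, then points inside it.

    Half-length per dimension = proportion * eigenvalue, following the claim
    wording; the value of `proportion` is an assumption. Each cluster gets as
    many uniformly drawn reference points as it has members (also assumed).
    """
    parts = []
    for j, center in enumerate(centroids):
        members = X[labels == j]
        if len(members) < 2:  # no covariance available: reuse the centroid
            parts.append(np.repeat(center[None, :], len(members), axis=0))
            continue
        cov = np.atleast_2d(np.cov(members, rowvar=False))
        eigvals, eigvecs = np.linalg.eigh(cov)  # columns are eigenvectors
        half = proportion * np.clip(eigvals, 0.0, None)
        u = rng.uniform(-1.0, 1.0, size=(len(members), X.shape[1]))
        # center +/- length along each eigenvector spans the box.
        parts.append(center + (u * half) @ eigvecs.T)
    return np.vstack(parts)


def estimate_k(X, k_values, seed=0, n_refs=5):
    """Steps (a) to (g); returns (best_k, {k: gap})."""
    rng = np.random.default_rng(seed)
    eps = 1e-12  # guards against log(0)
    gaps = {}
    for k in k_values:                                        # (f)
        centroids, labels = kmeans(X, k, rng)                 # (a)
        log_w = np.log(rss(X, centroids, labels) + eps)
        ref_log_w = []
        for _ in range(n_refs):
            ref = sample_in_boxes(X, centroids, labels, rng)  # (b), (c)
            ref_c, ref_l = kmeans(ref, k, rng)                # (d)
            ref_log_w.append(np.log(rss(ref, ref_c, ref_l) + eps))
        gaps[k] = float(np.mean(ref_log_w) - log_w)           # (e)
    best = max(gaps, key=gaps.get)                            # (g), assumed rule
    return best, gaps
```

A call such as `estimate_k(X, range(1, 6))`, with `X` an `(n_samples, n_features)` array, returns the selected number of clusters together with the gap value computed for each candidate. Other comparison rules, for example the first k whose gap exceeds the next gap minus a standard-error term, fit the same loop.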
