Computerized cluster analysis framework for decorrelated cluster identification in datasets

US 9,202,178 B2
Filed: 12/02/2014
Issued: 12/01/2015
Est. Priority Date: 03/11/2014
Status: Active Grant

Abstract

A computing device to automatically cluster a dataset is provided. Data that includes a plurality of observations with a plurality of data points defined for each observation is received. Each data point of the plurality of data points is associated with a variable to define a plurality of variables. A number of clusters into which to segment the received data is repeatedly selected by repeatedly executing a clustering algorithm with the received data. A plurality of sets of clusters is defined based on the repeated execution of the clustering algorithm that resulted in the selected number of clusters. A plurality of composite clusters is defined based on the defined plurality of sets of clusters. The plurality of observations is assigned to the defined plurality of composite clusters using the plurality of data points defined for each observation.
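
The composite-cluster step summarized in the abstract (and spelled out in claims 22, 23, and 27) can be illustrated with a minimal Python sketch. It is not the patented implementation. The greedy closest-pair matching, the running-mean centroid update, the assumption that every set holds the same number of clusters, and all function names (`pair_centroids`, `build_composite`, `assign`) are assumptions made for this illustration; the claims only require an optimum pairing "based on a minimum distance" and an update "based on" the paired centroids.

```python
import math


def pair_centroids(composite, centroids):
    """Greedy one-to-one pairing: the closest still-unpaired pair is taken first."""
    candidates = sorted(
        (math.dist(c, m), i, j)
        for i, c in enumerate(centroids)
        for j, m in enumerate(composite)
    )
    used_i, used_j, pairing = set(), set(), {}
    for _, i, j in candidates:
        if i not in used_i and j not in used_j:
            pairing[i] = j
            used_i.add(i)
            used_j.add(j)
    return pairing  # index in `centroids` -> index in `composite`


def build_composite(cluster_sets):
    """Initialize from the first set of centroids, then fold in each later set."""
    composite = [list(c) for c in cluster_sets[0]]
    for n_seen, centroids in enumerate(cluster_sets[1:], start=1):
        pairing = pair_centroids(composite, centroids)
        for i, j in pairing.items():
            # Running mean: n_seen sets are already averaged into composite[j].
            composite[j] = [
                (m * n_seen + x) / (n_seen + 1)
                for m, x in zip(composite[j], centroids[i])
            ]
    return composite


def assign(observations, composite):
    """Assign each observation to the nearest composite centroid."""
    return [
        min(range(len(composite)), key=lambda j: math.dist(obs, composite[j]))
        for obs in observations
    ]


# Three made-up clustering runs that found the same two clusters,
# but listed them in different orders.
cluster_sets = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(10.2, 9.8), (0.2, -0.2)],
    [(-0.2, 0.2), (9.8, 10.2)],
]
composite = build_composite(cluster_sets)
labels = assign([(1.0, 1.0), (9.0, 9.0), (0.0, -1.0)], composite)
print(composite, labels)
```

A greedy match is the simplest choice; an assignment solver that minimizes the total pairing distance would be a reasonable alternative when clusters lie close together.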

47 Citations

30 Claims

1. A non-transitory computer-readable medium having stored thereon computer-readable instructions that when executed by a computing device cause the computing device to:
- receive data that includes a plurality of observations with a plurality of data points defined for each observation, wherein each data point of the plurality of data points is associated with a variable to define a plurality of variables;
  
  repeatedly select a number of clusters into which to segment the received data by repeatedly executing a clustering algorithm with the received data;
  
  define a plurality of sets of clusters based on the repeated execution of the clustering algorithm that resulted in the selected number of clusters;
  
  define a plurality of composite clusters based on the defined plurality of sets of clusters; and
  
  assign the plurality of observations to the defined plurality of composite clusters using the plurality of data points defined for each observation.
- - 2. The computer-readable medium of claim 1, wherein selecting the number of clusters comprises:
    - defining a test number of clusters to create;
      
      (a) determining centroid locations for the defined test number of clusters using the clustering algorithm and the received data to define test clusters;
      
      (b) creating a reference distribution that includes a plurality of reference data points, wherein the plurality of reference data points are within a boundary defined for the received data;
      
      (c) determining second centroid locations for the defined test number of clusters using the clustering algorithm and the created reference distribution to define second test clusters;
      
      (d) computing a first residual sum of squares for the defined test clusters;
      
      (e) computing a second residual sum of squares for the defined second test clusters;
      
      (f) computing a gap statistic for the defined test number of clusters based on a comparison between the computed first residual sum of squares and the computed second residual sum of squares;
      
      (g) repeating (a) to (f) with a next test number of clusters to create as the defined test number of clusters;
      
      determining an estimated best number of clusters for the received data by comparing the gap statistic computed for each iteration of (d); and
      
      selecting the number of clusters as the determined estimated best number of clusters.
  - 3. The computer-readable medium of claim 2, wherein the boundary includes a cluster boundary for each of the defined test clusters and the plurality of reference data points are within the cluster boundary of at least one cluster of the defined test clusters.
  - 4. The computer-readable medium of claim 2, wherein (b) and (c) are repeated a plurality of times.
  - 5. The computer-readable medium of claim 2, wherein the test number of clusters to create is defined as a minimum number of clusters in a range of numbers of clusters to evaluate, and the next number of clusters is defined in (g) by incrementing the defined test number of clusters for each iteration of (g).
  - 6. The computer-readable medium of claim 5, wherein (g) is repeated until the next number of clusters is greater than a maximum number of clusters in the range of numbers of clusters to evaluate.
  - 7. The computer-readable medium of claim 2, wherein the estimated best number of clusters is determined as the defined number of clusters associated with a maximum value of the computed gap statistic or with a first local maxima value of the computed gap statistic.
  - 8. The computer-readable medium of claim 1, wherein repeatedly selecting the number of clusters comprises:
    - randomly selecting a first subset of the plurality of variables;
      
      selecting a first number of clusters into which to segment the received data by repeatedly executing the clustering algorithm with the received data using only the data points associated with the randomly selected first subset of the plurality of variables;
      
      randomly selecting a second subset of the plurality of variables that is different from the first subset of the plurality of variables; and
      
      selecting a second number of clusters into which to segment the received data by repeatedly executing the clustering algorithm with the received data using only the data points associated with the randomly selected second subset of the plurality of variables.
  - 9. The computer-readable medium of claim 8, wherein the random selection of the second subset of the plurality of variables and the selection of the second number of clusters is repeated a predefined number of times.
  - 10. The computer-readable medium of claim 9, wherein the number of clusters is selected from the selected first number of clusters and the repeated selections of the second number of clusters.
  - 11. The computer-readable medium of claim 1, wherein the computer-readable instructions further cause the computing device to select a plurality of decorrelated variables from the plurality of variables.
  - 12. The computer-readable medium of claim 11, wherein selecting the plurality of decorrelated variables comprises:
    - computing a correlation value between each of the plurality of variables to define a correlation matrix;
      
      comparing a binary threshold value to each correlation value to define a binary similarity matrix from the defined correlation matrix;
      
      defining an undirected graph comprising a subgraph that includes one or more connected nodes, wherein the undirected graph is defined based on the defined binary similarity matrix, wherein the undirected graph stores connectivity information for the plurality of variables, wherein each node of the subgraph is pairwise associated with a variable of the plurality of variables;
      
      selecting a least connected node from the defined undirected graph based on the connectivity information;
      
      removing the selected least connected node from the undirected graph; and
      
      outputting variables pairwise associated with remaining nodes of the undirected graph as the selected decorrelated variables when a stop criterion is satisfied.
  - 13. The computer-readable medium of claim 12, wherein selecting the least connected node and removing the selected least connected node are repeated a plurality of times.
  - 14. The computer-readable medium of claim 12, wherein the least connected node is selected randomly from a plurality of least connected nodes.
  - 15. The computer-readable medium of claim 14, wherein the random selection comprises comparing a randomly determined value to a predefined drop percentage value.
  - 16. The computer-readable medium of claim 12, wherein the connectivity information is updated after the selected least connected node is removed.
  - 17. The computer-readable medium of claim 16, wherein selecting the least connected node and removing the selected least connected node are repeated after updating the connectivity information.
  - 18. The computer-readable medium of claim 12, wherein the connectivity information comprises a connectivity counter value defined for each node in the undirected graph, wherein the connectivity counter value indicates a number of connections between the respective node and the remaining nodes.
  - 19. The computer-readable medium of claim 12, wherein the stop criterion is satisfied when the number of remaining nodes equals a predefined minimum number of nodes.
  - 20. The computer-readable medium of claim 12, wherein the stop criterion is satisfied when the number of remaining nodes equals a predefined percentage of one or more connected nodes included in the defined undirected graph.
  - 21. The computer-readable medium of claim 12, wherein the defined undirected graph includes a plurality of subgraphs, and the stop criterion is satisfied when the plurality of subgraphs each include a single node.
  - 22. The computer-readable medium of claim 1, wherein defining the plurality of composite clusters based on the defined plurality of sets of clusters comprises:
    - initializing composite cluster centroid locations for each composite cluster of the composite clusters pairwise with cluster centroid locations of a first set of clusters of the defined plurality of sets of clusters;
      
      selecting a second set of clusters of the defined plurality of sets of clusters;
      
      selecting second cluster centroid locations of the selected second set of clusters;
      
      computing distances pairwise between each pairing of the selected second cluster centroid locations and the composite cluster centroid locations;
      
      selecting an optimum pairing based on a minimum distance of the computed distances; and
      
      updating the composite cluster centroid locations based on the selected second cluster centroid locations and the selected optimum pairing.
  - 23. The computer-readable medium of claim 22, wherein defining the plurality of composite clusters based on the defined plurality of sets of clusters further comprises repeating, for each of the defined plurality of sets of clusters as the selected second set of clusters, the selection of the second cluster centroid locations, the computation of the distances pairwise, the selection of the optimum pairing, and the update of the composite cluster centroid locations.
  - 24. The computer-readable medium of claim 23, wherein defining the plurality of composite clusters based on the defined plurality of sets of clusters further comprises updating cluster assignments for the plurality of observations based on the selected optimum pairing, wherein the update of the cluster assignments is repeated for each of the defined plurality of sets of clusters as the selected second set of clusters.
  - 25. The computer-readable medium of claim 24, wherein defining the plurality of composite clusters based on the defined plurality of sets of clusters further comprises computing a probability of assigning each observation of the plurality of observations to each composite cluster of the composite clusters based on the updated cluster assignments.
  - 26. The computer-readable medium of claim 25, wherein assigning an observation of the plurality of observations to the defined plurality of composite clusters is based on the probability of assigning the observation to each composite cluster.
  - 27. The computer-readable medium of claim 23, wherein, after the repeating for each of the defined plurality of sets of clusters as the selected second set of clusters, assigning the plurality of observations to the defined plurality of composite clusters comprises, for each observation of the plurality of observations:
    - computing cluster distances between the plurality of data points of an observation and each of the composite cluster centroid locations;
      
      selecting a minimum distance of the computed cluster distances;
      
      selecting a minimum composite cluster associated with the selected minimum distance; and
      
      assigning the observation to the selected minimum composite cluster.
  - 28. The computer-readable medium of claim 23, wherein the computer-readable instructions further cause the computing device to:
    - store the selected second cluster centroid locations pairwise in association with each composite cluster of the composite clusters based on the selected optimum pairing before the repeating for each of the defined plurality of sets of clusters as the selected second set of clusters;
      
      repeating the storing of the selected second cluster centroid locations for each of the defined plurality of sets of clusters as the selected second set of clusters;
      
      wherein, after the repeating for each of the defined plurality of sets of clusters as the selected second set of clusters, the computer-readable instructions further cause the computing device to:
      
      create noised centroid location data from the stored selected second cluster centroid locations;
      
      train a multi-layer neural network with the created noised centroid location data;
      
      determine a projected centroid location as values of hidden units of a middle layer of the trained multi-layer neural network; and
      
      output the determined, projected centroid location in a graph.
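
The gap-statistic procedure of claim 2, steps (a) to (g), can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the claim only calls for "a comparison" between the two residual sums of squares, so the log-difference form used here (the conventional form of the gap statistic) is an assumption, as are the plain Lloyd's k-means, the farthest-point seeding, the bounding-box reference distribution, and every function name and parameter.

```python
import math
import random


def init_centroids(points, k, rng):
    """Farthest-point seeding: spreads the k starting centroids apart."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return [list(c) for c in centroids]


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    centroids = init_centroids(points, k, rng)
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def rss(points, centroids, labels):
    """Residual sum of squares: squared distance of each point to its centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def gap_statistic(points, k, rng, n_refs=5):
    """Steps (a) to (f) for one test number of clusters k."""
    log_rss = math.log(rss(points, *kmeans(points, k, rng)) + 1e-12)
    # Reference points are drawn uniformly inside the data's bounding box.
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    ref_logs = []
    for _ in range(n_refs):  # claim 4: (b) and (c) repeated several times
        ref = [
            tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in points
        ]
        ref_logs.append(math.log(rss(ref, *kmeans(ref, k, rng)) + 1e-12))
    return sum(ref_logs) / n_refs - log_rss


def select_k(points, k_min, k_max, seed=0):
    """Step (g) plus the final selection: the k with the largest gap (claim 7)."""
    rng = random.Random(seed)
    gaps = {k: gap_statistic(points, k, rng) for k in range(k_min, k_max + 1)}
    return max(gaps, key=gaps.get), gaps


# Toy data: three well-separated 2-D blobs (made up for this illustration).
_rng = random.Random(1)
data = [
    (cx + _rng.gauss(0, 1), cy + _rng.gauss(0, 1))
    for cx, cy in [(0, 0), (10, 0), (5, 10)]
    for _ in range(30)
]
best_k, gaps = select_k(data, k_min=1, k_max=5)
print(best_k, {k: round(g, 2) for k, g in gaps.items()})
```

The loop over `range(k_min, k_max + 1)` mirrors claims 5 and 6, which start at the minimum of a range and increment until the maximum is passed. Selecting the first local maximum instead of the global one, the other option named in claim 7, would only change the last line of `select_k`.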

29. A computing device comprising:
- a processor; and
  
  a non-transitory computer-readable medium operably coupled to the processor, the computer-readable medium having computer-readable instructions stored thereon that, when executed by the processor, cause the computing device to receive data that includes a plurality of observations with a plurality of data points defined for each observation, wherein each data point of the plurality of data points is associated with a variable to define a plurality of variables;
  
  repeatedly select a number of clusters into which to segment the received data by repeatedly executing a clustering algorithm with the received data;
  
  define a plurality of sets of clusters based on the repeated execution of the clustering algorithm that resulted in the selected number of clusters;
  
  define a plurality of composite clusters based on the defined plurality of sets of clusters; and
  
  assign the plurality of observations to the defined plurality of composite clusters using the plurality of data points defined for each observation.

30. A method of automatically clustering a dataset, the method comprising:
- receiving data that includes a plurality of observations with a plurality of data points defined for each observation, wherein each data point of the plurality of data points is associated with a variable to define a plurality of variables;
  
  repeatedly selecting, by a computing device, a number of clusters into which to segment the received data by repeatedly executing a clustering algorithm with the received data;
  
  defining, by the computing device, a plurality of sets of clusters based on the repeated execution of the clustering algorithm that resulted in the selected number of clusters;
  
  defining, by the computing device, a plurality of composite clusters based on the defined plurality of sets of clusters; and
  
  assigning, by the computing device, the plurality of observations to the defined plurality of composite clusters using the plurality of data points defined for each observation.
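
The decorrelated-variable selection recited in claims 11 to 21 can also be sketched in Python. The sketch below is a reading of those claims under several assumptions, not the patented implementation: the absolute Pearson correlation is used as the correlation value, only nodes that still have at least one connection are candidates for removal (so isolated variables survive, consistent with the single-node stop criterion of claim 21), ties are broken by column order instead of the random selection of claims 14 and 15, and the names `pearson`, `select_decorrelated`, `threshold`, and `min_nodes` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_decorrelated(columns, threshold=0.8, min_nodes=1):
    """columns maps a variable name to its list of values."""
    names = list(columns)
    # Binary similarity matrix stored as an undirected graph:
    # an edge joins two variables whose |correlation| meets the threshold.
    neighbors = {v: set() for v in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                neighbors[a].add(b)
                neighbors[b].add(a)
    # Stop criterion of claim 19: a predefined minimum number of nodes remains.
    while len(neighbors) > min_nodes:
        connected = [v for v in neighbors if neighbors[v]]
        if not connected:
            break  # claim 21: every remaining subgraph is a single node
        victim = min(connected, key=lambda v: len(neighbors[v]))
        # Remove the least connected node and update connectivity (claim 16).
        for other in neighbors.pop(victim):
            neighbors[other].discard(victim)
    return list(neighbors)


# Made-up example: a, b, and c are strongly correlated; d is not.
columns = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12.5],
    "c": [6, 5, 4, 3, 2, 1],
    "d": [2, 9, 4, 7, 1, 6],
}
kept = select_decorrelated(columns)
print(kept)
```

Here the set size of each node's neighbor set plays the role of the connectivity counter value of claim 18.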

Current Assignee: SAS Institute Incorporated
Original Assignee: SAS Institute Incorporated
Inventors: Hall, Patrick; Kaynar Kabul, Ilknur; Dean, Jared Langford; Abbey, Ralph; Haller, Susan; Silva, Jorge
Primary Examiner(s): Chaki, Kakali
Assistant Examiner(s): Sitiriche, Luis A
Application Number: US14/558,136
Publication Number: US 20150261846A1
Time in Patent Office: 364 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes

G06F 16/22   Indexing; Data structures t...

G06F 16/2425   Iterative querying; Query f...

G06F 16/248   Presentation of query results

G06F 16/284   Relational databases

G06F 16/285   Clustering or classification

G06F 16/34   Browsing; Visualisation the...

G06F 16/9024   Graphs; Linked lists G06F16...

G06F 17/18   for evaluating statistical ...

G06F 18/214   Generating training pattern...

G06F 18/232   Non-hierarchical techniques

G06F 18/24137   Distances to cluster centroïds

G06N 20/00   Machine learning

G06N 3/045   Combinations of networks

G06N 3/08   Learning methods

G06N 3/088   Non-supervised learning, e....

G06N 5/02   Knowledge representation; S...
