Computer-implemented systems and methods for variable clustering in large data sets

US 8,190,612 B2
Filed: 12/17/2008
Issued: 05/29/2012
Est. Priority Date: 12/17/2008
Status: Active Grant

Abstract

Computer-implemented systems and methods are provided for creating a cluster structure from a data set containing input variables. Global clusters are created within a first stage, by computing a similarity matrix from the data set. A global cluster structure and sub-cluster structure are created within a second stage, where the global cluster structure and the sub-cluster structure are created using a latent variable clustering technique and the cluster structure output is generated by combining the created global cluster structure and the created sub-cluster structure.
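
The first stage described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: it uses a signed Pearson correlation matrix as the similarity matrix, average-linkage agglomerative merging as a stand-in for the global clustering step, and a centroid component (unweighted mean of standardized variables) for each cluster. The function names, the synthetic data, and the choice of two clusters are all hypothetical, and the second-stage latent variable clustering is omitted.

```python
import math
import random

def standardize(col):
    """Center a column and scale it to unit sample variance."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
    return [(x - mean) / sd for x in col]

def similarity_matrix(cols):
    """Pearson correlations between standardized attribute columns."""
    n = len(cols[0])
    return [[sum(x * y for x, y in zip(a, b)) / (n - 1) for b in cols]
            for a in cols]

def global_clusters(sim, k):
    """Merge the two most similar clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = len(clusters[i]) * len(clusters[j])
                s = sum(sim[a][b] for a in clusters[i] for b in clusters[j]) / pairs
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

def centroid_component(cols, members):
    """Unweighted average of the standardized attributes in one cluster."""
    n_rows = len(cols[0])
    return [sum(cols[m][r] for m in members) / len(members) for r in range(n_rows)]

def reduce_dimensionality(columns, k):
    """Return the attribute clusters and one component per cluster."""
    cols = [standardize(c) for c in columns]
    clusters = global_clusters(similarity_matrix(cols), k)
    components = [centroid_component(cols, m) for m in clusters]
    return clusters, components

# Hypothetical data: six attributes driven by two independent latent factors.
rng = random.Random(0)
n_obs = 200
f1 = [rng.gauss(0, 1) for _ in range(n_obs)]
f2 = [rng.gauss(0, 1) for _ in range(n_obs)]
columns = ([[f + rng.gauss(0, 0.3) for f in f1] for _ in range(3)]
           + [[f + rng.gauss(0, 0.3) for f in f2] for _ in range(3)])

clusters, components = reduce_dimensionality(columns, k=2)
print(clusters)
print(len(columns), "attributes reduced to", len(components))
```

Each cluster is replaced by a single component, so the output has fewer attributes than the input. Because the similarity here is a signed correlation, strongly negatively correlated attributes are not grouped together; a different similarity measure would change that behavior.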


15 Claims

1. A computer-implemented method for reducing dimensionality of a data set, comprising:
- accessing, using one or more data processors, a data set including a plurality of observations, wherein each observation has an associated number of attributes, wherein each attribute has a corresponding value, and wherein a number of attributes represents a dimensionality;
- generating, using the one or more data processors, a similarity matrix using the data set, wherein the similarity matrix identifies degrees of similarity among the attributes;
- generating, using the one or more data processors, global clusters of attributes using the similarity matrix, wherein a global cluster includes a subset of the attributes, and wherein the attributes are grouped in the global clusters according to the degrees of similarity;
- generating, using the one or more data processors, a global cluster structure using the global clusters of attributes, wherein generating the global cluster structure includes determining a component for each global cluster of attributes, and performing a latent variable technique using the components;
- generating, using the one or more data processors, a sub-cluster structure using the global clusters of attributes, wherein generating a sub-cluster structure includes performing the latent variable technique or a different latent variable technique on each global cluster of attributes; and
- combining, using the one or more data processors, the global cluster structure and the sub-cluster structure to generate a cluster structure that has a fewer number of attributes than the accessed data set, wherein the fewer number of attributes represents a reduced dimensionality.
- Dependent claims (2 to 13):
  - 2. The method of claim 1, wherein the global clusters of attributes are homogeneous groups of attributes.
  - 3. The method of claim 1, wherein the similarity matrix is a distance matrix, correlation matrix, or a covariance matrix.
  - 4. The method of claim 1, wherein a pre-defined number of global clusters of attributes are generated.
  - 5. The method of claim 4, wherein the pre-defined number of global clusters of attributes is chosen based upon a pre-selected criterion.
  - 6. The method of claim 5, wherein the pre-selected criterion is a cubic clustering criterion (CCC).
  - 7. The method of claim 1, wherein latent variables are generated by the latent variable technique or the different latent variable technique; and wherein the latent variables are used to generate the global cluster structure and the sub-cluster structure.
  - 8. The method of claim 1, wherein the components are principal components or centroid components.
  - 9. The method of claim 1, wherein the latent variable technique and the different latent variable technique include factor analysis, principal component analysis, or simple unweighted average of variables.
  - 10. The method of claim 1, wherein generating the sub-cluster structure includes determining a component for each sub-cluster of the global clusters of attributes.
  - 11. The method of claim 1, wherein the attributes of the accessed data set are independent variables for predicting a target within a prediction model; and wherein the attributes of the combined cluster structure are independent variables for predicting the target within the prediction model.
  - 12. The method of claim 11, wherein the prediction model provides predictions of whether a customer is likely to purchase a product or service.
  - 13. The method of claim 11, further comprising:
    - using the cluster structure to perform a data mining operation.
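
The component options named in claims 8 and 9 can be compared with a short sketch. The code below is an illustration under assumptions, not the method as implemented by the assignee: it computes the first principal component of one cluster by power iteration on the cluster's correlation matrix and compares it with the centroid component (simple unweighted average). All function names and the synthetic data are hypothetical.

```python
import math
import random

def standardize(col):
    """Center a column and scale it to unit sample variance."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
    return [(x - mean) / sd for x in col]

def correlation_matrix(cols):
    """Correlation matrix of already standardized columns."""
    n = len(cols[0])
    return [[sum(x * y for x, y in zip(a, b)) / (n - 1) for b in cols]
            for a in cols]

def first_principal_component(cols, iterations=200):
    """Leading eigenvector of the correlation matrix via power iteration."""
    r = correlation_matrix(cols)
    w = [1.0] * len(cols)
    for _ in range(iterations):
        w = [sum(r[i][j] * w[j] for j in range(len(w))) for i in range(len(w))]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    # Project every observation onto the unit-length weight vector.
    scores = [sum(w[j] * cols[j][row] for j in range(len(cols)))
              for row in range(len(cols[0]))]
    return w, scores

def centroid_component(cols):
    """Simple unweighted average of the standardized variables."""
    return [sum(c[row] for c in cols) / len(cols) for row in range(len(cols[0]))]

def pearson(a, b):
    """Pearson correlation between two raw columns."""
    sa, sb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(sa, sb)) / (len(a) - 1)

# Hypothetical cluster: three attributes that share one latent factor.
rng = random.Random(1)
factor = [rng.gauss(0, 1) for _ in range(300)]
cluster = [standardize([f + rng.gauss(0, 0.3) for f in factor]) for _ in range(3)]

weights, pc_scores = first_principal_component(cluster)
centroid = centroid_component(cluster)
agreement = pearson(pc_scores, centroid)
print("weights:", [round(w, 3) for w in weights])
print("correlation between the two components:", round(agreement, 4))
```

When the attributes in a cluster are about equally correlated with each other, the principal component weights are nearly equal and the two components carry almost the same information. The centroid is cheaper to compute, while the principal component adapts its weights when some attributes represent the cluster better than others.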

14. A system for reducing dimensionality of a data set, comprising:
- one or more processors;
- one or more non-transitory computer-readable storage mediums containing instructions configured to cause the one or more processors to perform operations including:
  - accessing a data set including a plurality of observations, wherein each observation has an associated number of attributes, wherein each attribute has a corresponding value, and wherein a number of attributes represents a dimensionality;
  - generating a similarity matrix using the data set, wherein the similarity matrix identifies degrees of similarity among the attributes;
  - generating global clusters of attributes using the similarity matrix, wherein a global cluster includes a subset of the attributes, and wherein the attributes are grouped in the global clusters according to the degrees of similarity;
  - generating a global cluster structure using the global clusters of attributes, wherein generating the global cluster structure includes determining a component for each global cluster of attributes, and performing a latent variable technique using the components;
  - generating a sub-cluster structure using the global clusters of attributes, wherein generating a sub-cluster structure includes performing the latent variable technique or a different latent variable technique on each global cluster of attributes; and
  - combining the global cluster structure and the sub-cluster structure to generate a cluster structure that has a fewer number of attributes than the accessed data set, wherein the fewer number of attributes represents a reduced dimensionality.

15. A non-transitory computer program product for reducing dimensionality of a data set, tangibly embodied in a machine-readable non-transitory storage medium, including instructions configured to cause a data processing system to:
- access a data set including a plurality of observations, wherein each observation has an associated number of attributes, wherein each attribute has a corresponding value, and wherein a number of attributes represents a dimensionality;
- generate a similarity matrix using the data set, wherein the similarity matrix identifies degrees of similarity among the attributes;
- generate global clusters of attributes using the similarity matrix, wherein a global cluster includes a subset of the attributes, and wherein the attributes are grouped in the global clusters according to the degrees of similarity;
- generate a global cluster structure using the global clusters of attributes, wherein generating the global cluster structure includes determining a component for each global cluster of attributes, and performing a latent variable technique using the components;
- generate a sub-cluster structure using the global clusters of attributes, wherein generating a sub-cluster structure includes performing the latent variable technique or a different latent variable technique on each global cluster of attributes; and
- combine the global cluster structure and the sub-cluster structure to generate a cluster structure that has a fewer number of attributes than the accessed data set, wherein the fewer number of attributes represents a reduced dimensionality.

Current Assignee: SAS Institute Incorporated
Original Assignee: SAS Institute Incorporated
Inventors: Lee, Taiyeong; Duling, David Rawlins; Latour, Dominique Joseph
Primary Examiner(s): Lee, Wilson
Application Number: US12/336,874
Publication Number: US 20100153456A1
Time in Patent Office: 1,259 Days
Field of Search: 707/736-752
US Class Current: 707/737
CPC Class Codes: G06F 16/2453 (Query optimisation); G06F 18/23 (Clustering techniques)
