Computer-implemented systems and methods for variable clustering in large data sets
Abstract
Computer-implemented systems and methods are provided for creating a cluster structure from a data set containing input variables. Global clusters are created within a first stage, by computing a similarity matrix from the data set. A global cluster structure and sub-cluster structure are created within a second stage, where the global cluster structure and the sub-cluster structure are created using a latent variable clustering technique and the cluster structure output is generated by combining the created global cluster structure and the created sub-cluster structure.
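As an illustration of the first stage described in the abstract, the following minimal Python sketch computes a similarity matrix over the attributes and derives global clusters from it. The absolute Pearson correlation as the similarity measure, the connected-components grouping, the 0.5 threshold, the function names, and the toy data are all assumptions made for this example; the abstract and claims do not prescribe any of them.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def similarity_matrix(columns):
    """Assumed similarity measure: absolute correlation between attributes."""
    p = len(columns)
    return [[abs(pearson(columns[i], columns[j])) for j in range(p)]
            for i in range(p)]

def global_clusters(sim, threshold=0.5):
    """Assumed grouping rule: attributes linked by a similarity of at least
    `threshold` end up in the same global cluster (connected components)."""
    p = len(sim)
    seen, clusters = set(), []
    for start in range(p):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(p):
                if j not in seen and sim[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters

# Hypothetical toy data: 8 attributes (one list each), 8 observations.
columns = [
    [3, 3, 1, 1, -1, -1, -3, -3],
    [3.5, 2.5, 1.5, 0.5, -0.5, -1.5, -2.5, -3.5],
    [1, 1, 3, 3, -3, -3, -1, -1],
    [1.5, 0.5, 3.5, 2.5, -2.5, -3.5, -0.5, -1.5],
    [3, 1, -1, -3, -3, -1, 1, 3],
    [3.5, 0.5, -1.5, -2.5, -2.5, -1.5, 0.5, 3.5],
    [1, 3, -3, -1, -1, -3, 3, 1],
    [1.5, 2.5, -3.5, -0.5, -0.5, -3.5, 2.5, 1.5],
]

sim = similarity_matrix(columns)
clusters = global_clusters(sim, threshold=0.5)
print(clusters)
```

In this toy data the first four attributes correlate with one another and are uncorrelated with the last four, so the grouping yields two global clusters of four attributes each. A production implementation would likely use a proper clustering algorithm on the similarity matrix rather than a fixed threshold.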
15 Claims
1. A computer-implemented method for reducing dimensionality of a data set, comprising:
accessing, using one or more data processors, a data set including a plurality of observations, wherein each observation has an associated number of attributes, wherein each attribute has a corresponding value, and wherein a number of attributes represents a dimensionality;
generating, using the one or more data processors, a similarity matrix using the data set, wherein the similarity matrix identifies degrees of similarity among the attributes;
generating, using the one or more data processors, global clusters of attributes using the similarity matrix, wherein a global cluster includes a subset of the attributes, and wherein the attributes are grouped in the global clusters according to the degrees of similarity;
generating, using the one or more data processors, a global cluster structure using the global clusters of attributes, wherein generating the global cluster structure includes determining a component for each global cluster of attributes, and performing a latent variable technique using the components;
generating, using the one or more data processors, a sub-cluster structure using the global clusters of attributes, wherein generating a sub-cluster structure includes performing the latent variable technique or a different latent variable technique on each global cluster of attributes; and
combining, using the one or more data processors, the global cluster structure and the sub-cluster structure to generate a cluster structure that has a fewer number of attributes than the accessed data set, wherein the fewer number of attributes represents a reduced dimensionality.
14. A system for reducing dimensionality of a data set, comprising:
one or more processors;
one or more non-transitory computer-readable storage mediums containing instructions configured to cause the one or more processors to perform operations including:
accessing a data set including a plurality of observations, wherein each observation has an associated number of attributes, wherein each attribute has a corresponding value, and wherein a number of attributes represents a dimensionality;
generating a similarity matrix using the data set, wherein the similarity matrix identifies degrees of similarity among the attributes;
generating global clusters of attributes using the similarity matrix, wherein a global cluster includes a subset of the attributes, and wherein the attributes are grouped in the global clusters according to the degrees of similarity;
generating a global cluster structure using the global clusters of attributes, wherein generating the global cluster structure includes determining a component for each global cluster of attributes, and performing a latent variable technique using the components;
generating a sub-cluster structure using the global clusters of attributes, wherein generating a sub-cluster structure includes performing the latent variable technique or a different latent variable technique on each global cluster of attributes; and
combining the global cluster structure and the sub-cluster structure to generate a cluster structure that has a fewer number of attributes than the accessed data set, wherein the fewer number of attributes represents a reduced dimensionality.
15. A non-transitory computer program product for reducing dimensionality of a data set, tangibly embodied in a machine-readable non-transitory storage medium, including instructions configured to cause a data processing system to:
access a data set including a plurality of observations, wherein each observation has an associated number of attributes, wherein each attribute has a corresponding value, and wherein a number of attributes represents a dimensionality;
generate a similarity matrix using the data set, wherein the similarity matrix identifies degrees of similarity among the attributes;
generate global clusters of attributes using the similarity matrix, wherein a global cluster includes a subset of the attributes, and wherein the attributes are grouped in the global clusters according to the degrees of similarity;
generate a global cluster structure using the global clusters of attributes, wherein generating the global cluster structure includes determining a component for each global cluster of attributes, and performing a latent variable technique using the components;
generate a sub-cluster structure using the global clusters of attributes, wherein generating a sub-cluster structure includes performing the latent variable technique or a different latent variable technique on each global cluster of attributes; and
combine the global cluster structure and the sub-cluster structure to generate a cluster structure that has a fewer number of attributes than the accessed data set, wherein the fewer number of attributes represents a reduced dimensionality.
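The second-stage steps recited in the claims (a component per global cluster, a latent variable technique on those components, a latent variable technique within each global cluster, and the final combination) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the cluster component is approximated by the standardized mean of the member attributes, the latent variable technique is replaced by a basic two-way split that reassigns each attribute to the component it correlates with most strongly, and the global clusters are hard-coded as if a first stage had already produced them. All function names and the toy data are hypothetical. The sketch also ignores correlation signs and degenerate cases, which a real implementation would need to handle.

```python
import math

def standardize(col):
    """Center a column and scale it to unit variance."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
    return [(x - mean) / sd for x in col]

def corr(x, y):
    """Correlation of two already standardized columns."""
    return sum(a * b for a, b in zip(x, y)) / len(x)

def component(cols):
    """Stand-in cluster component: standardized mean of the member columns."""
    return standardize([sum(vals) / len(cols) for vals in zip(*cols)])

def split_in_two(z):
    """Stand-in latent variable step: split standardized columns into two
    groups, each represented by its own component."""
    # Seed with the first column and the column least correlated with it.
    far = min(range(len(z)), key=lambda i: corr(z[0], z[i]) ** 2)
    comps = [z[0], z[far]]
    assign = None
    for _ in range(20):
        new = [0 if corr(v, comps[0]) ** 2 >= corr(v, comps[1]) ** 2 else 1
               for v in z]
        if new == assign:
            break
        assign = new
        for g in (0, 1):
            members = [v for v, a in zip(z, assign) if a == g]
            if members:
                comps[g] = component(members)
    return assign

def build_structure(columns, global_clusters):
    z = [standardize(c) for c in columns]
    # Global cluster structure: one component per global cluster, then the
    # stand-in latent variable step applied to those components.
    global_comps = [component([z[i] for i in m]) for m in global_clusters]
    global_groups = split_in_two(global_comps)
    # Sub-cluster structure: the same step applied inside each global cluster.
    structure = []
    for g, members in enumerate(global_clusters):
        labels = split_in_two([z[i] for i in members])
        subs = [[m for m, lab in zip(members, labels) if lab == s]
                for s in (0, 1)]
        # Combine both levels into one nested record per global cluster.
        structure.append({"global_group": global_groups[g],
                          "sub_clusters": [s for s in subs if s]})
    return structure

# Hypothetical toy data: 8 attributes (one list each), 8 observations.
columns = [
    [3, 3, 1, 1, -1, -1, -3, -3],
    [3.5, 2.5, 1.5, 0.5, -0.5, -1.5, -2.5, -3.5],
    [1, 1, 3, 3, -3, -3, -1, -1],
    [1.5, 0.5, 3.5, 2.5, -2.5, -3.5, -0.5, -1.5],
    [3, 1, -1, -3, -3, -1, 1, 3],
    [3.5, 0.5, -1.5, -2.5, -2.5, -1.5, 0.5, 3.5],
    [1, 3, -3, -1, -1, -3, 3, 1],
    [1.5, 2.5, -3.5, -0.5, -0.5, -3.5, 2.5, 1.5],
]
global_clusters = [[0, 1, 2, 3], [4, 5, 6, 7]]  # assumed first-stage output

structure = build_structure(columns, global_clusters)
n_components = sum(len(g["sub_clusters"]) for g in structure)
print(structure)
print(len(columns), "attributes reduced to", n_components, "components")
```

Each sub-cluster would be represented by a single component in the combined structure, so the number of components is smaller than the number of original attributes, which is the sense in which the sketch reduces dimensionality.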
Specification