Clustering mixed attribute patterns
First Claim
1. A method performed by a computer for clustering data points in a data set, the data set being arranged as a matrix having n objects and m attributes, the method comprising the steps of:
- converting each categorical attribute of the data set to a 1-of-p representation of the categorical attribute;
forming a converted data set A based on the data set and the 1-of-p representation for each categorical attribute;
compressing the converted data set A to obtain q basis vectors, with q being defined to be at least m+1;
projecting the converted data set onto the q basis vectors to form a data matrix having at least one vector, each vector having q dimensions; and
performing a clustering technique on the data matrix having vectors having q dimensions.
2 Assignments
0 Petitions
Accused Products
Abstract
A technique for clustering data points in a data set that is arranged as a matrix having n objects and m attributes. Each categorical attribute of the data set is converted to a 1-of-p representation of the categorical attribute. A converted data set A is formed based on the data set and the 1-of-p representation for each categorical attribute. The converted data set A is compressed using, for example, a Goal Directed Projection compression technique or a Singular Value Decomposition compression technique, to obtain q basis vectors, with q being defined to be at least m+1. The transformed data set is projected onto the q basis vectors to form a data matrix having at least one vector, with each vector having q dimensions. Lastly, a clustering technique is performed on the data matrix having vectors having q dimensions.
109 Citations
30 Claims
-
1. A method performed by a computer for clustering data points in a data set, the data set being arranged as a matrix having n objects and m attributes, the method comprising the steps of:
-
converting each categorical attribute of the data set to a 1-of-p representation of the categorical attribute;
forming a converted data set A based on the data set and the 1-of-p representation for each categorical attribute;
compressing the converted data set A to obtain q basis vectors, with q being defined to be at least m+1;
projecting the converted data set onto the q basis vectors to form a data matrix having at least one vector, each vector having q dimensions; and
performing a clustering technique on the data matrix having vectors having q dimensions. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
partitioning the objects into k cohesive groups, with k being less than m; and
computing a distance from each object to a centroid of each group.
-
-
5. The method according to claim 4, wherein the step of partitioning uses a k-means technique for clustering the data points of the converted data set.
-
6. The method according to claim 5, wherein the k-means techique uses one of a Euclidean metric, a city-block metric and a cosine similarity metric for a distance measure.
-
7. The method according to claim 5, wherein the step of clustering uses an expectation maximization clustering technique.
-
8. The method according to claim 5, wherein the step of partitioning uses a sample of the converted data set.
-
9. The method according to claim 5, wherein the step of computing the distance from each data point to the centroid of each group uses a different distance measure for each prototype vector.
-
10. The method according to claim 1, wherein the step of compressing the converted data set includes the steps of:
-
subtracting a mean of the converted data set from each object of the converted data set;
partitioning the objects into k cohesive groups, with k being less than m; and
computing a distance from each object to a centroid of each group.
-
-
11. The method according to claim 1, wherein the data set includes at least one attribute having a categorical portion and a corresponding real portion,
wherein the step of converting each categorical attribute includes the steps of: -
separating the categorical portion of each respective attribute from the corresponding real portion of the attribute, and converting the categorical portion of each attribute to a 1-of-p representation, wherein the step of compressing the converted data set A compresses the converted categorical portion of each attribute, the method further comprising the step of combining each vector having q dimensions with the associated real portion of each attribute before the step of projecting the converted data set, and wherein the step of performing the clustering technique is performed on a data matrix resulting from the step of combining each vector having q dimensions with the corresponding real portion of each attribute.
-
-
12. The method according to claim 1, wherein the data set includes at least one attribute having a categorical portion and a corresponding real portion,
wherein the step of converting each categorical attribute includes the steps of: -
separating the categorical portion of each respective attribute from the corresponding real portion of the attribute, converting the categorical portion of each respective attribute to a 1-of-p representation, and combining the 1-of-p representation of the categorical portion and the corresponding real portion of each respective attribute, and wherein the step of compressing the converted data set A compresses the combined categorical 1-of-p representation of the categorical portion and the corresponding real portion of each respective attribute.
-
-
13. The method according to claim 1, further comprising a step of computer visualizing the data points of the clustered data matrix set based on selected attributes of the data set.
-
14. The method according to claim 1, further comprising a step of computer mining the data points of the clustered data matrix based on selected attributes of the data set.
-
15. The method according to claim 1, further comprising a step of computer summarizing the clustered data matrix based on selected attributes of the data set.
-
16. A program storage device comprising:
-
a storage area; and
information stored in the storage area, the information being readable by a machine, and tangibly embodying a program of instructions executable by the machine for performing method steps comprising;
converting each categorical attribute of a data set to a 1-of-p representation of the categorical attribute, the data set being arranged as a matrix having n objects and m attributes;
forming a converted data set A based on the data set and the 1-of-p representation for each categorical attribute;
compressing the converted data set A to obtain q basis vectors, with q being defined to be at least m+1;
projecting the converted data set onto the q basis vectors to form a data matrix having at least one vector, each vector having q dimensions; and
performing a clustering technique on the data matrix having vectors having q dimensions. - View Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
partitioning the objects into k cohesive groups, with k being less than m; and
computing a distance from each object to a centroid of each group.
-
-
20. The program storage device according to claim 19, wherein the step of partitioning uses a k-means technique for clustering the data points of the converted data set.
-
21. The program storage device according to claim 20, wherein the k-means technique uses one of a Euclidean metric, a city-block metric and a cosine similarity metric for a distance measure.
-
22. The program storage device according to claim 20, wherein the step of clustering uses an expectation maximization clustering technique.
-
23. The program storage device according to claim 20, wherein the step of partitioning uses a sample of the converted data set.
-
24. The program storage device according to claim 20, wherein the step of computing the distance from each data point to the centroid of each group uses a different distance measure for each prototype vector.
-
25. The program storage according to claim 16, wherein the step of compressing the converted data set includes the steps of:
-
subtracting a mean of the converted data set from each object of the converted data set;
partitioning the objects into k cohesive groups, with k being less than m; and
computing a distance from each object to a centroid of each group.
-
-
26. The program storage device according to claim 16, wherein the data set includes at least one attribute having a categorical portion and a corresponding real portion,
wherein the step of converting each categorical attribute includes the steps of: -
separating the categorical portion of each respective attribute from the corresponding real portion of the attribute, and converting the categorical portion of each attribute to a 1-of-p representation, wherein the step of compressing the converted data set A compresses the converted categorical portion of each attribute, the method further comprising the step of combining each vector having p dimensions with the associated real portion of each attribute before the step of projecting the transformed data set, and wherein the step of performing the clustering technique is performed on a data matrix resulting from the step of combining each vector having q dimensions with the corresponding real portion of each attribute.
-
-
27. The program storage device according to claim 16, wherein the data set includes at least one attribute having a categorical portion and a corresponding real portion,
wherein the step of converting each categorical attribute includes the steps of: -
separating the categorical portion of each respective attribute from the corresponding real portion of the attribute, converting the categorical portion of each respective attribute to a 1-of-p representation, and combining the 1-of-p representation of the categorical portion and the corresponding real portion of each respective attribute, and wherein the step of compressing the converted data set A compresses the combined categorical 1-of-p representation of the categorical portion and the corresponding real portion of each respective attribute.
-
-
28. The program storage device according to claim 16, further comprising a step of computer visualizing the data points of the data set based on selected attributes of the data set.
-
29. The program storage device according to claim 16, further comprising a step of computer mining the data points of the clustered data matrix based on selected attributes of the data set.
-
30. The program storage device according to claim 16, further comprising a step of computer summarizing the clustered data matrix based on selected attributes of the data set.
Specification