Method and apparatus for scalable probabilistic clustering using decision trees
First Claim
1. A business method for creating an apparatus for determining a corporate strategy, the method comprising:
retrieving data values for attributes from a set of data;
computing pair-wise influences among attributes based on a distribution of all remaining attributes using the retrieved data values;
selecting an attribute on which to form a branch in a decision tree for clustering based on the computed pair-wise influences;
iterating through said selecting and said computing until a stopping condition is detected;
when the stopping condition is detected, identifying one or more clusters, wherein each cluster corresponds to a node in the decision tree, and wherein each cluster is comprised of a subset of the set of data as defined by a corresponding node in the decision tree; and
displaying the one or more clusters in a visual format to assist a user in setting corporate strategy.
Abstract
Some embodiments of the invention include methods for identifying clusters in a database, data warehouse, or data mart. The identified clusters can be meaningfully understood from a list of the attributes and corresponding attribute values for each cluster. Some embodiments of the invention include a method for scalable probabilistic clustering using a decision tree. Some embodiments of the invention perform linearly in the size of the set of data and require only a single access to the set of data. Some embodiments of the invention produce interpretable clusters that can be described in terms of a set of attributes and the attribute values for that set of attributes. In some embodiments, a cluster can be interpreted by reading the attributes and attribute values on the path from the root node of the decision tree to the node of the decision tree corresponding to the cluster. In some embodiments, no domain-specific distance function for the attributes is necessary. In some embodiments, a cluster is determined by identifying the attribute with the highest influence on the distribution of the other attributes. Each value assumed by the identified attribute corresponds to a cluster and to a node in the decision tree. In some embodiments, the CUBE operation is used to access the set of data a single time, and the result is used to compute the influences and other quantities.
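The clustering loop described in the abstract and in claim 1 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the names `mutual_information`, `influence`, and `cluster_tree`, the in-memory row representation, and the `max_depth`/`min_rows` stopping conditions are all assumptions made for the sketch (the patent instead contemplates a single pass via the CUBE operation).

```python
import math
from collections import Counter

def mutual_information(rows, i, j):
    """MI(x_i, x_j) estimated from attribute columns i and j of rows."""
    n = len(rows)
    p_i = Counter(r[i] for r in rows)
    p_j = Counter(r[j] for r in rows)
    p_ij = Counter((r[i], r[j]) for r in rows)
    return sum((c / n) * math.log((c / n) / ((p_i[y] / n) * (p_j[z] / n)))
               for (y, z), c in p_ij.items())

def influence(rows, i, num_attrs):
    """I(x_i): the sum of MI(x_i, x_j) over all other attributes x_j."""
    return sum(mutual_information(rows, i, j)
               for j in range(num_attrs) if j != i)

def cluster_tree(rows, attrs, max_depth=2, min_rows=2, depth=0):
    """Branch on the most influential attribute; each leaf is a cluster
    (the subset of rows defined by the attribute values on its path)."""
    if depth >= max_depth or len(rows) < min_rows or not attrs:
        return rows                      # stopping condition: leaf = cluster
    num_attrs = len(rows[0])
    best = max(attrs, key=lambda a: influence(rows, a, num_attrs))
    children = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        children[value] = cluster_tree(subset, [a for a in attrs if a != best],
                                       max_depth, min_rows, depth + 1)
    return (best, children)
```

Because each branch conditions the remaining data on one attribute value, a leaf cluster is fully described by the attribute/value pairs on its path, which is what makes the clusters interpretable.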
59 Claims
1. A business method for creating an apparatus for determining a corporate strategy, the method comprising:
-
retrieving data values for attributes from a set of data;
computing pair-wise influences among attributes based on a distribution of all remaining attributes using the retrieved data values;
selecting an attribute on which to form a branch in a decision tree for clustering based on the computed pair-wise influences;
iterating through said selecting and said computing until a stopping condition is detected;
when the stopping condition is detected, identifying one or more clusters, wherein each cluster corresponds to a node in the decision tree, and wherein each cluster is comprised of a subset of the set of data as defined by a corresponding node in the decision tree; and
displaying the one or more clusters in a visual format to assist a user in setting corporate strategy. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
-
-
5. The method of claim 1, further comprising:
eliminating one or more attributes to obtain a set of k attributes based on uniformity of distribution of values of the attribute.
-
6. The method of claim 5, wherein eliminating further comprises:
-
computing an entropy for each attribute; and
retaining in the set of attributes the k attributes with the lowest entropy.
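The entropy-based elimination of claims 5 and 6 reduces to ranking attributes by the Shannon entropy of their value distributions and retaining the k lowest. A minimal sketch, assuming Python and a dict of in-memory columns; the names `entropy` and `keep_k_lowest_entropy` are illustrative, not from the patent.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a single attribute's column of values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def keep_k_lowest_entropy(columns, k):
    """Retain the k attributes whose value distributions have the lowest
    entropy, eliminating the rest (per claims 5-6)."""
    return sorted(columns, key=lambda name: entropy(columns[name]))[:k]
```

A near-constant column has entropy near zero, while a uniformly distributed column has the maximum entropy log(m) for m distinct values, so this ranking eliminates the most uniformly distributed attributes first.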
-
-
7. The method of claim 1, wherein a user interface is provided.
-
8. The method of claim 7, further comprising:
-
displaying a list of attributes; and
receiving user input of a selection of attributes for which data is to be retrieved.
-
-
9. The method of claim 7, further comprising:
-
displaying filter data; and
receiving user input of a selection of attributes and attribute values used to filter the set of data before data is retrieved.
-
-
10. The method of claim 7, further comprising:
determining how many clusters are to be output based on user input.
-
11. The method of claim 7, further comprising:
determining whether to display the output in a chart based on user input.
-
12. The method of claim 7, further comprising:
eliminating one or more attributes based on user input.
-
13. The method of claim 7, further comprising:
outputting only clusters based on a number of data points in the cluster.
-
14. The method of claim 7, further comprising:
performing subsetting in response to user input.
-
15. The method of claim 1, wherein computing the pair-wise influences among attributes further comprises:
computing the mutual information of pairs of attributes in the set of attributes.
-
16. The method of claim 1, further comprising creating a subset from a node in the decision tree for clustering, wherein creating the subset comprises:
-
identifying a first attribute, the first attribute having a first highest influence;
selecting a second attribute in the set of attributes, the second attribute having a high mutual influence with respect to the first attribute;
computing a set of vectors for each value of the first attribute, wherein each vector is comprised of a probability of each value of the second attribute;
computing a Bhattacharyya distance between each pair of vectors; and
forming a subset if the Bhattacharyya distance between a pair of vectors exceeds a predetermined amount.
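The steps of claim 16 can be sketched directly: build one probability vector over the second attribute for each value of the first attribute, then compare the vectors with the Bhattacharyya distance. This is an illustrative sketch; the function names, the in-memory rows, and the log(0) floor are assumptions, not the patent's implementation.

```python
import math
from collections import Counter

def conditional_vectors(rows, first, second):
    """For each value of attribute `first`, the probability vector of
    attribute `second` within the matching subset of rows."""
    domain = sorted({r[second] for r in rows})
    vectors = {}
    for v in sorted({r[first] for r in rows}):
        col = [r[second] for r in rows if r[first] == v]
        counts = Counter(col)
        vectors[v] = [counts[d] / len(col) for d in domain]
    return vectors

def bhattacharyya(p, q):
    """Bhattacharyya distance -ln(sum_i sqrt(p_i * q_i)); 0 for identical
    distributions. The small floor avoids log(0) on disjoint supports."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(max(bc, 1e-12))
```

Pairs of first-attribute values whose vectors are far apart under this distance induce different conditional distributions, which is the signal claim 16 uses to form a subset.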
-
-
17. The method of claim 1, wherein a path from a root node in the decision tree for clustering to a leaf node identifies a structured query language (SQL) query for retrieving a corresponding cluster from the set of data.
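Claim 17's mapping from a root-to-leaf path to a SQL query amounts to turning the attribute/value pairs on the path into a conjunctive WHERE clause. A minimal sketch; the table and column names are hypothetical, and real code should use parameterized queries rather than string interpolation.

```python
def cluster_query(table, path):
    """Translate the attribute/value pairs on a root-to-leaf path into the
    SQL query that retrieves the corresponding cluster from the set of data.
    Illustrative only: production code should bind values as parameters."""
    where = " AND ".join(f"{attr} = '{value}'" for attr, value in path)
    return f"SELECT * FROM {table} WHERE {where}"
```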
-
18. The method of claim 1, wherein influence is computed with
-
I(x_i) = Σ_{j≠i} MI(x_i, x_j), where

MI(x_i, x_j) = Σ_{y,z} P(x_i = y, x_j = z) · log [ P(x_i = y, x_j = z) / ( P(x_i = y) · P(x_j = z) ) ],

where y ∈ {attribute values of x_i} and z ∈ {attribute values of x_j}.
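The mutual-information term of the influence formula can be checked with a small worked example. The joint distribution below is chosen purely for illustration so that each term of the sum is easy to verify by hand.

```python
import math

# Joint distribution P(x_i = y, x_j = z) for two binary attributes,
# with both marginals uniform (values are illustrative, not from the patent).
P = {("y1", "z1"): 0.4, ("y1", "z2"): 0.1,
     ("y2", "z1"): 0.1, ("y2", "z2"): 0.4}
P_i = {"y1": 0.5, "y2": 0.5}   # marginal over x_i
P_j = {"z1": 0.5, "z2": 0.5}   # marginal over x_j

mi = sum(p * math.log(p / (P_i[y] * P_j[z])) for (y, z), p in P.items())
# 0.8 * ln(1.6) + 0.2 * ln(0.4) ≈ 0.1927
```

A positive value reflects the dependence between the two attributes; for an independent pair every term's log argument is 1 and the sum is zero.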
-
-
19. The method of claim 1, wherein each node corresponding to a cluster comprises a leaf node of the decision tree for clustering.
-
20. A system for creating an apparatus for determining a corporate strategy, the system comprising:
-
a computer, and a computer program executable by the computer, the computer program comprising computer instructions for:
retrieving data values for attributes from the set of data;
computing pair-wise influences among attributes based on a distribution of all remaining attributes using the retrieved data values;
selecting an attribute on which to form a branch in a decision tree for clustering based on the computed pair-wise influences;
iterating through said computing instructions and said selecting instructions until a stopping condition is detected;
when the stopping condition is detected, identifying one or more clusters, wherein each cluster corresponds to a node in the decision tree, and wherein each cluster is comprised of a subset of the set of data as defined by its corresponding node in the decision tree; and
displaying the one or more clusters in a visual format to assist a user in setting corporate strategy. - View Dependent Claims (21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38)
-
-
24. The system of claim 20, wherein the computer program further comprises:
eliminating one or more attributes to obtain a set of k attributes based on uniformity of distribution of values of the attribute.
-
25. The system of claim 24, wherein the computer program to perform eliminating further comprises:
-
computing an entropy for each attribute; and
retaining in the set of attributes the k attributes with the lowest entropy.
-
-
26. The system of claim 20, wherein a user interface is provided.
-
27. The system of claim 26, wherein the computer program further comprises:
-
displaying a list of attributes; and
receiving user input of a selection of attributes for which data is to be retrieved.
-
-
28. The system of claim 26, wherein the computer program further comprises:
-
displaying filter data; and
receiving user input of a selection of attributes and attribute values used to filter the set of data before data is retrieved.
-
-
29. The system of claim 26, wherein the computer program further comprises:
determining how many clusters are to be output based on user input.
-
30. The system of claim 26, wherein the computer program further comprises:
determining whether to display the output in a chart based on user input.
-
31. The system of claim 26, wherein the computer program further comprises:
eliminating one or more attributes based on user input.
-
32. The system of claim 26, wherein the computer program further comprises:
outputting only clusters based on a number of data points in the cluster.
-
33. The system of claim 26, wherein the computer program further comprises:
performing subsetting in response to user input.
-
34. The system of claim 20, wherein the computer program to compute the pair-wise influences among attributes further comprises:
computing mutual information of pairs of attributes in the set of attributes.
-
35. The system of claim 20, wherein the computer program further comprises creating a subset from a node in the decision tree for clustering, wherein the computer program to create the subset comprises:
-
identifying a first attribute, the first attribute having a first highest influence;
selecting a second attribute in the set of attributes, the second attribute having a high mutual influence with respect to the first attribute;
computing a set of vectors for each value of the first attribute, wherein each vector is comprised of a probability of each value of the second attribute;
computing a Bhattacharyya distance between each pair of vectors; and
forming a subset if the Bhattacharyya distance between a pair of vectors exceeds a predetermined amount.
-
-
36. The system of claim 20, wherein a path from a root node in the decision tree for clustering to a leaf node identifies a structured query language (SQL) query for retrieving a corresponding cluster from the set of data.
-
37. The system of claim 20, wherein the computer program computes influence with
-
I(x_i) = Σ_{j≠i} MI(x_i, x_j), where

MI(x_i, x_j) = Σ_{y,z} P(x_i = y, x_j = z) · log [ P(x_i = y, x_j = z) / ( P(x_i = y) · P(x_j = z) ) ],

where y ∈ {attribute values of x_i} and z ∈ {attribute values of x_j}.
-
-
38. The system of claim 20, wherein each node corresponding to a cluster comprises a leaf node of the decision tree for clustering.
-
39. A computer-readable storage medium encoded with software instructions adapted to be executed by a processor, the instructions comprising:
-
retrieving data values for attributes from the set of data;
computing pair-wise influences among attributes based on a distribution of all remaining attributes using the retrieved data values;
selecting an attribute on which to form a branch in a decision tree for clustering based on the computed pair-wise influences;
iterating through the computing and the selecting until a stopping condition is detected;
when the stopping condition is detected, identifying one or more clusters, wherein each cluster corresponds to a node in the decision tree for clustering, and wherein each cluster is comprised of a subset of the set of data as defined by its corresponding node in the decision tree for clustering; and
displaying the one or more clusters in a visual format to assist a user in setting corporate strategy. - View Dependent Claims (40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57)
-
-
43. The computer-readable storage medium of claim 39, wherein the instructions further comprise:
eliminating one or more attributes to obtain a set of k attributes based on uniformity of distribution of values of the attribute.
-
44. The computer-readable storage medium of claim 43, wherein the instructions further comprise:
-
computing an entropy for each attribute; and
retaining in the set of attributes the k attributes with the lowest entropy.
-
-
45. The computer-readable storage medium of claim 39, wherein a user interface is provided.
-
46. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
-
displaying a list of attributes; and
receiving user input of a selection of attributes for which data is to be retrieved.
-
-
47. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
-
displaying filter data; and
receiving user input of a selection of attributes and attribute values used to filter the set of data before data is retrieved.
-
-
48. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
determining how many clusters are to be output based on user input.
-
49. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
determining whether to display the output in a chart based on user input.
-
50. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
eliminating one or more attributes based on user input.
-
51. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
outputting only clusters based on a number of data points in the cluster.
-
52. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
performing subsetting in response to user input.
-
53. The computer-readable storage medium of claim 39, wherein the instructions further comprise:
computing the mutual information of pairs of attributes in the set of attributes.
-
54. The computer-readable storage medium of claim 39, wherein the instructions further comprise creating a subset from a node in the decision tree for clustering, and wherein the instructions to create the subset further comprise:
-
identifying a first attribute, the first attribute having a first highest influence;
selecting a second attribute in the set of attributes, the second attribute having a high mutual influence with respect to the first attribute;
computing a set of vectors for each value of the first attribute, wherein each vector is comprised of a probability of each value of the second attribute;
computing a Bhattacharyya distance between each pair of vectors; and
forming a subset if the Bhattacharyya distance between a pair of vectors exceeds a predetermined amount.
-
-
55. The computer-readable storage medium of claim 39, wherein a path from a root node in the decision tree for clustering to a leaf node identifies a structured query language (SQL) query for retrieving a corresponding cluster from the set of data.
-
56. The computer-readable storage medium of claim 39, wherein the instructions to compute influence include the equation
-
I(x_i) = Σ_{j≠i} MI(x_i, x_j), where

MI(x_i, x_j) = Σ_{y,z} P(x_i = y, x_j = z) · log [ P(x_i = y, x_j = z) / ( P(x_i = y) · P(x_j = z) ) ],

where y ∈ {attribute values of x_i} and z ∈ {attribute values of x_j}.
-
-
57. The computer-readable storage medium of claim 39, wherein each node corresponding to a cluster comprises a leaf node of the decision tree for clustering.
-
58. A system for creating a data filter for determining a corporate strategy, the system comprising:
-
(a) means for retrieving data values for attributes from the set of data;
(b) means for computing pair-wise influences among attributes based on a distribution of all remaining attributes using the retrieved data values;
(c) means for selecting an attribute on which to form a branch in a decision tree for clustering based on the computed pair-wise influences;
(d) means for iterating through acts (b) and (c) until a stopping condition is detected;
(e) means for, when the stopping condition is detected, identifying one or more clusters, wherein each cluster corresponds to a node in the decision tree for clustering, and wherein each cluster is comprised of a subset of the set of data as defined by its corresponding node in the decision tree for clustering; and
(f) means for displaying the one or more clusters in a visual format to assist a user in setting corporate strategy. - View Dependent Claims (59)
-
Specification