Method and apparatus for scalable probabilistic clustering using decision trees
First Claim
1. A business method for creating an apparatus for determining a corporate strategy, the method comprising:
retrieving data values for attributes from a set of data;
computing pair-wise influences among attributes based on a distribution of all remaining attributes using the retrieved data values;
selecting an attribute on which to form a branch in a decision tree for clustering based on the computed pair-wise influences;
iterating through said selecting and said computing until a stopping condition is detected;
when the stopping condition is detected, identifying one or more clusters, wherein each cluster corresponds to a node in the decision tree, and wherein each cluster is comprised of a subset of the set of data as defined by a corresponding node in the decision tree; and
displaying the one or more clusters in a visual format to assist a user in setting corporate strategy.
Abstract
Some embodiments of the invention include methods for identifying clusters in a database, data warehouse, or data mart. The identified clusters can be meaningfully understood from a list of the attributes and corresponding attribute values for each cluster. Some embodiments of the invention include a method for scalable probabilistic clustering using a decision tree. Some embodiments of the invention perform linearly in the size of the set of data and require only a single access to the set of data. Some embodiments of the invention produce interpretable clusters that can be described in terms of a set of attributes and the attribute values for that set of attributes. In some embodiments, a cluster can be interpreted by reading the attributes and attribute values on the path from the root node of the decision tree to the node of the decision tree corresponding to the cluster. In some embodiments, no domain-specific distance function for the attributes is necessary. In some embodiments, a cluster is determined by identifying the attribute with the highest influence on the distribution of the other attributes. Each value assumed by the identified attribute corresponds to a cluster and to a node in the decision tree. In some embodiments, the CUBE operation is used to access the set of data a single time, and the result is used to compute the influences and other quantities.
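The clustering loop described in the abstract and in claim 1 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the names `mutual_information`, `influence`, and `cluster_tree`, the in-memory row representation, and the `max_depth`/`min_rows` stopping conditions are all assumptions made for the sketch (the patent instead contemplates a single pass via the CUBE operation).

```python
import math
from collections import Counter

def mutual_information(rows, i, j):
    """MI(x_i, x_j) estimated from attribute columns i and j of rows."""
    n = len(rows)
    p_i = Counter(r[i] for r in rows)
    p_j = Counter(r[j] for r in rows)
    p_ij = Counter((r[i], r[j]) for r in rows)
    return sum((c / n) * math.log((c / n) / ((p_i[y] / n) * (p_j[z] / n)))
               for (y, z), c in p_ij.items())

def influence(rows, i, num_attrs):
    """I(x_i): the sum of MI(x_i, x_j) over all other attributes x_j."""
    return sum(mutual_information(rows, i, j)
               for j in range(num_attrs) if j != i)

def cluster_tree(rows, attrs, max_depth=2, min_rows=2, depth=0):
    """Branch on the most influential attribute; each leaf is a cluster
    (the subset of rows defined by the attribute values on its path)."""
    if depth >= max_depth or len(rows) < min_rows or not attrs:
        return rows                      # stopping condition: leaf = cluster
    num_attrs = len(rows[0])
    best = max(attrs, key=lambda a: influence(rows, a, num_attrs))
    children = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        children[value] = cluster_tree(subset, [a for a in attrs if a != best],
                                       max_depth, min_rows, depth + 1)
    return (best, children)
```

Because each branch conditions the remaining data on one attribute value, a leaf cluster is fully described by the attribute/value pairs on its path, which is what makes the clusters interpretable.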
59 Claims
1. A business method for creating an apparatus for determining a corporate strategy, the method comprising:
-
retrieving data values for attributes from a set of data;
computing pair-wise influences among attributes based on a distribution of all remaining attributes using the retrieved data values;
selecting an attribute on which to form a branch in a decision tree for clustering based on the computed pair-wise influences;
iterating through said selecting and said computing until a stopping condition is detected;
when the stopping condition is detected, identifying one or more clusters, wherein each cluster corresponds to a node in the decision tree, and wherein each cluster is comprised of a subset of the set of data as defined by a corresponding node in the decision tree; and
displaying the one or more clusters in a visual format to assist a user in setting corporate strategy. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
-
-
5. The method of claim 1, further comprising:
eliminating one or more attributes to obtain a set of k attributes based on uniformity of distribution of values of the attribute.
-
6. The method of claim 5, wherein eliminating further comprises:
-
computing an entropy for each attribute; and
retaining in the set of attributes the k attributes with the lowest entropy.
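The entropy-based elimination of claims 5 and 6 reduces to ranking attributes by the Shannon entropy of their value distributions and retaining the k lowest. A minimal sketch, assuming Python and a dict of in-memory columns; the names `entropy` and `keep_k_lowest_entropy` are illustrative, not from the patent.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a single attribute's column of values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def keep_k_lowest_entropy(columns, k):
    """Retain the k attributes whose value distributions have the lowest
    entropy, eliminating the rest (per claims 5-6)."""
    return sorted(columns, key=lambda name: entropy(columns[name]))[:k]
```

A near-constant column has entropy near zero, while a uniformly distributed column has the maximum entropy log(m) for m distinct values, so this ranking eliminates the most uniformly distributed attributes first.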
-
-
7. The method of claim 1, wherein a user interface is provided.
-
8. The method of claim 7, further comprising:
-
displaying a list of attributes; and
receiving user input of a selection of attributes for which data is to be retrieved.
-
-
9. The method of claim 7, further comprising:
-
displaying filter data; and
receiving user input of a selection of attributes and attribute values used to filter the set of data before data is retrieved.
-
-
10. The method of claim 7, further comprising:
determining how many clusters are to be output based on user input.
-
11. The method of claim 7, further comprising:
determining whether to display the output in a chart based on user input.
-
12. The method of claim 7, further comprising:
eliminating one or more attributes based on user input.
-
13. The method of claim 7, further comprising:
outputting only clusters based on a number of data points in the cluster.
-
14. The method of claim 7, further comprising:
performing subsetting in response to user input.
-
15. The method of claim 1, wherein computing the pair-wise influences among attributes further comprises:
computing the mutual information of pairs of attributes in the set of attributes.
-
16. The method of claim 1, further comprising creating a subset from a node in the decision tree for clustering, wherein creating the subset comprises:
-
identifying a first attribute, the first attribute having a first highest influence;
selecting a second attribute in the set of attributes, the second attribute having a high mutual influence with respect to the first attribute;
computing a set of vectors for each value of the first attribute, wherein each vector is comprised of a probability of each value of the second attribute;
computing a Bhattacharyya distance between each pair of vectors; and
forming a subset if the Bhattacharyya distance between a pair of vectors exceeds a predetermined amount.
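The steps of claim 16 can be sketched directly: build one probability vector over the second attribute for each value of the first attribute, then compare the vectors with the Bhattacharyya distance. This is an illustrative sketch; the function names, the in-memory rows, and the log(0) floor are assumptions, not the patent's implementation.

```python
import math
from collections import Counter

def conditional_vectors(rows, first, second):
    """For each value of attribute `first`, the probability vector of
    attribute `second` within the matching subset of rows."""
    domain = sorted({r[second] for r in rows})
    vectors = {}
    for v in sorted({r[first] for r in rows}):
        col = [r[second] for r in rows if r[first] == v]
        counts = Counter(col)
        vectors[v] = [counts[d] / len(col) for d in domain]
    return vectors

def bhattacharyya(p, q):
    """Bhattacharyya distance -ln(sum_i sqrt(p_i * q_i)); 0 for identical
    distributions. The small floor avoids log(0) on disjoint supports."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(max(bc, 1e-12))
```

Pairs of first-attribute values whose vectors are far apart under this distance induce different conditional distributions, which is the signal claim 16 uses to form a subset.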
-
-
17. The method of claim 1, wherein a path from a root node in the decision tree for clustering to a leaf node identifies a structured query language (SQL) query for retrieving a corresponding cluster from the set of data.
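Claim 17's mapping from a root-to-leaf path to a SQL query amounts to turning the attribute/value pairs on the path into a conjunctive WHERE clause. A minimal sketch; the table and column names are hypothetical, and real code should use parameterized queries rather than string interpolation.

```python
def cluster_query(table, path):
    """Translate the attribute/value pairs on a root-to-leaf path into the
    SQL query that retrieves the corresponding cluster from the set of data.
    Illustrative only: production code should bind values as parameters."""
    where = " AND ".join(f"{attr} = '{value}'" for attr, value in path)
    return f"SELECT * FROM {table} WHERE {where}"
```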
-
18. The method of claim 1, wherein influence is computed with
-
I(x_i) = Σ_{j≠i} MI(x_i, x_j), where

MI(x_i, x_j) = Σ_{y,z} P(x_i = y, x_j = z) · log [ P(x_i = y, x_j = z) / ( P(x_i = y) · P(x_j = z) ) ],

where y ∈ {attribute values of x_i} and z ∈ {attribute values of x_j}.
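The mutual-information term of the influence formula can be checked with a small worked example. The joint distribution below is chosen purely for illustration so that each term of the sum is easy to verify by hand.

```python
import math

# Joint distribution P(x_i = y, x_j = z) for two binary attributes,
# with both marginals uniform (values are illustrative, not from the patent).
P = {("y1", "z1"): 0.4, ("y1", "z2"): 0.1,
     ("y2", "z1"): 0.1, ("y2", "z2"): 0.4}
P_i = {"y1": 0.5, "y2": 0.5}   # marginal over x_i
P_j = {"z1": 0.5, "z2": 0.5}   # marginal over x_j

mi = sum(p * math.log(p / (P_i[y] * P_j[z])) for (y, z), p in P.items())
# 0.8 * ln(1.6) + 0.2 * ln(0.4) ≈ 0.1927
```

A positive value reflects the dependence between the two attributes; for an independent pair every term's log argument is 1 and the sum is zero.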
-
-
19. The method of claim 1, wherein each node corresponding to a cluster comprises a leaf node of the decision tree for clustering.
-
20. A system for creating an apparatus for determining a corporate strategy, the system comprising:
-
a computer, and a computer program executable by the computer, the computer program comprising computer instructions for:
retrieving data values for attributes from the set of data;
computing pair-wise influences among attributes based on a distribution of all remaining attributes using the retrieved data values;
selecting an attribute on which to form a branch in a decision tree for clustering based on the computed pair-wise influences;
iterating through said computing instructions and said selecting instructions until a stopping condition is detected;
when the stopping condition is detected, identifying one or more clusters, wherein each cluster corresponds to a node in the decision tree, and wherein each cluster is comprised of a subset of the set of data as defined by its corresponding node in the decision tree; and
displaying the one or more clusters in a visual format to assist a user in setting corporate strategy. - View Dependent Claims (21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38)
-
-
24. The system of claim 20, wherein the computer program further comprises:
eliminating one or more attributes to obtain a set of k attributes based on uniformity of distribution of values of the attribute.
-
25. The system of claim 24, wherein the computer program to perform eliminating further comprises:
-
computing an entropy for each attribute; and
retaining in the set of attributes the k attributes with the lowest entropy.
-
-
26. The system of claim 20, wherein a user interface is provided.
-
27. The system of claim 26, wherein the computer program further comprises:
-
displaying a list of attributes; and
receiving user input of a selection of attributes for which data is to be retrieved.
-
-
28. The system of claim 26, wherein the computer program further comprises:
-
displaying filter data; and
receiving user input of a selection of attributes and attribute values used to filter the set of data before data is retrieved.
-
-
29. The system of claim 26, wherein the computer program further comprises:
determining how many clusters are to be output based on user input.
-
30. The system of claim 26, wherein the computer program further comprises:
determining whether to display the output in a chart based on user input.
-
31. The system of claim 26, wherein the computer program further comprises:
eliminating one or more attributes based on user input.
-
32. The system of claim 26, wherein the computer program further comprises:
outputting only clusters based on a number of data points in the cluster.
-
33. The system of claim 26, wherein the computer program further comprises:
performing subsetting in response to user input.
-
34. The system of claim 20, wherein the computer program to compute the pair-wise influences among attributes further comprises:
computing mutual information of pairs of attributes in the set of attributes.
-
35. The system of claim 20, wherein the computer program further comprises creating a subset from a node in the decision tree for clustering, wherein the computer program to create the subset comprises:
-
identifying a first attribute, the first attribute having a first highest influence;
selecting a second attribute in the set of attributes, the second attribute having a high mutual influence with respect to the first attribute;
computing a set of vectors for each value of the first attribute, wherein each vector is comprised of a probability of each value of the second attribute;
computing a Bhattacharyya distance between each pair of vectors; and
forming a subset if the Bhattacharyya distance between a pair of vectors exceeds a predetermined amount.
-
-
36. The system of claim 20, wherein a path from a root node in the decision tree for clustering to a leaf node identifies a structured query language (SQL) query for retrieving a corresponding cluster from the set of data.
-
37. The system of claim 20, wherein the computer program computes influence with
-
I(x_i) = Σ_{j≠i} MI(x_i, x_j), where

MI(x_i, x_j) = Σ_{y,z} P(x_i = y, x_j = z) · log [ P(x_i = y, x_j = z) / ( P(x_i = y) · P(x_j = z) ) ],

where y ∈ {attribute values of x_i} and z ∈ {attribute values of x_j}.
-
-
38. The system of claim 20, wherein each node corresponding to a cluster comprises a leaf node of the decision tree for clustering.
-
39. A computer-readable storage medium encoded with software instructions adapted to be executed by a processor, the instructions comprising:
-
retrieving data values for attributes from the set of data;
computing pair-wise influences among attributes based on a distribution of all remaining attributes using the retrieved data values;
selecting an attribute on which to form a branch in a decision tree for clustering based on the computed pair-wise influences;
iterating through the computing and the selecting until a stopping condition is detected;
when the stopping condition is detected, identifying one or more clusters, wherein each cluster corresponds to a node in the decision tree for clustering, and wherein each cluster is comprised of a subset of the set of data as defined by its corresponding node in the decision tree for clustering; and
displaying the one or more clusters in a visual format to assist a user in setting corporate strategy. - View Dependent Claims (40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57)
-
-
43. The computer-readable storage medium of claim 39, wherein the instructions further comprise:
eliminating one or more attributes to obtain a set of k attributes based on uniformity of distribution of values of the attribute.
-
44. The computer-readable storage medium of claim 43, wherein the instructions further comprise:
-
computing an entropy for each attribute; and
retaining in the set of attributes the k attributes with the lowest entropy.
-
-
45. The computer-readable storage medium of claim 39, wherein a user interface is provided.
-
46. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
-
displaying a list of attributes; and
receiving user input of a selection of attributes for which data is to be retrieved.
-
-
47. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
-
displaying filter data; and
receiving user input of a selection of attributes and attribute values used to filter the set of data before data is retrieved.
-
-
48. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
determining how many clusters are to be output based on user input.
-
49. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
determining whether to display the output in a chart based on user input.
-
50. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
eliminating one or more attributes based on user input.
-
51. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
outputting only clusters based on a number of data points in the cluster.
-
52. The computer-readable storage medium of claim 45, wherein the instructions further comprise:
performing subsetting in response to user input.
-
53. The computer-readable storage medium of claim 39, wherein the instructions further comprise:
computing the mutual information of pairs of attributes in the set of attributes.
-
54. The computer-readable storage medium of claim 39, wherein the instructions further comprise creating a subset from a node in the decision tree for clustering, and wherein the instructions to create the subset further comprise:
-
identifying a first attribute, the first attribute having a first highest influence;
selecting a second attribute in the set of attributes, the second attribute having a high mutual influence with respect to the first attribute;
computing a set of vectors for each value of the first attribute, wherein each vector is comprised of a probability of each value of the second attribute;
computing a Bhattacharyya distance between each pair of vectors; and
forming a subset if the Bhattacharyya distance between a pair of vectors exceeds a predetermined amount.
-
-
55. The computer-readable storage medium of claim 39, wherein a path from a root node in the decision tree for clustering to a leaf node identifies a structured query language (SQL) query for retrieving a corresponding cluster from the set of data.
-
56. The computer-readable storage medium of claim 39, wherein the instructions to compute influence include the equation
-
I(x_i) = Σ_{j≠i} MI(x_i, x_j), where

MI(x_i, x_j) = Σ_{y,z} P(x_i = y, x_j = z) · log [ P(x_i = y, x_j = z) / ( P(x_i = y) · P(x_j = z) ) ],

where y ∈ {attribute values of x_i} and z ∈ {attribute values of x_j}.
-
-
57. The computer-readable storage medium of claim 39, wherein each node corresponding to a cluster comprises a leaf node of the decision tree for clustering.
-
58. A system for creating a data filter for determining a corporate strategy, the system comprising:
-
(a) means for retrieving data values for attributes from the set of data;
(b) means for computing pair-wise influences among attributes based on a distribution of all remaining attributes using the retrieved data values;
(c) means for selecting an attribute on which to form a branch in a decision tree for clustering based on the computed pair-wise influences;
(d) means for iterating through acts (b) and (c) until a stopping condition is detected;
(e) means for, when the stopping condition is detected, identifying one or more clusters, wherein each cluster corresponds to a node in the decision tree for clustering, and wherein each cluster is comprised of a subset of the set of data as defined by its corresponding node in the decision tree for clustering; and
(f) means for displaying the one or more clusters in a visual format to assist a user in setting corporate strategy. - View Dependent Claims (59)
-
Specification