Method and apparatus for clustering hierarchically related information
First Claim
1. ) A method for clustering hierarchically related information, where the information can be represented as a set of nodes wherein each node is associated with a portion of the information and the nodes are connected by directed edges wherein a parent node is the source of an- incoming edge, a child node is the target of an outgoing edge, sibling nodes have the same parent node, and a path is a sequence of nodes, comprising:
- a) associating a cluster having a size with each node, b) identifying related node pairs, c) determining a node distance for each node pair, d) determining if there is more than one cluster remaining, e) responsive to the determination in step d, selecting a pair of clusters to combine and determining their cluster distance, f) determining whether the cluster distance of the selected cluster pair is less than a maximum distance and whether the sum of the sizes of the two clusters of the selected cluster pair is less than a specified halting cluster size, g) responsive to the determination in step f, combining the selected cluster pair into a combined cluster and associating the combined cluster with each of the nodes previously associated with each of the individual clusters in the selected cluster pair, and h) repeating steps e, f, and g until the selected cluster pair has a cluster distance greater than a maximum distance or the sum of the sizes of the two clusters of the cluster pair is greater than a specified halting cluster size.
2 Assignments
0 Petitions
Accused Products
Abstract
A method for partitioning a tree-structured discussion or other tree structured collections of texts into clusters dealing with identifiable subtopics, if such subtopics exist, or into manageable partitions if not. Each document is represented by a vector and is initially placed in a cluster containing only that-document. Then a sequence of cluster combinations is performed, at each step combining the most similar two clusters, where the most similar two clusters are the clusters related by the most similar pair of document vectors, into a new cluster. The process can be halted before all clusters are combined based on application-specific criteria.
-
Citations
15 Claims
-
1. ) A method for clustering hierarchically related information, where the information can be represented as a set of nodes wherein each node is associated with a portion of the information and the nodes are connected by directed edges wherein a parent node is the source of an- incoming edge, a child node is the target of an outgoing edge, sibling nodes have the same parent node, and a path is a sequence of nodes, comprising:
-
a) associating a cluster having a size with each node, b) identifying related node pairs, c) determining a node distance for each node pair, d) determining if there is more than one cluster remaining, e) responsive to the determination in step d, selecting a pair of clusters to combine and determining their cluster distance, f) determining whether the cluster distance of the selected cluster pair is less than a maximum distance and whether the sum of the sizes of the two clusters of the selected cluster pair is less than a specified halting cluster size, g) responsive to the determination in step f, combining the selected cluster pair into a combined cluster and associating the combined cluster with each of the nodes previously associated with each of the individual clusters in the selected cluster pair, and h) repeating steps e, f, and g until the selected cluster pair has a cluster distance greater than a maximum distance or the sum of the sizes of the two clusters of the cluster pair is greater than a specified halting cluster size. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
-
Specification