Method and apparatus for clustering hierarchically related information

US 20040117448A1
Filed: 12/16/2002
Published: 06/17/2004
Est. Priority Date: 12/16/2002
Status: Active Grant

Abstract

A method for partitioning a tree-structured discussion or other tree-structured collections of texts into clusters dealing with identifiable subtopics, if such subtopics exist, or into manageable partitions if not. Each document is represented by a vector and is initially placed in a cluster containing only that document. Then a sequence of cluster combinations is performed, at each step combining the most similar two clusters, where the most similar two clusters are the clusters related by the most similar pair of document vectors, into a new cluster. The process can be halted before all clusters are combined based on application-specific criteria.
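
The combination loop described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `cluster_tree`, its parameters, and the hand-written distances are all assumptions made for the example. It selects the closest related node pair whose members lie in different clusters (the selection rule of claim 11) and halts at the first selected pair that fails the distance or size test of claim 1.

```python
from typing import Dict, List, Tuple


def cluster_tree(
    nodes: List[str],
    node_distance: Dict[Tuple[str, str], float],
    max_distance: float,
    halting_size: int,
) -> Dict[str, int]:
    """Return a cluster id for every node (hypothetical helper).

    node_distance holds distances only for related node pairs,
    for example parent-child pairs of a discussion thread.
    """
    # Step a: every node starts in its own cluster of size 1.
    cluster_of = {n: i for i, n in enumerate(nodes)}
    members = {i: [n] for i, n in enumerate(nodes)}

    # Steps b and c are supplied by the caller; visit pairs closest first.
    for a, b in sorted(node_distance, key=node_distance.get):
        if len(members) <= 1:  # step d: nothing left to combine
            break
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:  # both nodes already share a cluster
            continue
        dist = node_distance[(a, b)]  # step e: cluster distance of the pair
        size = len(members[ca]) + len(members[cb])
        if dist >= max_distance or size >= halting_size:  # step f
            break  # step h: halt at the first failing pair
        # Step g: combine and re-associate the absorbed nodes.
        for n in members[cb]:
            cluster_of[n] = ca
        members[ca].extend(members.pop(cb))
    return cluster_of


# A four-message thread: "a" and "b" reply to "root", "c" replies to "a".
thread_nodes = ["root", "a", "b", "c"]
thread_distances = {("root", "a"): 0.2, ("root", "b"): 0.9, ("a", "c"): 0.3}
result = cluster_tree(thread_nodes, thread_distances, max_distance=0.5, halting_size=4)
print(result)
```

With these made-up distances, "root", "a", and "c" end up in one cluster while "b" stays alone, because its only link exceeds the maximum distance. Visiting the pairs in sorted order works here only because the claim 11 rule uses fixed node distances; the centroid-based rules of claims 8, 10, and 15 would need cluster distances recomputed after each combination.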

15 Claims

1. ) A method for clustering hierarchically related information, where the information can be represented as a set of nodes wherein each node is associated with a portion of the information and the nodes are connected by directed edges wherein a parent node is the source of an incoming edge, a child node is the target of an outgoing edge, sibling nodes have the same parent node, and a path is a sequence of nodes, comprising:
- a) associating a cluster having a size with each node, b) identifying related node pairs, c) determining a node distance for each node pair, d) determining if there is more than one cluster remaining, e) responsive to the determination in step d, selecting a pair of clusters to combine and determining their cluster distance, f) determining whether the cluster distance of the selected cluster pair is less than a maximum distance and whether the sum of the sizes of the two clusters of the selected cluster pair is less than a specified halting cluster size, g) responsive to the determination in step f, combining the selected cluster pair into a combined cluster and associating the combined cluster with each of the nodes previously associated with each of the individual clusters in the selected cluster pair, and h) repeating steps e, f, and g until the selected cluster pair has a cluster distance greater than a maximum distance or the sum of the sizes of the two clusters of the cluster pair is greater than a specified halting cluster size.
  - 2. ) The method in claim 1 wherein the information associated with each node represents text of an electronic mail message.
  - 3. ) The method of claim 2 further comprising:
    - modifying the information by normalizing the quoting styles of each electronic mail message prior to the step of determining a node distance for each node pair.
  - 4. ) The method of claim 1 wherein determining a node distance for each node pair comprises a) determining a word vector for each node based on the portion of the information represented by that node, and b) determining the node distance based on the distance between the word vectors.
  - 5. ) The method of claim 1 wherein identifying related node pairs comprises identifying nodes related along a path.
  - 6. ) The method of claim 1 wherein identifying related node pairs comprises identifying node pairs related in a parent child relation.
  - 7. ) The method of claim 1 wherein identifying related node pairs comprises identifying node pairs related in a sibling relation.
  - 8. ) The method of claim 1 wherein selecting a pair of clusters to combine and determining their cluster distance comprises:
    - a) determining a cluster distance for each pair of clusters wherein a first cluster is associated with the first node of a node pair and a second cluster is associated with the second node of a node pair, from the node distances of all node pairs linking each cluster in the pair of clusters, and a centroid distance between the two clusters in the pair of clusters, b) determining which pair of clusters has the smallest cluster distance, and c) selecting the pair of clusters having the smallest distance.
  - 9. ) The method of claim 8 wherein computing a centroid distance between two clusters comprises finding a lexical centroid for each cluster and then determining a distance between the two lexical centroids.
  - 10. ) The method of claim 1 wherein selecting a pair of clusters to combine and determining their cluster distance comprises:
    - a) determining a cluster distance for each pair of clusters wherein a first cluster is associated with the first node of a node pair and a second cluster is associated with the second node of a node pair, from the minimum node distance of all node pairs linking each cluster in the pair of clusters, and a centroid distance between the two clusters in the pair of clusters, b) determining which pair of clusters has the smallest cluster distance, and c) selecting the pair of clusters having the smallest distance.
  - 11. ) The method of claim 1 wherein selecting a pair of clusters to combine and determining their cluster distance comprises a) selecting the node pair with the smallest node distance whose members are in two different clusters, b) selecting the two clusters associated with the nodes of the selected node pair, and setting the cluster distance as the node distance of the selected node pair.
  - 12. ) The method of claim 1 further comprising after step h:
    - i) determining whether the selected cluster pair has a cluster distance less than the maximum distance and a cluster size less than a specified maximum cluster size, and j) responsive to the determination in step i, combining the selected cluster pair into a combined cluster and associating the combined cluster with each of the nodes previously associated with each of the individual clusters in the selected cluster pair.
  - 13. ) The method of claim 1 further comprising after step h:
    - a) designating clusters larger than a given size as primary clusters and clusters smaller than a given size as secondary clusters, and b) combining adjacent secondary clusters.
  - 14. ) The method of claim 13 further comprising combining any secondary clusters smaller than a given size with an adjacent primary cluster.
  - 15. ) The method of claim 1 wherein selecting a pair of clusters to combine and determining their cluster distance comprises:
    - a) determining a cluster distance for each pair of clusters wherein a first cluster is associated with the first node of a node pair and a second cluster is associated with the second node of a node pair, from the largest node distance of all node pairs linking each cluster in the pair of clusters, and a centroid distance between the two clusters in the pair of clusters, b) determining which pair of clusters has the smallest cluster distance, and c) selecting the pair of clusters having the smallest distance.
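
Claims 4, 8 to 10, and 15 derive a cluster distance from two ingredients: the node distances of the node pairs linking two clusters, and a distance between the clusters' lexical centroids. The sketch below shows one way these pieces could fit together. The claims do not say how the two ingredients are combined, so the weighted average and its `alpha` parameter are assumptions, as are the bag-of-words vectors, the cosine distance, and every function name.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List


def word_vector(text: str) -> Counter:
    """Bag-of-words count vector for one node's text (assumed representation)."""
    return Counter(text.lower().split())


def cosine_distance(u: Counter, v: Counter) -> float:
    """1 minus cosine similarity; 1.0 when either vector is empty."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def lexical_centroid(vectors: List[Counter]) -> Counter:
    """Sum of member vectors; cosine distance ignores scale, so the sum
    points in the same direction as the mean."""
    total: Counter = Counter()
    for v in vectors:
        total.update(v)
    return total


def cluster_distance(
    link_distances: Iterable[float],
    vectors_a: List[Counter],
    vectors_b: List[Counter],
    reduce: Callable[[Iterable[float]], float] = min,
    alpha: float = 0.5,
) -> float:
    """Blend the link distance with the centroid distance.

    reduce=min mirrors claim 10 (minimum node distance) and reduce=max
    mirrors claim 15 (largest node distance). The alpha-weighted average
    is an assumption, not something the claims specify.
    """
    link = reduce(link_distances)
    centroid = cosine_distance(lexical_centroid(vectors_a), lexical_centroid(vectors_b))
    return alpha * link + (1.0 - alpha) * centroid


cluster_a = [word_vector("printer driver install"), word_vector("driver install error")]
cluster_b = [word_vector("printer error after install")]
links = [cosine_distance(cluster_a[0], cluster_b[0]), cosine_distance(cluster_a[1], cluster_b[0])]
print(cluster_distance(links, cluster_a, cluster_b, reduce=min))
print(cluster_distance(links, cluster_a, cluster_b, reduce=max))
```

Passing `min` or `max` as `reduce` switches between the minimum-link and largest-link variants without touching the centroid term.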

Current Assignee: Palo Alto Research Center, Inc. (Xerox Holdings Corp.)
Original Assignee: Palo Alto Research Center, Inc. (Xerox Holdings Corp.)
Inventors: Chen, Francine R.; Newman, Paula S.
Granted Patent: US 7,007,069 B2
US Class Current: 709/206
CPC Class Codes: G06F 16/353 into predefined classes
