Information exploration systems and method
Abstract
Disclosed information exploration system and method embodiments operate on a document set to determine a document cluster hierarchy. An exclusionary phrase index is determined for each cluster, and representative phrases are selected from the indexes. The selection process may enforce pathwise uniqueness and balanced sub-cluster representation. The representative phrases may be used as cluster labels in an interactive information exploration interface.
26 Claims
1. A computer-implemented information exploration method that comprises:
processing a set of documents with a computer to identify a hierarchy of clusters of documents, wherein the processing comprises;
  calculating a pseudo-document vector for each document in the set of documents; and
  computing the hierarchy of clusters from the pseudo-document vectors; and
selecting with a computer one or more phrases from the set of documents as representative phrases for each cluster from a root cluster to leaf clusters in the hierarchy of clusters, wherein the selecting comprises;
  constructing a phrase-to-leaf node index for the hierarchy of clusters, wherein the phrase-to-leaf node index includes a list of phrases that occur in at least a predetermined number of documents of at least one of the leaf clusters, and for each phrase in the list of phrases, the phrase-to-leaf node index identifies each of the leaf clusters containing the phrase; and
  wherein constructing the phrase-to-leaf node index further comprises;
    constructing a suffix tree for each leaf cluster in the hierarchy of clusters;
    constructing a phrase index for each leaf cluster that includes a list of phrases shared by at least the predetermined number of documents in the leaf cluster; and
    combining the phrase indices of the leaf clusters in the hierarchy of cluster to construct the phrase-to-leaf node index.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
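The index construction recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim recites a suffix tree per leaf cluster, whereas this sketch swaps in plain word n-gram enumeration to keep the example short, and it records a leaf for a phrase only when the phrase meets the document threshold in that leaf. The names `phrases_in`, `leaf_phrase_index`, `phrase_to_leaf_index`, `max_len` and `min_docs` are all hypothetical.

```python
from collections import defaultdict

def phrases_in(text, max_len=3):
    """Return the set of word n-grams (1..max_len words) in one document.
    Assumption: n-gram enumeration stands in for the claimed suffix tree."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }

def leaf_phrase_index(docs, min_docs):
    """Phrases shared by at least min_docs documents of one leaf cluster."""
    counts = defaultdict(int)
    for doc in docs:
        for phrase in phrases_in(doc):  # a set, so each document counts once
            counts[phrase] += 1
    return {p for p, c in counts.items() if c >= min_docs}

def phrase_to_leaf_index(leaves, min_docs=2):
    """Combine the per-leaf phrase indices into phrase -> set of leaf ids."""
    index = defaultdict(set)
    for leaf_id, docs in leaves.items():
        for phrase in leaf_phrase_index(docs, min_docs):
            index[phrase].add(leaf_id)
    return dict(index)

# Hypothetical leaf clusters, each a list of document texts.
leaves = {
    "A": ["oil price forecast", "crude oil price rises"],
    "B": ["oil spill cleanup", "spill cleanup costs"],
}
index = phrase_to_leaf_index(leaves)
```

In this toy data, "oil price" is indexed under leaf A only, and "oil" is not indexed under leaf B because it appears in just one of B's documents, below the assumed threshold of two.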
15. An information exploration system that comprises:
a display;
a user input device;
a memory that stores software; and
a processor coupled to the memory to execute the software, wherein the software configures the processor to interact with a user via the display and user input device, and wherein the software further configures the processor to;
  process a set of documents to identify a hierarchy of mutually exclusive clusters of documents;
  determine a phrase index for the hierarchy of clusters, wherein the phrase index includes a list of phrases that occur in at least a predetermined number of documents of at least one leaf clusters in the hierarchy of mutually exclusive clusters, and for each phrase in the list of phrases, the phrase index identifies each leaf cluster containing the phrase;
  select one or more phrases from the phrase index as representative phrases for each cluster in the hierarchy of clusters, wherein phrases of the phrase index that have already been selected as a representative phrase of a previous cluster in a given path in the hierarchy of clusters are removed from the phrase index for other clusters in the path; and
  display the representative phrases as cluster labels on the display.
Dependent claims: 16, 17, 18, 19
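The pathwise removal of already-selected phrases in claim 15 might look like the following Python sketch. It assumes each cluster already has a ranked list of candidate phrases (the ranking itself is not shown), and the names `label_clusters`, `tree`, `candidates` and `per_cluster` are hypothetical, not taken from the patent.

```python
def label_clusters(tree, candidates, root, per_cluster=1):
    """Walk the hierarchy top-down. A phrase chosen for an ancestor is
    unavailable to every descendant on the same root-to-leaf path."""
    labels = {}

    def visit(node, used_on_path):
        # Keep only candidates not already used higher up this path.
        available = [p for p in candidates.get(node, []) if p not in used_on_path]
        labels[node] = available[:per_cluster]
        for child in tree.get(node, []):
            visit(child, used_on_path | set(labels[node]))

    visit(root, frozenset())
    return labels

# Hypothetical hierarchy: parent id -> list of child ids.
tree = {"root": ["energy", "sports"], "energy": ["oil", "solar"]}
# Hypothetical candidate phrases per cluster, best first (assumed pre-scored).
candidates = {
    "root": ["news"],
    "energy": ["news", "energy"],
    "sports": ["news", "football"],
    "oil": ["energy", "oil price"],
    "solar": ["energy", "solar panel"],
}
labels = label_clusters(tree, candidates, "root")
```

Because "news" labels the root, the "energy" cluster falls back to its next candidate, and the "oil" cluster in turn skips both "news" and "energy". Sibling paths are independent, so a phrase may still recur in clusters that do not share a path.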
20. Application instructions on an information storage medium, wherein the instructions, when executed, effect an information exploration interface, the application instructions comprising:
a clustering process that determines a hierarchy of clusters for a set of documents by initially treating the set of documents as a single cluster and iteratively dividing the single cluster and sub-clusters created by the dividing until stopping criteria is met;
a cluster-naming process that selects representative phrases for each cluster in the hierarchy of clusters, wherein the cluster-naming process selects representative phrases by;
  determining a phrase index for the hierarchy of clusters, wherein the phrase index includes a list of phrases that occur in at least a predetermined number of documents of at least one leaf clusters in the hierarchy of clusters, and for each phrase in the list of phrases, the phrase index identifies each leaf cluster containing the phrase;
  scoring each phrase in the phrase index; and
  selecting representative phrases from the phrase index, wherein the representative phrases for any given cluster are indicated by the phrase index to be absent from any leaf nodes that are not descendants of the given cluster; and
an exploration process that interactively displays the representative phrases from the phrase index to a user.
Dependent claims: 21, 22, 23, 24
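The top-down clustering process of claim 20 (start with one cluster, keep dividing until a stopping criterion is met) can be sketched in Python as follows. The claim does not say how a cluster is divided or what the stopping criteria are, so both are assumptions here: the sketch splits on the coordinate with the largest spread at its midpoint, and stops when a cluster has at most `max_leaf_size` documents or cannot be divided. All names are hypothetical.

```python
def divide(vectors, ids, max_leaf_size=2):
    """Return a nested dict describing the cluster hierarchy for ids."""
    node = {"docs": list(ids), "children": []}
    if len(ids) <= max_leaf_size:  # assumed stopping criterion
        return node

    def bounds(d):
        values = [vectors[i][d] for i in ids]
        return min(values), max(values)

    def spread(d):
        lo, hi = bounds(d)
        return hi - lo

    # Assumed split rule: cut the widest coordinate at its midpoint.
    axis = max(range(len(vectors[ids[0]])), key=spread)
    if spread(axis) == 0:  # identical vectors cannot be divided further
        return node
    lo, hi = bounds(axis)
    mid = (lo + hi) / 2
    left = [i for i in ids if vectors[i][axis] <= mid]
    right = [i for i in ids if vectors[i][axis] > mid]
    node["children"] = [
        divide(vectors, left, max_leaf_size),
        divide(vectors, right, max_leaf_size),
    ]
    return node

def leaf_clusters(node):
    """Collect the document lists of all leaf clusters, left to right."""
    if not node["children"]:
        return [node["docs"]]
    return [docs for child in node["children"] for docs in leaf_clusters(child)]

# Hypothetical two-dimensional document vectors.
vectors = {
    "d1": (0.0, 0.1), "d2": (0.1, 0.0),
    "d3": (1.0, 0.9), "d4": (0.9, 1.0), "d5": (1.0, 1.0),
}
hierarchy = divide(vectors, list(vectors))
```

Each document ends up in exactly one leaf, so the resulting clusters are mutually exclusive at every level of the hierarchy.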
25. A computer-implemented information exploration method that comprises:
processing a set of documents with a computer to identify a hierarchy of mutually exclusive clusters of documents;
creating a phrase index for the hierarchy of mutually exclusive clusters with a computer, wherein the phrase index includes a list of phrases that occur in at least a predetermined number of documents of at least one leaf clusters in the hierarchy of mutually exclusive clusters, and for each phrase in the list of phrases, the phrase index identifies each leaf cluster containing the phrase; and
selecting with a computer one or more phrases from the phrase index as representative phrases for each cluster in the hierarchy of mutually exclusive clusters, wherein the representative phrases for any given cluster are indicated by the phrase index to be absent from clusters that are not descendants of the given cluster in the hierarchy of mutually exclusive clusters.
Dependent claims: 26
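The exclusionary condition in claim 25 (a representative phrase must be absent from clusters that are not descendants of the cluster it labels) reduces to a subset test once a phrase-to-leaf mapping exists. The Python sketch below assumes such a mapping is already available and checks the condition at the leaf level only; `descendant_leaves`, `exclusive_phrases`, `tree` and `phrase_to_leaves` are hypothetical names.

```python
def descendant_leaves(tree, node):
    """Return the set of leaf ids under node (a leaf is its own descendant)."""
    children = tree.get(node, [])
    if not children:
        return {node}
    leaves = set()
    for child in children:
        leaves |= descendant_leaves(tree, child)
    return leaves

def exclusive_phrases(tree, phrase_to_leaves, node):
    """Phrases whose every containing leaf lies under node."""
    own = descendant_leaves(tree, node)
    return sorted(p for p, leaves in phrase_to_leaves.items() if leaves <= own)

# Hypothetical hierarchy: parent id -> list of child ids.
tree = {"root": ["energy", "sports"], "energy": ["oil", "solar"]}
# Hypothetical phrase index: phrase -> leaf clusters containing it.
phrase_to_leaves = {
    "price": {"oil", "solar"},
    "oil price": {"oil"},
    "season": {"sports"},
    "market": {"oil", "sports"},
}
energy_candidates = exclusive_phrases(tree, phrase_to_leaves, "energy")
oil_candidates = exclusive_phrases(tree, phrase_to_leaves, "oil")
```

Here "market" is excluded from the "energy" cluster because it also occurs in the "sports" leaf, while "price" qualifies for "energy" but not for "oil" alone.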
Specification