Latent semantic clustering
First Claim
1. A computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
(a) generating a document-representation of each document in an abstract mathematical space;
(b) identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster is associated with an exemplary document and a plurality of other documents; and
(c) identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster, wherein step (c) comprises,
(c1) identifying a non-intersecting document cluster from among the plurality of document clusters if (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster is above a predefined similarity threshold and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster is above a predefined dissimilarity threshold; and
(d) iteratively adjusting the predefined similarity threshold from a maximum similarity level to a minimum similarity level via a predefined similarity increment;
(e) iteratively adjusting the predefined dissimilarity threshold from a minimum dissimilarity level to a maximum dissimilarity level via a predefined dissimilarity increment; and
(f) repeating step (c1) for each similarity level and each dissimilarity level.
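The threshold sweep recited in steps (c1) through (f) can be illustrated with a minimal Python sketch. It is not the patented implementation. Several choices are assumptions made only for illustration: cosine similarity stands in for "conceptual similarity", a normalized centroid stands in for the "cluster-representation", dissimilarity is taken as one minus the cosine between centroids, documents of an accepted cluster are withdrawn from later passes, and "a plurality of other documents" is read as two or more. The function name `find_clusters` and the demo vectors are hypothetical.

```python
import numpy as np

def unit(x):
    """Scale a vector, or each row of a matrix, to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def find_clusters(doc_vectors, sim_levels, dissim_levels):
    """Return (exemplar, members) pairs found while sweeping both thresholds."""
    docs = unit(np.asarray(doc_vectors, dtype=float))
    sim = docs @ docs.T                     # pairwise cosine similarity
    unassigned = set(range(len(docs)))      # assumption: accepted docs are withdrawn
    accepted, accepted_centroids = [], []

    for sim_thr in sim_levels:              # step (d): maximum -> minimum
        for dis_thr in dissim_levels:       # step (e): minimum -> maximum
            # Step (b): one candidate cluster per exemplar; identical
            # member sets are merged so each candidate appears once.
            candidates = {}
            for ex in sorted(unassigned):
                others = [d for d in sorted(unassigned)
                          if d != ex and sim[ex, d] > sim_thr]
                if len(others) >= 2:        # exemplar plus a plurality of others
                    candidates.setdefault(frozenset([ex, *others]), ex)
            # Assumed cluster-representation: normalized centroid of the members.
            centroids = {m: unit(docs[sorted(m)].mean(axis=0)) for m in candidates}

            # Step (c1): condition (i) holds by construction; check condition (ii)
            # against every other candidate and every previously accepted cluster.
            for members, ex in candidates.items():
                if not members <= unassigned:
                    continue                # overlaps a cluster accepted in this pass
                rivals = [c for m, c in centroids.items() if m != members]
                rivals += accepted_centroids
                if all(1.0 - float(centroids[members] @ c) > dis_thr for c in rivals):
                    accepted.append((ex, sorted(members)))
                    accepted_centroids.append(centroids[members])
                    unassigned -= members
    return accepted

# Hypothetical data: two tight groups of three documents and one outlier.
demo_vectors = [
    [1.00, 0.00, 0.00], [0.95, 0.05, 0.00], [0.90, 0.10, 0.05],
    [0.00, 1.00, 0.00], [0.05, 0.95, 0.00], [0.00, 0.90, 0.10],
    [0.00, 0.00, 1.00],
]
for exemplar, members in find_clusters(demo_vectors, [0.95, 0.90, 0.85], [0.5, 0.7]):
    print(exemplar, members)
```

Starting the similarity threshold high and the dissimilarity threshold low means the tightest, most clearly separated groups are accepted first in this sketch; looser groupings only become eligible at later levels.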
4 Assignments
0 Petitions
Abstract
An embodiment of the present invention provides a computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, including the following steps: generating a document-representation of each document in an abstract mathematical space; identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster is associated with an exemplary document and a plurality of other documents; and identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster. Variants of the method enable creating a hierarchy of clusters and conducting incremental updates of preexisting hierarchical structures.
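The abstract leaves the "abstract mathematical space" unspecified. One common way to build such a space, suggested by the title but assumed here rather than taken from the text, is latent semantic indexing: a truncated singular value decomposition of a term-by-document count matrix. The sketch below uses only NumPy; the function names `lsi_document_vectors` and `cosine`, and the toy vocabulary, are hypothetical.

```python
import numpy as np

def lsi_document_vectors(term_doc_counts, k):
    """Map each document to a k-dimensional latent vector via truncated SVD.

    term_doc_counts: rows are terms, columns are documents (an assumed layout).
    Returns an array with one row per document.
    """
    a = np.asarray(term_doc_counts, dtype=float)
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    # Keep the k largest singular values; document coordinates are V_k * S_k.
    return vt[:k].T * s[:k]

def cosine(x, y):
    """Cosine similarity, used here as a stand-in for conceptual similarity."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical toy corpus: rows = cat, dog, pet, stock, bond, market.
counts = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
vectors = lsi_document_vectors(counts, k=2)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy corpus the first two documents share only the term "pet", yet they land close together in the reduced space, while documents about unrelated topics stay far apart. That behavior is what makes a similarity computed in such a space "conceptual" rather than purely lexical.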
127 Citations
8 Claims
1. A computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
(a) generating a document-representation of each document in an abstract mathematical space;
(b) identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster is associated with an exemplary document and a plurality of other documents; and
(c) identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster, wherein step (c) comprises,
(c1) identifying a non-intersecting document cluster from among the plurality of document clusters if (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster is above a predefined similarity threshold and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster is above a predefined dissimilarity threshold; and
(d) iteratively adjusting the predefined similarity threshold from a maximum similarity level to a minimum similarity level via a predefined similarity increment;
(e) iteratively adjusting the predefined dissimilarity threshold from a minimum dissimilarity level to a maximum dissimilarity level via a predefined dissimilarity increment; and
(f) repeating step (c1) for each similarity level and each dissimilarity level.
Dependent claims: 2, 3, 4
5. A computer program product for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
a computer usable medium having computer readable program code embodied in said medium for causing an application program to execute on an operating system of a computer, said computer readable program code comprising:
computer readable first program code that causes the computer to generate a document-representation of each document in an abstract mathematical space;
computer readable second program code that causes the computer to identify a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster includes an exemplary document and a plurality of other documents; and
computer readable third program code that causes the computer to identify a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster, wherein the computer readable third program code comprises,
code that causes the computer to identify a non-intersecting document cluster from among the plurality of document clusters if (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster is above a predefined similarity threshold and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster is above a predefined dissimilarity threshold; and
computer readable fourth program code that causes the computer to iteratively adjust the predefined similarity threshold from a maximum similarity level to a minimum similarity level via a predefined similarity increment;
computer readable fifth program code that causes the computer to iteratively adjust the predefined dissimilarity threshold from a minimum dissimilarity level to a maximum dissimilarity level via a predefined dissimilarity increment; and
computer readable sixth program code that causes the computer to repeat the third computer readable program code means for each similarity level and each dissimilarity level.
Dependent claims: 6, 7, 8
Specification