Latent semantic clustering
Abstract
An embodiment of the present invention provides a computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, including the following steps: generating a document-representation of each document in an abstract mathematical space; identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster is associated with an exemplary document and a plurality of other documents; and identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster. Variants of the method enable creating a hierarchy of clusters and conducting incremental updates of preexisting hierarchical structures.
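The first two steps of the abstract (representing each document as a vector, then grouping documents around an exemplary document by pairwise similarity) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the title suggests a latent semantic space, but this sketch uses raw term-count vectors as a simpler stand-in, and the names `represent`, `cosine`, `candidate_clusters`, and the `threshold` value are hypothetical rather than taken from the patent.

```python
import math
from collections import Counter

def represent(text):
    """Map a document to a sparse term-count vector.

    Assumption: a plain bag-of-words space stands in for the
    'abstract mathematical space' named in the abstract.
    """
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def candidate_clusters(vectors, threshold=0.5):
    """Treat each document as a possible exemplar and collect the other
    documents whose similarity to it meets the (hypothetical) threshold."""
    clusters = {}
    for i, vi in enumerate(vectors):
        members = [
            j for j, vj in enumerate(vectors)
            if j != i and cosine(vi, vj) >= threshold
        ]
        # Keep only exemplars with at least two other documents,
        # one reading of "a plurality of other documents".
        if len(members) >= 2:
            clusters[i] = members
    return clusters

docs = [
    "cats purr and cats sleep",
    "cats sleep all day",
    "cats purr when happy",
    "stocks fell on monday",
    "stocks rose on tuesday",
    "stocks fell and bonds rose",
]
vectors = [represent(d) for d in docs]
clusters = candidate_clusters(vectors, threshold=0.3)
```

In this toy collection, document 0 becomes an exemplar for documents 1 and 2, while documents 1 and 2 each have only one sufficiently similar neighbor and so do not anchor clusters of their own. The candidate clusters produced this way may overlap, which is what the third step of the abstract addresses.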
29 Claims
1. A computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
(a) generating a document-representation of each document in an abstract mathematical space;
(b) identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster is associated with an exemplary document and a plurality of other documents; and
(c) identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster.
10. A computer program product for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
a computer usable medium having computer readable program code embodied in said medium for causing an application program to execute on an operating system of a computer, said computer readable program code comprising:
computer readable first program code that causes the computer to generate a document-representation of each document in an abstract mathematical space;
computer readable second program code that causes the computer to identify a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster includes an exemplary document and a plurality of other documents; and
computer readable third program code that causes the computer to identify a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster.
19. A computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
(a) generating a document-representation of each document in an abstract mathematical space;
(b) identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster includes a plurality of documents;
(c) computing an intra-cluster conceptual similarity for each document cluster based on the document-representations of the plurality of documents included in each cluster;
(d) computing inter-cluster conceptual dissimilarities between pairs of document clusters in the plurality of document clusters; and
(e) identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) the intra-cluster conceptual similarities and (ii) the inter-cluster conceptual dissimilarities.
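Steps (c) through (e) of claim 19 can be illustrated with a small sketch. The claim does not fix how the similarities are computed, so the choices below are assumptions: intra-cluster similarity is taken as the mean pairwise cosine similarity, the cluster-representation is taken as the centroid, inter-cluster dissimilarity is one minus the cosine of two centroids, and the `min_intra` and `min_inter` thresholds are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(cluster):
    """Component-wise mean of the document vectors (assumed cluster-representation)."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def intra_similarity(cluster):
    """Mean pairwise cosine similarity; assumes at least two documents."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def inter_dissimilarity(a, b):
    """One minus the cosine similarity of the two cluster centroids."""
    return 1.0 - cosine(centroid(a), centroid(b))

def non_intersecting(clusters, min_intra=0.8, min_inter=0.5):
    """Return names of clusters that are internally tight and far from
    every other cluster, under the hypothetical thresholds."""
    selected = []
    for name, docs in clusters.items():
        tight = intra_similarity(docs) >= min_intra
        separated = all(
            inter_dissimilarity(docs, other) >= min_inter
            for other_name, other in clusters.items()
            if other_name != name
        )
        if tight and separated:
            selected.append(name)
    return selected

clusters = {
    "A": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.0]],
    "B": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    "C": [[0.0, 0.9, 0.2], [0.0, 1.0, 0.1]],
}
result = non_intersecting(clusters)
```

With this toy data, clusters B and C point in nearly the same direction, so neither is sufficiently dissimilar from the other, and only cluster A passes both tests.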
20. A computer-based method for automatically organizing documents in a collection of documents into clusters of documents, comprising:
(a) generating a representation of each document in an abstract mathematical space;
(b) measuring a similarity between the representation of each document in the collection of documents and the representation of at least one other document in the collection of documents;
(c) labeling each document in the collection of documents with a first mapping or a second mapping based on the similarity measurements; and
(d) organizing the documents into clusters based on the mappings.
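Claim 20 does not say what the first and second mappings denote. As one hypothetical reading, offered only to make steps (b) through (d) concrete, the sketch below labels a document "linked" (first mapping) when at least one other document is sufficiently similar to it and "isolated" (second mapping) otherwise, then groups linked documents into connected components. The label names, the precomputed similarity matrix `sim`, and the `threshold` are all assumptions.

```python
def label_documents(sim, threshold=0.5):
    """Label document i 'linked' if some other document has similarity
    of at least `threshold` to it, otherwise 'isolated'."""
    n = len(sim)
    return [
        "linked"
        if any(sim[i][j] >= threshold for j in range(n) if j != i)
        else "isolated"
        for i in range(n)
    ]

def organize(sim, labels, threshold=0.5):
    """Group 'linked' documents into connected components of the
    similarity graph; each 'isolated' document forms its own cluster."""
    n = len(sim)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        if labels[start] == "isolated":
            clusters.append([start])
            continue
        # Depth-first traversal over sufficiently similar linked documents.
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in seen and labels[j] == "linked" and sim[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters

# Hypothetical symmetric similarity matrix for four documents.
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.3],
    [0.0, 0.1, 0.3, 1.0],
]
labels = label_documents(sim)
clusters = organize(sim, labels)
```

Here documents 0 and 1 are labeled "linked" and end up in one cluster, while documents 2 and 3 are labeled "isolated" and remain singletons. Other interpretations of the two mappings are equally consistent with the claim language.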
25. A computer program product for automatically organizing documents in a collection of documents into clusters of documents, comprising:
a computer usable medium having computer readable program code embodied in said medium for causing an application program to execute on an operating system of a computer, said computer readable program code comprising:
computer readable first program code that causes the computer to generate a representation of each document in an abstract mathematical space;
computer readable second program code that causes the computer to measure a similarity between the representation of each document in the collection of documents and the representation of at least one other document in the collection of documents;
computer readable third program code that causes the computer to label each document in the collection of documents with a first mapping or a second mapping based on the similarity measurements; and
computer readable fourth program code that causes the computer to organize the documents into clusters based on the mappings.
Specification