UNSUPERVISED DOCUMENT CLUSTERING USING LATENT SEMANTIC DENSITY ANALYSIS
Abstract
According to one embodiment, a latent semantic mapping (LSM) space is generated from a collection of a plurality of documents, where the LSM space includes a plurality of document vectors, each representing one of the documents in the collection. For each of the document vectors considered as a centroid document vector, a group of document vectors that are within a predetermined hypersphere diameter from the centroid document vector is identified in the LSM space. As a result, multiple groups of document vectors are formed. The predetermined hypersphere diameter represents a predetermined closeness measure among the document vectors in the LSM space. Thereafter, a group from the plurality of groups is designated as a cluster of document vectors, where the designated group contains the maximum number of document vectors among the plurality of groups.
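The grouping and designation steps described in the abstract can be illustrated with a minimal sketch. It assumes the document vectors already exist in the LSM space, uses Euclidean distance as the closeness measure, and treats the hypersphere diameter as a maximum distance from the centroid; the abstract fixes none of these choices. The function name `densest_group`, the parameter `threshold`, and the sample vectors are hypothetical.

```python
import math

def densest_group(vectors, threshold):
    """Return (centroid_index, member_indices) for the largest group.

    Every vector is tried as a centroid; its group is the set of vectors
    whose distance to it is at most `threshold` (an assumed reading of
    "within a predetermined hypersphere diameter").
    """
    best_center, best_members = None, []
    for i, center in enumerate(vectors):
        members = [j for j, v in enumerate(vectors)
                   if math.dist(center, v) <= threshold]
        # Keep the group with the most members; ties keep the earlier centroid.
        if len(members) > len(best_members):
            best_center, best_members = i, members
    return best_center, best_members

# Hypothetical 2-D "document vectors": three close together, two far away.
docs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
center, members = densest_group(docs, threshold=0.5)
print(center, members)
```

Resolving ties in favor of the first centroid encountered is a choice made for this sketch only; the abstract does not address ties.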
24 Claims
1. A computer-implemented method for clustering documents, comprising:
generating a latent semantic mapping (LSM) space from a collection of a plurality of documents, the LSM space includes a plurality of document vectors, each representing one of the documents in the collection;
for each of the document vectors considered as a centroid document vector, identifying a group of document vectors in the LSM space that are within a predetermined hypersphere diameter from the centroid document vector, forming a plurality of groups of document vectors, wherein the predetermined hypersphere diameter represents a predetermined closeness measure among the document vectors in the LSM space; and
designating a group from the plurality of groups as a cluster of document vectors, wherein the designated group contains a maximum number of document vectors among the plurality of groups.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A machine-readable storage medium having instructions stored therein, which when executed by a machine, cause the machine to perform a method for clustering documents, the method comprising:
generating a latent semantic mapping (LSM) space from a collection of a plurality of documents, the LSM space includes a plurality of document vectors, each representing one of the documents in the collection;
for each of the document vectors considered as a centroid document vector, identifying a group of document vectors in the LSM space that are within a predetermined hypersphere diameter from the centroid document vector, forming a plurality of groups of document vectors, wherein the predetermined hypersphere diameter represents a predetermined closeness measure among the document vectors in the LSM space; and
designating a group from the plurality of groups as a cluster of document vectors, wherein the designated group contains a maximum number of document vectors among the plurality of groups.
Dependent claims: 10, 11, 12, 14, 15, 16
13. The machine-readable storage medium, wherein the predetermined hypersphere diameter is selected from a range of hypersphere diameters having incremental size in sequence, and wherein the predetermined hypersphere diameter is identified when a difference in numbers of document vectors in two adjacent hypersphere diameters in the range reaches the maximum.
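The diameter selection recited in claim 13 can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance stands in for the closeness measure, the counts are taken around a single given centroid, and the larger of the two adjacent diameters is returned when the jump in counts is greatest (the claim does not say which of the two is selected). The name `pick_diameter` and the sample data are hypothetical.

```python
import math

def pick_diameter(vectors, center, diameters):
    """Return (chosen_diameter, counts) for an increasing list of diameters.

    counts[k] is the number of vectors within diameters[k] of `center`.
    The chosen diameter is the one where the count grows the most
    relative to the previous, smaller diameter.
    """
    counts = [sum(1 for v in vectors if math.dist(center, v) <= d)
              for d in diameters]
    # Index k of the largest difference between adjacent counts.
    best_k = max(range(1, len(diameters)),
                 key=lambda k: counts[k] - counts[k - 1])
    return diameters[best_k], counts

# Hypothetical 2-D vectors and candidate diameters.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.25, 0.0), (3.0, 0.0)]
chosen, counts = pick_diameter(vectors, (0.0, 0.0), [0.05, 0.15, 0.3, 0.45])
print(chosen, counts)
```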
17. A data processing system, comprising:
a processor; and
a memory coupled to the processor for storing instructions, which when executed from the memory, cause the processor to
generate a latent semantic mapping (LSM) space from a collection of a plurality of documents, the LSM space includes a plurality of document vectors, each representing one of the documents in the collection,
for each of the document vectors considered as a centroid document vector, identify a group of document vectors in the LSM space that are within a predetermined hypersphere diameter from the centroid document vector, forming a plurality of groups of document vectors, wherein the predetermined hypersphere diameter represents a predetermined closeness measure among the document vectors in the LSM space, and
designate a group from the plurality of groups as a cluster of document vectors, wherein the designated group contains a maximum number of document vectors among the plurality of groups.
18. A computer-implemented method for classifying a document, comprising:
in response to a new document to be classified, mapping the new document into a new document vector in a latent semantic mapping (LSM) space, the LSM space having one or more semantic anchors representing one or more clusters of document vectors, wherein each of the one or more clusters is generated based on a given collection of document vectors in which a group having a maximum number of document vectors within a predetermined closeness measure in the LSM space is designated as one of the one or more clusters;
determining a closeness distance between the new document vector and each of the semantic anchors in the LSM space; and
classifying the new document as a member of one or more of the clusters if the closeness distance between the new document vector and one or more corresponding semantic anchors is within a predetermined threshold.
Dependent claims: 19
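The determining and classifying steps of claim 18 can be illustrated with a short sketch. It assumes the new document has already been mapped to a vector in the LSM space, that each semantic anchor is a single vector representing its cluster, and that Euclidean distance serves as the closeness distance; the claim specifies none of these. The function `classify`, the cluster names, and the coordinates are hypothetical.

```python
import math

def classify(new_vector, anchors, threshold):
    """Return the names of all clusters whose semantic anchor lies
    within `threshold` of `new_vector`.

    `anchors` maps a cluster name to its anchor vector. A document may
    match several clusters, or none, as the claim language allows.
    """
    return [name for name, anchor in anchors.items()
            if math.dist(new_vector, anchor) <= threshold]

# Hypothetical anchors for two clusters in a 2-D space.
anchors = {"sports": (0.0, 0.0), "finance": (5.0, 5.0)}
near = classify((0.2, 0.1), anchors, threshold=0.5)
far = classify((2.5, 2.5), anchors, threshold=0.5)
print(near, far)
```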
20. A computer-implemented method for clustering documents, comprising:
selecting a hypersphere diameter as a current hypersphere diameter from a range of a plurality of hypersphere diameters in a latent semantic mapping (LSM) space, the LSM space having a plurality of document vectors, each representing one of a plurality of documents of a collection; and
for each of the document vectors in the LSM space considered as a centroid document vector, iteratively performing the following:
identifying a group of document vectors in the LSM space that are within the current hypersphere diameter from the centroid document vector,
calculating a ratio between a first number of document vectors of the identified group associated with the current hypersphere diameter and a second number of document vectors of a group associated with a previous hypersphere diameter,
adjusting the current hypersphere diameter by a predetermined value,
repeating the identifying and calculating operations, forming a plurality of groups of document vectors, and
designating a group associated with a maximum ratio among the calculated plurality of ratios as a cluster candidate.
Dependent claims: 21, 22, 23, 24
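The iterative procedure of claim 20 can be sketched for a single centroid as follows. The sketch assumes Euclidean distance, a diameter that grows by a fixed `step` on each iteration, and no ratio for the first diameter because it has no predecessor; these are illustrative assumptions, not details from the claim. The name `cluster_candidate` and the sample data are hypothetical.

```python
import math

def cluster_candidate(vectors, center, start, step, n_steps):
    """Sweep the diameter around one centroid and return
    (diameter, ratio, member_indices) for the step with the largest
    ratio of current group size to previous group size.
    """
    best = (None, 0.0, [])
    prev_count = None
    for k in range(n_steps):
        diameter = start + k * step  # adjust by the predetermined value
        group = [j for j, v in enumerate(vectors)
                 if math.dist(center, v) <= diameter]
        # Skip empty previous groups to avoid dividing by zero.
        if prev_count:
            ratio = len(group) / prev_count
            if ratio > best[1]:
                best = (diameter, ratio, group)
        prev_count = len(group)
    return best

# Hypothetical 2-D vectors; the centroid is the first vector.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
           (0.21, 0.0), (0.22, 0.0), (3.0, 0.0)]
diameter, ratio, group = cluster_candidate(
    vectors, vectors[0], start=0.05, step=0.1, n_steps=4)
print(diameter, ratio, group)
```

Running the same sweep with each document vector as the centroid would yield one candidate per centroid.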
Specification