Unsupervised document clustering using latent semantic density analysis
Abstract
According to one embodiment, a latent semantic mapping (LSM) space is generated from a collection of a plurality of documents, where the LSM space includes a plurality of document vectors, each representing one of the documents in the collection. For each of the document vectors considered as a centroid document vector, a group of document vectors is identified in the LSM space that are within a predetermined hypersphere diameter from the centroid document vector. As a result, multiple groups of document vectors are formed. The predetermined hypersphere diameter represents a predetermined closeness measure among the document vectors in the LSM space. Thereafter, a group from the plurality of groups is designated as a cluster of document vectors, where the designated group contains a maximum number of document vectors among the plurality of groups.
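The grouping and designation steps described in the abstract can be illustrated with a minimal sketch. It assumes the document vectors already exist in the LSM space, uses Euclidean distance as the closeness measure, and treats "within the hypersphere diameter" as a distance no greater than that diameter. These are assumptions made for illustration; the function name `densest_group` and the sample vectors are hypothetical and do not come from the patent.

```python
import math

def densest_group(vectors, diameter):
    """Treat every vector as a candidate centroid and return the indices of
    the largest group found within `diameter` of any one centroid."""
    best = []
    for centroid in vectors:
        # Group = all vectors no farther than `diameter` from this centroid.
        group = [i for i, v in enumerate(vectors)
                 if math.dist(centroid, v) <= diameter]
        # Keep the group holding the most vectors (first one wins on ties).
        if len(group) > len(best):
            best = group
    return best

# Hypothetical 2-D stand-ins for document vectors in the LSM space.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
cluster = densest_group(vectors, 0.5)
print(cluster)
```

With these sample vectors, the three points near the origin form the largest group and would be designated as the cluster.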
717 Citations
24 Claims
1. A computer-implemented method for clustering documents, comprising:
- at a device comprising one or more processors and memory;
- generating a latent semantic mapping (LSM) space from a collection of a plurality of documents, the LSM space includes a plurality of document vectors, each representing one of the documents in the collection;
- identifying a plurality of centroid document vectors from the plurality of document vectors;
- forming a plurality of document groups each including a respective group of document vectors in the LSM space that are within a predetermined hypersphere diameter from a respective one of the plurality of centroid document vectors, wherein the predetermined hypersphere diameter represents a predetermined closeness measure among the document vectors in the LSM space; and
- selectively designating a particular document group from the plurality of document groups as a document cluster based on the particular document group containing a maximum number of document vectors among the plurality of document groups.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A non-transitory machine-readable storage medium having instructions stored thereon, which when executed by a machine, cause the machine to perform a method for clustering documents, the method comprising:
- generating a latent semantic mapping (LSM) space from a collection of a plurality of documents, the LSM space includes a plurality of document vectors, each representing one of the documents in the collection;
- identifying a plurality of centroid document vectors from the plurality of document vectors;
- forming a plurality of document groups each including a respective group of document vectors in the LSM space that are within a predetermined hypersphere diameter from a respective one of the plurality of centroid document vectors, wherein the predetermined hypersphere diameter represents a predetermined closeness measure among the document vectors in the LSM space; and
- selectively designating a particular document group from the plurality of groups as a document cluster based on the particular document group containing a maximum number of document vectors among the plurality of document groups.
Dependent claims: 10, 11, 12, 14, 15, 16
13. The machine-readable storage medium, wherein the predetermined hypersphere diameter is selected from a range of hypersphere diameters having incremental size in sequence, and wherein the predetermined hypersphere diameter is identified when a difference in numbers of document vectors in two adjacent hypersphere diameters in the range reaches the maximum.
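The diameter selection in claim 13 can be sketched as follows. The sketch assumes Euclidean distance as the closeness measure, a single fixed centroid, and that the larger of the two adjacent diameters is the one selected when the jump in vector count peaks; the claim does not say which of the two is chosen. The names `count_within` and `pick_diameter` and the sample data are hypothetical.

```python
import math

def count_within(centroid, vectors, diameter):
    """Count vectors whose distance from the centroid is at most `diameter`."""
    return sum(1 for v in vectors if math.dist(centroid, v) <= diameter)

def pick_diameter(centroid, vectors, diameters):
    """Return the diameter at which the count of enclosed vectors grows the
    most relative to the preceding (smaller) diameter in the sequence."""
    if len(diameters) < 2:
        raise ValueError("need at least two candidate diameters")
    counts = [count_within(centroid, vectors, d) for d in diameters]
    # Difference in counts between each pair of adjacent diameters.
    jumps = [counts[i] - counts[i - 1] for i in range(1, len(counts))]
    # Index of the largest jump (first one wins on ties).
    best = max(range(len(jumps)), key=jumps.__getitem__)
    # Assumption: report the larger diameter of the winning pair.
    return diameters[best + 1]

# Hypothetical 1-D stand-ins for document vectors.
vectors = [(0.0,), (0.7,), (0.8,), (0.9,), (1.7,)]
diameters = [0.5, 1.0, 1.5, 2.0]  # incremental sizes in sequence
chosen = pick_diameter((0.0,), vectors, diameters)
print(chosen)
```

Here the counts for the four diameters are 1, 4, 4, and 5, so the largest jump occurs between 0.5 and 1.0.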
17. A data processing system, comprising:
- one or more processors; and
- a memory coupled to the one or more processors and storing instructions, which when executed by the one or more processors, cause the processors to:
  - generate a latent semantic mapping (LSM) space from a collection of a plurality of documents, the LSM space includes a plurality of document vectors, each representing one of the documents in the collection,
  - identify a plurality of centroid document vectors from the plurality of document vectors;
  - form a plurality of document clusters each including a respective group of document vectors in the LSM space that are within a predetermined hypersphere diameter from the centroid document vector, wherein the predetermined hypersphere diameter represents a predetermined closeness measure among the document vectors in the LSM space, and
  - selectively designate a particular document group from the plurality of document groups as a document cluster based on the particular document group containing a maximum number of document vectors among the plurality of document groups.
18. A computer-implemented method for classifying a document, comprising:
- at a device comprising one or more processors and memory;
- in response to receiving a new document to be classified, mapping the new document into a new document vector in a latent semantic mapping (LSM) space, the LSM space having one or more semantic anchors representing one or more document clusters, wherein each of the one or more document clusters is generated based on a respective iteration of an iterative process performed on a given collection of document vectors, wherein, during the respective iteration, a particular document group from a plurality of document groups is selectively designated as the document cluster based on the particular document group containing a maximum number of document vectors among the plurality of document groups, and wherein each of the plurality of document groups includes a respective group of document vectors within a predetermined closeness measure of a respective one of a plurality of centroid document vectors in the LSM space;
- determining a closeness distance between the new document vector and each of the semantic anchors in the LSM space; and
- classifying the new document as a member of one or more of the document clusters if the closeness distance between the new document vector and one or more corresponding semantic anchors is within a predetermined threshold.
Dependent claims: 19
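The classification step of claim 18 can be sketched as a comparison of the new document vector against each semantic anchor. The sketch assumes the new document has already been mapped into the LSM space and that the closeness distance is Euclidean; both are illustrative assumptions. The function name `classify`, the anchor labels, and the vectors are hypothetical.

```python
import math

def classify(doc_vector, anchors, threshold):
    """Return the label of every cluster whose semantic anchor lies within
    `threshold` of the new document vector (possibly none, possibly several)."""
    return [label for label, anchor in anchors.items()
            if math.dist(doc_vector, anchor) <= threshold]

# Hypothetical semantic anchors, one per previously formed document cluster.
anchors = {"sports": (1.0, 0.0), "finance": (0.0, 1.0)}

labels = classify((0.9, 0.1), anchors, 0.5)
print(labels)
```

Because the claim allows membership in "one or more" clusters, the sketch returns a list rather than a single best match; a vector far from every anchor yields an empty list.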
20. A computer-implemented method for clustering documents, comprising:
- at a device comprising one or more processors and memory;
- selecting a hypersphere diameter as a current hypersphere diameter from a range of a plurality of hypersphere diameters in a latent semantic mapping (LSM) space, the LSM space having a plurality of document vectors, each representing one of a plurality of documents of a collection; and
- for each of the document vectors in the LSM space considered as a centroid document vector, iteratively performing the following:
  - identifying a document group in the LSM space, the document group including a respective group of document vectors that are within the current hypersphere diameter from the centroid document vector,
  - calculating a ratio between a first number of document vectors of the identified document group associated with the current hypersphere diameter and a second number of document vectors of a document group associated with a previous hypersphere diameter,
  - adjusting the current hypersphere diameter by a predetermined value,
  - repeating the identifying and calculating operations one or more times to form a plurality of document groups, and
  - selectively designating a particular document group associated with a maximum ratio among the calculated plurality of ratios as an initial cluster candidate.
Dependent claims: 21, 22, 23, 24
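The ratio-based selection in claim 20 can be sketched for a single centroid as follows. The sketch assumes Euclidean distance, a diameter that grows by a fixed step on each pass, and a ratio computed as the current group size divided by the previous group size. These choices, the function names, and the sample data are illustrative assumptions and not the patent's implementation.

```python
import math

def group_within(centroid, vectors, diameter):
    """Indices of vectors no farther than `diameter` from the centroid."""
    return [i for i, v in enumerate(vectors)
            if math.dist(centroid, v) <= diameter]

def best_group_by_ratio(centroid, vectors, start, step, rounds):
    """Grow the diameter `rounds` times and return (diameter, group, ratio)
    for the group whose size grew most relative to the previous diameter."""
    previous = group_within(centroid, vectors, start)
    best = (start, previous, 0.0)
    for k in range(1, rounds + 1):
        diameter = start + k * step  # adjust by a predetermined value
        current = group_within(centroid, vectors, diameter)
        if previous:  # skip the ratio if the previous group was empty
            ratio = len(current) / len(previous)
            if ratio > best[2]:
                best = (diameter, current, ratio)
        previous = current
    return best

# Hypothetical 1-D stand-ins for document vectors.
vectors = [(0.0,), (0.7,), (0.8,), (0.9,), (1.7,)]
diameter, group, ratio = best_group_by_ratio(vectors[0], vectors, 0.5, 0.5, 3)
print(diameter, group, ratio)
```

In the claimed method this procedure is repeated with each document vector serving as the centroid; the sketch shows one centroid only, where the group sizes are 1, 4, 4, and 5 and the largest ratio occurs at the second diameter.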
Specification