Document clustering that applies a locality sensitive hashing function to a feature vector to obtain a limited set of candidate clusters
First Claim
1. A method of clustering a plurality of documents from a data stream comprising:
generating a feature vector for a document in the plurality of documents;
applying a locality sensitive hashing function to the feature vector;
retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector;
determining a distance between the feature vector of the document and each of the cluster centroids; and
assigning the document to a cluster based on the determined distances.
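The hashing step in the claim can be illustrated with a minimal sketch. The claim does not name a particular locality sensitive hashing family, so the random-hyperplane (sign of dot product) scheme below is an assumption chosen for illustration, and the names `make_hyperplanes` and `lsh_signature` are hypothetical. The point it shows is that feature vectors pointing in similar directions tend to receive the same signature, which is what allows the signature to serve as a lookup key for candidate clusters.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normal vectors) in dim dimensions.

    Assumption: a random-hyperplane LSH family; the source does not specify one.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_signature(vector, hyperplanes):
    """Return one bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)

# Two hypothetical document feature vectors with the same direction.
planes = make_hyperplanes(num_bits=8, dim=4, seed=42)
doc_a = [1.0, 0.5, 0.0, 2.0]
doc_b = [2.0, 1.0, 0.0, 4.0]  # doc_a scaled by 2

sig_a = lsh_signature(doc_a, planes)
sig_b = lsh_signature(doc_b, planes)
print(sig_a == sig_b)  # same direction, so every sign test agrees
```

Using more bits makes each bucket smaller and the candidate set tighter, at the cost of more near neighbors landing in different buckets.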
Abstract
Documents from a data stream are clustered by first generating a feature vector for each document. A set of cluster centroids (e.g., feature vectors of their corresponding clusters) is retrieved from a memory, based on the feature vector of the document, using a locality sensitive hashing function. The centroids may be retrieved by first retrieving a set of cluster identifiers from a cluster table, each cluster identifier indicative of a respective cluster centroid, and then retrieving the cluster centroids corresponding to the retrieved cluster identifiers from a memory. Documents may then be clustered into one or more of the candidate clusters using distance measures from the feature vector of the document to the cluster centroids.
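The two-stage lookup described in the abstract (hash key to cluster identifiers, then cluster identifiers to centroids) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the class name `LshClusterer`, the random-hyperplane hash, the cosine distance, the `threshold` parameter, and the rule of opening a new cluster when no candidate is close enough are all hypothetical choices that the abstract does not specify.

```python
import math
import random
from collections import defaultdict

class LshClusterer:
    """Streaming clusterer that compares each document only to LSH candidates."""

    def __init__(self, dim, num_bits=4, threshold=0.3, seed=0):
        rng = random.Random(seed)
        # Assumption: random-hyperplane LSH; the source names no specific family.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.threshold = threshold              # hypothetical max distance to join
        self.cluster_table = defaultdict(set)   # hash key -> cluster identifiers
        self.centroids = {}                     # cluster identifier -> centroid
        self.sizes = {}                         # cluster identifier -> doc count

    def _key(self, vec):
        return tuple(
            1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
            for plane in self.planes
        )

    @staticmethod
    def _cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        if na == 0 or nb == 0:
            return 1.0
        return 1.0 - dot / (na * nb)

    def add(self, vec):
        """Assign one feature vector to a cluster and return the cluster id."""
        key = self._key(vec)
        # Stage 1: cluster table lookup yields candidate cluster identifiers.
        candidate_ids = self.cluster_table[key]
        # Stage 2: fetch each candidate centroid and measure its distance.
        best_id, best_dist = None, None
        for cid in candidate_ids:
            dist = self._cosine_distance(vec, self.centroids[cid])
            if best_dist is None or dist < best_dist:
                best_id, best_dist = cid, dist
        if best_id is not None and best_dist <= self.threshold:
            # Join the nearest candidate and update its centroid as a running mean.
            n = self.sizes[best_id]
            self.centroids[best_id] = [
                (c * n + v) / (n + 1)
                for c, v in zip(self.centroids[best_id], vec)
            ]
            self.sizes[best_id] = n + 1
            return best_id
        # No candidate is close enough: open a new cluster under this key.
        new_id = len(self.centroids)
        self.centroids[new_id] = list(vec)
        self.sizes[new_id] = 1
        self.cluster_table[key].add(new_id)
        return new_id

clusterer = LshClusterer(dim=3, seed=1)
first = clusterer.add([1.0, 0.0, 0.0])
second = clusterer.add([2.0, 0.0, 0.0])   # same direction as the first document
third = clusterer.add([-1.0, 0.0, 0.0])   # opposite direction
print(first, second, third)
```

In this simplified version a cluster stays registered under the key of the document that created it. A fuller version would presumably re-index a cluster when its centroid drifts into a different bucket, and might use several hash tables to reduce missed neighbors.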
24 Claims
1. A method of clustering a plurality of documents from a data stream comprising:
generating a feature vector for a document in the plurality of documents;
applying a locality sensitive hashing function to the feature vector;
retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector;
determining a distance between the feature vector of the document and each of the cluster centroids; and
assigning the document to a cluster based on the determined distances.
Dependent claims: 2, 3, 4, 5, 6, 7, 8

9. An apparatus for clustering a plurality of documents from a data stream comprising:
means for generating a feature vector for a document in the plurality of documents;
means for applying a locality sensitive hashing function to the feature vector;
means for retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector;
means for determining a distance between the feature vector of the document and each of the cluster centroids; and
means for assigning the document to a cluster based on the determined distances.
Dependent claims: 10, 11, 12, 13, 14, 15, 16

17. A machine readable medium having program instructions stored thereon, the instructions capable of execution by a processor and defining the steps of:
generating a feature vector for a document in a plurality of documents;
applying a locality sensitive hashing function to the feature vector;
retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector;
determining a distance between the feature vector of the document and each of the cluster centroids; and
assigning the document to a cluster based on the determined distances.
Dependent claims: 18, 19, 20, 21, 22, 23, 24

Specification