Document clustering that applies a locality sensitive hashing function to a feature vector to obtain a limited set of candidate clusters
Abstract
Documents from a data stream are clustered by first generating a feature vector for each document. A set of cluster centroids (e.g., feature vectors of their corresponding clusters) is retrieved from a memory based on the feature vector of the document using a locality sensitive hashing function. To retrieve the centroids, a set of cluster identifiers, each indicative of a respective cluster centroid, may first be retrieved from a cluster table; the cluster centroids corresponding to the retrieved cluster identifiers are then retrieved from a memory. Documents may then be clustered into one or more of the candidate clusters using distance measures from the feature vector of the document to the cluster centroids.
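The flow described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the random-hyperplane hash, the cosine distance, the term-count featurizer, the running-mean centroid update, the `threshold` parameter, and the rule of starting a new cluster when no candidate is close enough are all choices made here for the example. The names `feature_vector`, `cosine_distance`, and `LSHClusterer` are hypothetical.

```python
import math
import random
from collections import defaultdict


def feature_vector(text, vocab):
    """Term-count vector over a fixed vocabulary (assumed featurizer)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


class LSHClusterer:
    """Streaming clustering with an LSH lookup of candidate centroids."""

    def __init__(self, dim, num_bits=8, threshold=0.3, seed=0):
        rng = random.Random(seed)
        # Random hyperplanes: one common LSH family, assumed for this sketch.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.threshold = threshold              # assumed cutoff for joining a cluster
        self.cluster_table = defaultdict(set)   # hash key -> cluster identifiers
        self.centroids = {}                     # cluster identifier -> centroid
        self.sizes = {}                         # cluster identifier -> document count

    def hash_key(self, vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0
                     for plane in self.planes)

    def add(self, vec):
        # Retrieve candidate cluster identifiers, then their centroids.
        candidates = self.cluster_table.get(self.hash_key(vec), set())
        scored = [(cosine_distance(vec, self.centroids[c]), c) for c in candidates]
        best = min(scored, default=None)
        if best is None or best[0] > self.threshold:
            # No close candidate: start a new cluster (assumed policy).
            cid = len(self.centroids)
            self.centroids[cid] = list(vec)
            self.sizes[cid] = 1
        else:
            # Assign to the nearest candidate and update its running mean.
            cid = best[1]
            n = self.sizes[cid]
            self.centroids[cid] = [(c * n + v) / (n + 1)
                                   for c, v in zip(self.centroids[cid], vec)]
            self.sizes[cid] = n + 1
        # Register the (possibly moved) centroid under its current hash key.
        # Older keys are left in place and only add extra candidates.
        self.cluster_table[self.hash_key(self.centroids[cid])].add(cid)
        return cid
```

Only the centroids that share the document's hash key are compared, so the distance computation runs over a limited candidate set rather than over every cluster.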
24 Claims
1. A method of clustering a plurality of documents from a data stream comprising:
generating a feature vector for a document in the plurality of documents;
applying a locality sensitive hashing function to the feature vector;
retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector;
determining a distance between the feature vector of the document and each of the cluster centroids; and
assigning the document to a cluster based on the determined distances.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. An apparatus for clustering a plurality of documents from a data stream comprising:
means for generating a feature vector for a document in the plurality of documents;
means for applying a locality sensitive hashing function to the feature vector;
means for retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector;
means for determining a distance between the feature vector of the document and each of the cluster centroids; and
means for assigning the document to a cluster based on the determined distances.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A machine readable medium having program instructions stored thereon, the instructions capable of execution by a processor and defining the steps of:
generating a feature vector for a document in a plurality of documents;
applying a locality sensitive hashing function to the feature vector;
retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector;
determining a distance between the feature vector of the document and each of the cluster centroids; and
assigning the document to a cluster based on the determined distances.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
Specification