Document clustering that applies a locality sensitive hashing function to a feature vector to obtain a limited set of candidate clusters

US 7,797,265 B2
Filed: 02/25/2008
Issued: 09/14/2010
Est. Priority Date: 02/26/2007
Status: Expired due to Fees

Abstract

Documents from a data stream are clustered by first generating a feature vector for each document. A set of cluster centroids (e.g., feature vectors of their corresponding clusters) is retrieved from a memory based on the feature vector of the document using a locality sensitive hashing function. The centroids may be retrieved by retrieving a set of cluster identifiers from a cluster table, the cluster identifiers each indicative of a respective cluster centroid, and retrieving the cluster centroids corresponding to the retrieved cluster identifiers from a memory. Documents may then be clustered into one or more of the candidate clusters using distance measures from the feature vector of the document to the cluster centroids.
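The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the random-hyperplane (sign of projection) hash, the cosine distance, the threshold value, and all names (`StreamClusterer`, `lsh_signature`, `cluster_table`) are assumptions chosen for illustration, and centroid updating is left out to keep the example short.

```python
import math
import random
from collections import defaultdict

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw random hyperplanes; each contributes one bit of the signature."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_signature(vec, hyperplanes):
    """Assumed hash: sign of projection, so nearby vectors tend to collide."""
    return tuple(int(sum(h * x for h, x in zip(plane, vec)) >= 0.0)
                 for plane in hyperplanes)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

class StreamClusterer:
    def __init__(self, dim, num_bits=8, threshold=0.3, seed=0):
        self.hyperplanes = make_hyperplanes(num_bits, dim, seed)
        self.threshold = threshold             # assumed value, not from the patent
        self.cluster_table = defaultdict(set)  # signature -> cluster identifiers
        self.centroids = {}                    # cluster identifier -> centroid

    def add(self, vec):
        """Assign one document vector and return its cluster identifier."""
        sig = lsh_signature(vec, self.hyperplanes)
        candidates = self.cluster_table[sig]   # limited set of candidate clusters
        best = min(candidates,
                   key=lambda cid: cosine_distance(vec, self.centroids[cid]),
                   default=None)
        if best is not None and \
                cosine_distance(vec, self.centroids[best]) < self.threshold:
            return best
        new_id = len(self.centroids)           # no close candidate: new cluster
        self.centroids[new_id] = list(vec)
        self.cluster_table[sig].add(new_id)
        return new_id
```

Only the centroids that share a hash bucket with the document are compared against it, which is what keeps the per-document cost independent of the total number of clusters in this sketch.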

24 Claims

1. A method of clustering a plurality of documents from a data stream comprising:
- generating a feature vector for a document in the plurality of documents;
  
  applying a locality sensitive hashing function to the feature vector;
  
  retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector;
  
  determining a distance between the feature vector of the document and each of the cluster centroids; and
  
  assigning the document to a cluster based on the determined distances.
- - 2. The method of claim 1 wherein the feature vector is $\vec{d}=[d_1, \ldots, d_T]$ and applying a locality sensitive hashing function to the feature vector comprises:
    - calculating entries in a hash value
  - 3. The method of claim 1 wherein applying a locality sensitive hashing function to the feature vector comprises:
    - hashing the feature vector with a locality sensitive hashing function comprising a series of hashing functions;
      
      determining a set of values similar to a set of results of the hashing of the feature vector; and
      
      determining a set of cluster identifiers based on the determined set of values.
  - 4. The method of claim 1 wherein retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector further comprises:
    - selecting a group of cluster centroids based on a size of clusters associated with the cluster centroids.
  - 5. The method of claim 1 wherein assigning the document to a cluster based on the determined distances comprises:
    - if the determined distances between the feature vector of the document and a plurality of cluster centroids are below a predetermined threshold;
      
      assigning the document to clusters corresponding to that plurality of cluster centroids;
      
      updating the cluster centroids based on the feature vector of the document; and
      
      updating the relative ages of the cluster centroids.
  - 6. The method of claim 1 wherein assigning the document to a cluster based on the determined distances comprises:
    - if a determined distance between the feature vector of the document and a particular cluster centroid is below a predetermined threshold and below the determined distances between the feature vector of the document and the other centroids;
      
      assigning the document to a cluster associated with the particular cluster centroid;
      
      updating the particular cluster centroid based on the feature vector of the document; and
      
      updating the relative age of the particular cluster centroid.
  - 7. The method of claim 1 wherein assigning the document to a cluster based on the determined distances comprises:
    - if none of the determined distances between the feature vector of the document and the cluster centroids are below a predetermined threshold;
      
      assigning the document to a new cluster;
      
      designating the feature vector of the document as a cluster centroid for the new cluster; and
      
      assigning a relative age to the cluster centroid of the new cluster.
  - 8. The method of claim 1 wherein retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector comprises:
    - retrieving a set of cluster identifiers from a cluster table, the cluster identifiers each indicative of a respective cluster centroid; and
      
      retrieving the cluster centroids corresponding to the retrieved cluster identifiers from a memory.
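The three assignment cases in claims 5 to 7 can be sketched as a single decision function. This is a hedged illustration only: the Euclidean distance, the running-mean centroid update, the use of a `last_updated` time stamp as the "relative age", and the names `Cluster` and `assign` are all assumptions, since the claims do not fix any of them.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Cluster:
    def __init__(self, centroid, now):
        self.centroid = list(centroid)
        self.size = 1
        self.last_updated = now   # hypothetical stand-in for "relative age"

    def absorb(self, vec, now):
        """Running-mean centroid update (an assumed update rule)."""
        self.size += 1
        self.centroid = [c + (x - c) / self.size
                         for c, x in zip(self.centroid, vec)]
        self.last_updated = now

def assign(vec, candidates, clusters, threshold, now, multi=False):
    """Return the ids of the cluster(s) the document was assigned to.

    candidates: cluster ids retrieved through the hash lookup
    clusters:   dict id -> Cluster, mutated in place
    """
    dists = {cid: euclidean(vec, clusters[cid].centroid) for cid in candidates}
    close = [cid for cid, d in dists.items() if d < threshold]
    if not close:
        # Claim 7 case: nothing within the threshold, so open a new cluster.
        new_id = max(clusters, default=-1) + 1
        clusters[new_id] = Cluster(vec, now)
        return [new_id]
    # Claim 5 case (all close clusters) or claim 6 case (nearest one only).
    chosen = close if multi else [min(close, key=dists.get)]
    for cid in chosen:
        clusters[cid].absorb(vec, now)
    return chosen
```

With `multi=False` a document joins only its nearest candidate; with `multi=True` it joins every candidate under the threshold, which corresponds to overlapping clusters.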

9. An apparatus for clustering a plurality of documents from a data stream comprising:
- means for generating a feature vector for a document in the plurality of documents;
  
  means for applying a locality sensitive hashing function to the feature vector;
  
  means for retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector;
  
  means for determining a distance between the feature vector of the document and each of the cluster centroids; and
  
  means for assigning the document to a cluster based on the determined distances.
- - 10. The apparatus of claim 9 wherein the feature vector is $\vec{d}=[d_1, \ldots, d_T]$ and the means for applying a locality sensitive hashing function to the feature vector comprises:
    - means for calculating entries in a hash value
  - 11. The apparatus of claim 9 wherein the means for applying a locality sensitive hashing function to the feature vector comprises:
    - means for hashing the feature vector with a locality sensitive hashing function comprising a series of hashing functions;
      
      means for determining a set of values similar to a set of results of the hashing of the feature vector; and
      
      means for determining a set of cluster identifiers based on the determined set of values.
  - 12. The apparatus of claim 9 wherein the means for retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector further comprises:
    - means for selecting a group of cluster centroids based on a size of clusters associated with the cluster centroids.
  - 13. The apparatus of claim 9 wherein the means for assigning the document to a cluster based on the determined distances comprises:
    - means for assigning the document to clusters corresponding to that plurality of cluster centroids;
      
      means for updating the cluster centroids based on the feature vector of the document; and
      
      means for updating the relative ages of the cluster centroids if the determined distances between the feature vector of the document and a plurality of cluster centroids are below a predetermined threshold.
  - 14. The apparatus of claim 9 wherein the means for assigning the document to a cluster based on the determined distances comprises:
    - means for assigning the document to a cluster associated with the particular cluster centroid;
      
      means for updating the particular cluster centroid based on the feature vector of the document; and
      
      means for updating the relative age of the particular cluster centroid if a determined distance between the feature vector of the document and a particular cluster centroid is below a predetermined threshold and below the determined distances between the feature vector of the document and the other centroids.
  - 15. The apparatus of claim 9 wherein the means for assigning the document to a cluster based on the determined distances comprises:
    - means for assigning the document to a new cluster;
      
      means for designating the feature vector of the document as a cluster centroid for the new cluster; and
      
      means for assigning a relative age to the cluster centroid of the new cluster if none of the determined distances between the feature vector of the document and the cluster centroids are below a predetermined threshold.
  - 16. The apparatus of claim 9 wherein the means for retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector comprises:
    - means for retrieving a set of cluster identifiers from a cluster table, the cluster identifiers each indicative of a respective cluster centroid; and
      
      means for retrieving the cluster centroids corresponding to the retrieved cluster identifiers from a memory.
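The two-step retrieval of claim 16 (hash result to cluster identifiers, identifiers to centroids) and the size-based selection of claim 12 might look like the following sketch. The dictionary layout, the "largest clusters first" policy, and the `max_candidates` parameter are assumptions made for illustration; the claims only state that selection is based on cluster size.

```python
def retrieve_centroids(hash_keys, cluster_table, centroid_store, cluster_sizes,
                       max_candidates=None):
    """Two-step lookup: hash keys -> cluster ids -> centroids.

    cluster_table:  dict hash key -> set of cluster ids
    centroid_store: dict cluster id -> centroid vector (the "memory")
    cluster_sizes:  dict cluster id -> number of documents in the cluster
    """
    ids = set()
    for key in hash_keys:
        ids |= cluster_table.get(key, set())   # unknown keys contribute nothing
    # Assumed selection policy: keep the largest clusters first,
    # breaking ties by cluster id so the result is deterministic.
    ranked = sorted(ids, key=lambda cid: (-cluster_sizes[cid], cid))
    if max_candidates is not None:
        ranked = ranked[:max_candidates]
    return {cid: centroid_store[cid] for cid in ranked}
```

Keeping identifiers in the table and centroids in a separate store means a centroid can be updated in one place even when its cluster is reachable from several hash keys.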

17. A machine readable medium having program instructions stored thereon, the instructions capable of execution by a processor and defining the steps of:
- generating a feature vector for a document in a plurality of documents;
  
  applying a locality sensitive hashing function to the feature vector;
  
  retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector;
  
  determining a distance between the feature vector of the document and each of the cluster centroids; and
  
  assigning the document to a cluster based on the determined distances.
- - 18. The machine readable medium of claim 17, wherein the feature vector is $\vec{d}=[d_1, \ldots, d_T]$ and the instructions for applying a locality sensitive hashing function to the feature vector further define the step of:
    - calculating entries in a hash value
  - 19. The machine readable medium of claim 17, wherein the instructions for applying a locality sensitive hashing function to the feature vector further define the steps of:
    - hashing the feature vector with a locality sensitive hashing function comprising a series of hashing functions;
      
      determining a set of values similar to a set of results of the hashing of the feature vector; and
      
      determining a set of cluster identifiers based on the determined set of values.
  - 20. The machine readable medium of claim 17, wherein the instructions for retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector further define the step of:
    - selecting a group of cluster centroids based on a size of clusters associated with the cluster centroids.
  - 21. The machine readable medium of claim 17, wherein the instructions for assigning the document to a cluster based on the determined distances further define the steps of:
    - if the determined distances between the feature vector of the document and a plurality of cluster centroids are below a predetermined threshold;
      
      assigning the document to clusters corresponding to that plurality of cluster centroids;
      
      updating the cluster centroids based on the feature vector of the document; and
      
      updating the relative ages of the cluster centroids.
  - 22. The machine readable medium of claim 17, wherein the instructions for assigning the document to a cluster based on the determined distances further define the steps of:
    - if a determined distance between the feature vector of the document and a particular cluster centroid is below a predetermined threshold and below the determined distances between the feature vector of the document and the other centroids;
      
      assigning the document to a cluster associated with the particular cluster centroid;
      
      updating the particular cluster centroid based on the feature vector of the document; and
      
      updating the relative age of the particular cluster centroid.
  - 23. The machine readable medium of claim 17, wherein the instructions for assigning the document to a cluster based on the determined distances further define the steps of:
    - if none of the determined distances between the feature vector of the document and the cluster centroids are below a predetermined threshold;
      
      assigning the document to a new cluster;
      
      designating the feature vector of the document as a cluster centroid for the new cluster; and
      
      assigning a relative age to the cluster centroid of the new cluster.
  - 24. The machine readable medium of claim 17, wherein the instructions for retrieving a set of cluster centroids based on a result of the applied locality sensitive hashing function of the feature vector further define the steps of:
    - retrieving a set of cluster identifiers from a cluster table, the cluster identifiers each indicative of a respective cluster centroid; and
      
      retrieving the cluster centroids corresponding to the retrieved cluster identifiers from a memory.
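Claim 19 describes hashing with a series of hashing functions and then looking up values similar to the hash result. One possible reading, sketched below, treats each function as producing one bit and treats "similar" as "at most one bit different". Both choices, the random-hyperplane functions, and the names `make_hash_series`, `similar_values`, and `candidate_ids` are assumptions and are not taken from the patent text.

```python
import random

def make_hash_series(num_funcs, dim, seed=0):
    """Build a series of one-bit hash functions (assumed: random hyperplanes)."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(num_funcs)]
    return [lambda v, p=p: int(sum(a * b for a, b in zip(p, v)) >= 0.0)
            for p in planes]

def hash_vector(vec, series):
    """Apply every function in the series; the tuple of bits is the hash value."""
    return tuple(h(vec) for h in series)

def similar_values(sig):
    """The signature itself plus every signature exactly one bit flip away."""
    probes = [sig]
    for i in range(len(sig)):
        probes.append(sig[:i] + (1 - sig[i],) + sig[i + 1:])
    return probes

def candidate_ids(vec, series, cluster_table):
    """Union of the cluster ids stored under the hash value and its neighbours."""
    ids = set()
    for probe in similar_values(hash_vector(vec, series)):
        ids |= cluster_table.get(probe, set())
    return ids
```

Probing neighbouring hash values trades a few extra table lookups for a lower chance of missing a nearby cluster whose centroid fell just across one hyperplane.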

Current Assignee: Siemens Corp. (Siemens AG)
Original Assignee: Siemens Corp. (Siemens AG)
Inventors: Glomann, Bernhard; Neubauer, Claus; Moerchen, Fabian; Brinker, Klaus
Primary Examiner(s): Vincent, David R
Application Number: US 12/072,179
Publication Number: US 20080205774A1
Time in Patent Office: 932 Days
Field of Search: 706/20, 706/45, 382/224-228, 702/81, 702/82, 702/183
US Class Current: 706/45
CPC Class Codes: G06F 16/353 into predefined classes
