METHOD AND APPARATUS FOR DOCUMENT CLUSTERING AND DOCUMENT SKETCHING

US 20070005589A1
Filed: 06/29/2006
Published: 01/04/2007
Est. Priority Date: 07/01/2005
Status: Active Grant



Abstract

A first embodiment of the invention provides a system that automatically classifies documents in a collection into clusters based on the similarities between documents, that automatically classifies new documents into the right clusters, and that may change the number or parameters of clusters under various circumstances. A second embodiment of the invention provides a technique for comparing two documents, in which a fingerprint or sketch of each document is computed. In particular, this embodiment of the invention uses a specific algorithm to compute the document's fingerprint. One embodiment uses a sentence in the document as a logical delimiter or window from which significant words are extracted and, thereafter, a hash is computed of all pair-wise permutations. Words are extracted based on their weight in the document, which can be computed using measures such as term frequency and the inverse document frequency.
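
The word weighting mentioned in the abstract can be illustrated with a minimal sketch that scores each word by term frequency times inverse document frequency. The function name `tfidf_weights`, the pre-tokenized input, and the smoothed logarithmic IDF formula are assumptions made for illustration; the abstract names the measures but does not specify how they are combined.

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, corpus_tokens):
    """Return {word: tf * idf} for one document (illustrative formula only)."""
    n_docs = len(corpus_tokens)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    # Smoothed IDF (an assumption) avoids division by zero for unseen words.
    return {
        word: (count / total) * math.log((1 + n_docs) / (1 + df[word]))
        for word, count in tf.items()
    }

corpus = [["tree", "metric"], ["tree", "sketch"]]
weights = tfidf_weights(corpus[0], corpus)
```

Under this formula a word that appears in every document receives zero weight, so words that are distinctive for a document rank highest when significant words are selected.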


22 Claims

1. An automatic document classification apparatus, comprising:
- computer implemented means for generating document sketches;
  
  computer implemented means for automatically classifying a collection of documents into clusters based on a distance metric, wherein distance between any two documents in a cluster is smaller than the distance between documents across clusters; and
  
  computer implemented means for classifying a new document into an appropriate document cluster.
  - 2. The apparatus of claim 1, further comprising:
    - computer implemented means for organizing said clusters hierarchically by approximating said distance metric by a tree metric.
  - 3. The apparatus of claim 2, further comprising:
    - computer implemented means for generating a taxonomy by using a parameter that sets a threshold on cohesiveness of a cluster, wherein cohesiveness of a cluster comprises a largest distance between any two documents in said cluster.
  - 4. The apparatus of claim 3 wherein, based on said cohesiveness, nodes in a tree can be merged to form bigger clusters having larger diameters, as long as said cohesiveness threshold is not violated.
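
The cohesiveness test in claims 3 and 4 can be sketched as follows. This is a minimal illustration, assuming a caller-supplied distance function `dist` and clusters held as plain lists; the names `cohesiveness` and `can_merge` are hypothetical and do not come from the patent.

```python
from itertools import combinations

def cohesiveness(cluster, dist):
    """Largest distance between any two documents in the cluster (its diameter)."""
    return max((dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)

def can_merge(cluster_a, cluster_b, dist, threshold):
    """True if the merged cluster would still respect the cohesiveness threshold."""
    merged = list(cluster_a) + list(cluster_b)
    return cohesiveness(merged, dist) <= threshold

# Toy usage: "documents" are numbers and distance is their absolute difference.
toy_dist = lambda a, b: abs(a - b)
ok = can_merge([1, 2], [3], toy_dist, threshold=2)
too_wide = can_merge([1, 2], [9], toy_dist, threshold=2)
```

Merging proceeds only while the diameter of the combined cluster stays at or below the threshold, which is one way to read "as long as said cohesiveness threshold is not violated".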

5. A method for computing hierarchical clustering of documents, comprising the steps of:
- computing a sketch for every document in a collection;
  
  using said sketch to compute similarity between all document pairs in said collection;
  
  storing a result of said computation in a distance matrix;
  
  generating a metric based on nearest neighbors of each entry in said matrix;
  
  computing similarity as a function of a symmetric difference between sets of neighbors listed for any two documents in said collection;
  
  approximating a metric by a tree metric.
  - 6. The method of claim 5, wherein said distance matrix comprises a sparse matrix.
  - 7. The method of claim 5, wherein the number of neighbors comprises a parameter that can be user adjusted.
  - 8. The method of claim 5, wherein said tree metric is determined by using Bartal's approximation algorithm.
  - 9. The method of claim 5, wherein the size of each cluster and depth/width of hierarchical clusters is controlled by the number of nearest neighbors included in the metric computation.
  - 22. The method of claim 5, wherein the hierarchy implicit in the tree metric is used to provide an efficient navigation and traversal of documents, given a query document or document cluster, at different levels of granularity; and wherein the traversal itself is based on an order that is generated by the nearest neighbors of a particular node, comprising any of a document or a document cluster, in the hierarchy.
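
The neighbor-based steps of claim 5 can be illustrated with a small sketch. It assumes a dense distance matrix given as a list of lists, counts each document as its own nearest neighbor, and uses the size of the symmetric difference directly as the derived distance; these choices and the function names are assumptions, since the claim only says similarity is "a function of" the symmetric difference. The tree-metric approximation of the final step is not shown.

```python
def nearest_neighbors(dist_matrix, k):
    """For each document, return the set of indices of its k nearest documents.

    Assumption: a document counts as its own neighbor (distance 0).
    """
    n = len(dist_matrix)
    neighbors = []
    for i in range(n):
        ranked = sorted(range(n), key=lambda j: dist_matrix[i][j])
        neighbors.append(set(ranked[:k]))
    return neighbors

def neighbor_distance(neighbors, i, j):
    """Size of the symmetric difference between the neighbor sets of i and j."""
    return len(neighbors[i] ^ neighbors[j])

# Toy distance matrix: documents 0 and 1 are close, as are documents 2 and 3.
D = [
    [0, 1, 9, 10],
    [1, 0, 9, 10],
    [9, 9, 0, 1],
    [10, 10, 1, 0],
]
nbrs = nearest_neighbors(D, k=2)
same_group = neighbor_distance(nbrs, 0, 1)
different_groups = neighbor_distance(nbrs, 0, 2)
```

Documents that share most of their neighbors end up close under the derived distance, and the parameter `k` corresponds to the user-adjustable number of neighbors in claim 7.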

10. A method for computing the sketch for a document, comprising the steps of:
- using a sentence in a document as a logical delimiter or window from which significant words are extracted;
  
  extracting said words based on their weight in the document;
  
  lexicographically sorting words in a phrase before computing a sketch;
  
  computing a hash of all pair-wise permutations for said significant words;
  
  sorting said computed hashes; and
  
  choosing the top-m hashes to represent the document.
  - 11. The method of claim 10, wherein typical values of m are 256 to 512 for large documents (>1M).
  - 12. The method of claim 10, wherein weight is computed using measures comprising any of term frequency and inverse document frequency.
  - 13. The method of claim 10, further comprising the step of:
    - transporting sketches using Bloom filters.
  - 14. The method of claim 10, further comprising the step of:
    - computing a sketch of a hierarchy or a taxonomy given the sketches of the documents in the taxonomy.
  - 15. The method of claim 10, further comprising the step of:
    - performing a selection based associative search of documents by allowing an end user to select a portion of a document and then identifying documents containing similar information.
  - 16. The method of claim 10, further comprising the step of:
    - performing automatic taxonomy generation and clustering of documents using a tree metric approach to maintain original distances between documents while at the same time organizing said documents in a hierarchy.
  - 17. The method of claim 16, further comprising the step of:
    - performing efficient extraction of taxonomies from said tree metric.
  - 18. The method of claim 10, further comprising the step of:
    - performing on-line classification, wherein documents arrive at different times and are indexed in an existing taxonomy.
  - 19. The method of claim 10, further comprising the steps of:
    - providing a compact representation of a sketch; and
      
      computing similarities for associative search.
  - 20. The method of claim 10, further comprising the step of:
    - in a distributed environment for collaboratively shared documents, using a sketch for efficient inter-repository distribution, communication, and retrieval of information across networks, wherein a whole document or a collection need not be transported or queried against, wherein said sketch substitutes for a document in all supported computations.
  - 21. The method of claim 20, wherein an efficient associative search offers similar documents when a requested document is not available.
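
The sketch computation of claim 10 might look roughly like the following. This is one possible reading under stated assumptions, not the patented implementation: sentences are split on terminal punctuation, `weights` is a caller-supplied word-weight mapping, `words_per_sentence` is a hypothetical cutoff for "significant" words, SHA-1 truncated to 64 bits stands in for the unspecified hash, lexicographic sorting is used to give each word pair a single canonical order, and "top-m" is taken to mean the first m hashes in ascending order.

```python
import hashlib
import re
from itertools import combinations

def sketch(text, weights, words_per_sentence=4, m=256):
    """Return up to m sorted hash values representing `text` (illustrative only)."""
    hashes = set()
    # Each sentence acts as a window from which significant words are drawn.
    for sentence in re.split(r"[.!?]+", text):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        # Keep the highest-weight words; ties are broken alphabetically.
        ranked = sorted(words, key=lambda w: (-weights.get(w, 0.0), w))
        significant = ranked[:words_per_sentence]
        # Sort lexicographically so each pair is hashed in one canonical order.
        for a, b in combinations(sorted(significant), 2):
            digest = hashlib.sha1(f"{a} {b}".encode("utf-8")).digest()
            hashes.add(int.from_bytes(digest[:8], "big"))
    # Sort the hashes and keep m of them to represent the document.
    return sorted(hashes)[:m]

doc = "The cat sat on the mat. The dog sat on the log."
doc_sketch = sketch(doc, weights={}, m=8)
```

Two documents can then be compared through their sketches alone, for example by counting shared hash values, without either full text being present.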


Current Assignee: Ebrary
Original Assignee: Ebrary
Inventors: GOLLAPUDI, SREENIVAS
Granted Patent: US 7,433,869 B2
US Class Current: 1/1
CPC Class Codes

G06F 16/313   Selection or weighting of t...

G06F 16/355   Class or cluster creation o...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99942   Manipulating data structure...

Y10S 707/99943   Generating database or data...
