Method and apparatus for document clustering and document sketching

US 8,255,397 B2
Filed: 08/26/2008
Issued: 08/28/2012
Est. Priority Date: 07/01/2005
Status: Active Grant



Abstract

A first embodiment of the invention provides a system that automatically classifies documents in a collection into clusters based on the similarities between documents, that automatically classifies new documents into the right clusters, and that may change the number or parameters of clusters under various circumstances. A second embodiment of the invention provides a technique for comparing two documents, in which a fingerprint or sketch of each document is computed. In particular, this embodiment of the invention uses a specific algorithm to compute the document's fingerprint. One embodiment uses a sentence in the document as a logical delimiter or window from which significant words are extracted and, thereafter, a hash is computed of all pair-wise permutations. Words are extracted based on their weight in the document, which can be computed using measures such as term frequency and the inverse document frequency.
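The sketching step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the regex sentence splitter, the smoothed TF-IDF formula, the `top_k` cutoff for "significant" words, and the truncated SHA-1 hash are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import hashlib
import math
import re
from collections import Counter
from itertools import permutations


def split_sentences(text):
    # Assumption: sentences end at '.', '!' or '?'.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tf_idf_weights(doc_tokens, corpus_token_sets):
    # Term frequency in this document times a smoothed inverse document
    # frequency over the corpus (one common variant, assumed here).
    tf = Counter(doc_tokens)
    n_docs = len(corpus_token_sets)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for tokens in corpus_token_sets if word in tokens)
        weights[word] = count * (math.log((1 + n_docs) / (1 + df)) + 1.0)
    return weights


def sketch(text, corpus_token_sets, top_k=3):
    # Each sentence is a window: keep its top_k heaviest words, then hash
    # every ordered pair (pair-wise permutation) of those words.
    sentences = split_sentences(text)
    doc_tokens = [t for s in sentences for t in tokenize(s)]
    weights = tf_idf_weights(doc_tokens, corpus_token_sets)
    hashes = set()
    for s in sentences:
        words = sorted(set(tokenize(s)), key=lambda w: (-weights[w], w))[:top_k]
        for a, b in permutations(words, 2):
            digest = hashlib.sha1(f"{a} {b}".encode("utf-8")).digest()
            hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes
```

A call such as `sketch(text, [set(tokenize(d)) for d in corpus])` returns a set of integers that can stand in for the document when comparing it with others. Capping each sentence at `top_k` words bounds the sketch at `top_k * (top_k - 1)` hashes per sentence.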

4 Claims

1. An automatic document classification apparatus, comprising:
- a system comprising a processor configured for automatically generating document sketches by using a sentence in a document as a logical delimiter from which significant words are extracted and, thereafter, for each sentence computing a hash of all pair-wise permutations of said significant words which are extracted from said sentence, wherein said significant words are extracted based on a word's weight in the document, which is computed using measures comprising either of term frequency and inverse document frequency;
  
  said processor configured for using said document sketches for automatically classifying a collection of documents into clusters based on a distance metric, wherein distance between any two documents in a cluster is smaller than the distance between documents across clusters; and
  
  said processor configured for automatically classifying a new document into an appropriate document cluster based upon said distance metric;
  
  said processor applying said automatic classification of said new document into an appropriate document cluster based upon said distance metric to effect at least one of;
  
  user selection of a section of a document to identify documents containing similar information;
  
  automatic taxonomy generation and clustering of documents;
  
  inter-repository distribution, communication, and retrieval of information across networks, wherein said document sketch is substituted for a document; and
  
  constructing a traversal order of a document set from nearest neighbors of a query document in a metric space.

2. The apparatus of claim 1, further comprising:
- said processor configured for organizing said clusters hierarchically by approximating said distance metric by a tree metric.

3. The apparatus of claim 2, further comprising:
- said processor configured for generating a taxonomy by using a parameter that sets a threshold on cohesiveness of a cluster, wherein cohesiveness of a cluster comprises a largest distance between any two documents in said cluster.

4. The apparatus of claim 3 wherein, based on said cohesiveness, nodes in the taxonomy comprise clusters that can be merged to form bigger clusters having larger diameters, as long as said cohesiveness threshold is not violated.
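
The cohesiveness rule in claims 3 and 4 (merge clusters only while the largest pairwise distance stays under a threshold) can be sketched as a simple agglomerative loop. This is an illustration under stated assumptions, not the claimed apparatus: the claims do not name the distance metric, so Jaccard distance over sketch sets is assumed, the greedy merge order is a choice made here, the tree-metric approximation of claim 2 is not modeled, and all function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    # Assumed metric: 1 - |A ∩ B| / |A ∪ B| over two sketch sets.
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def diameter(cluster, sketches):
    # Cohesiveness per claim 3: largest distance between any two members.
    return max(
        (jaccard_distance(sketches[i], sketches[j])
         for i, j in combinations(cluster, 2)),
        default=0.0,
    )


def cluster_by_cohesiveness(sketches, threshold):
    # Start with singletons; repeatedly merge the pair of clusters whose
    # union has the smallest diameter, while that diameter <= threshold.
    clusters = [[i] for i in range(len(sketches))]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            d = diameter(clusters[x] + clusters[y], sketches)
            if d <= threshold and (best is None or d < best[0]):
                best = (d, x, y)
        if best is None:
            break  # any further merge would violate the threshold
        _, x, y = best
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters


def assign(new_sketch, clusters, sketches):
    # Place a new document in the cluster whose farthest member is nearest.
    return min(
        range(len(clusters)),
        key=lambda k: max(
            jaccard_distance(new_sketch, sketches[i]) for i in clusters[k]
        ),
    )
```

Raising `threshold` permits merges with larger diameters and so yields fewer, broader clusters; recording the merges in order gives one possible hierarchy of taxonomy nodes.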

Current Assignee: Ebrary
Original Assignee: Ebrary
Inventors: Gollapudi, Sreenivas
Primary Examiner(s): Pulliam, Christyann
Assistant Examiner(s): Khoshnoodi, Fariborz
Application Number: US12/198,841
Publication Number: US 20080319941A1
Time in Patent Office: 1,463 Days
Field of Search: 707/737, 707/711, 707/723, 707/727, 707/728, 707/729, 707/730, 707/750, 707/754, 707/812, 707/736
US Class Current: 707/736
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
G06F 16/355   Class or cluster creation o...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99942   Manipulating data structure...
Y10S 707/99943   Generating database or data...
