Method and apparatus for document clustering and document sketching

  • US 8,255,397 B2
  • Filed: 08/26/2008
  • Issued: 08/28/2012
  • Est. Priority Date: 07/01/2005
  • Status: Active Grant
First Claim

1. An automatic document classification apparatus, comprising:

  • a system comprising a processor configured for automatically generating document sketches by using a sentence in a document as a logical delimiter from which significant words are extracted and, thereafter, for each sentence computing a hash of all pair-wise permutations of said significant words which are extracted from said sentence, wherein said significant words are extracted based on a word's weight in the document, which is computed using measures comprising either of term frequency and inverse document frequency;

    said processor configured for using said document sketches for automatically classifying a collection of documents into clusters based on a distance metric, wherein distance between any two documents in a cluster is smaller than the distance between documents across clusters; and

    said processor configured for automatically classifying a new document into an appropriate document cluster based upon said distance metric;

    said processor applying said automatic classification of said new document into an appropriate document cluster based upon said distance metric to effect at least one of:

    user selection of a section of a document to identify documents containing similar information;

    automatic taxonomy generation and clustering of documents;

    inter-repository distribution, communication, and retrieval of information across networks, wherein said document sketch is substituted for a document; and

    constructing a traversal order of a document set from nearest neighbors of a query document in a metric space.
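
The sketching and classification steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. Several choices are assumptions made only for illustration: word weight is plain term frequency (inverse document frequency is omitted), the stop-word list and the `top_k` cutoff are arbitrary, the hash is a truncated SHA-1, the distance metric is Jaccard distance over the sets of pair hashes, and a new document joins the cluster holding its nearest neighbor. All function names are hypothetical.

```python
import hashlib
import re
from collections import Counter
from itertools import permutations

# Assumption: a tiny illustrative stop-word list; the claim does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def split_sentences(text):
    """Use sentence boundaries as the logical delimiter (naive punctuation split)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphanumeric tokens with stop words dropped."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower())
            if w not in STOP_WORDS]


def significant_words(text, top_k=5):
    """Pick the top_k words by term frequency (assumed weighting; ties break alphabetically)."""
    counts = Counter(tokenize(text))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:top_k]}


def pair_hash(first, second):
    """Hash one ordered word pair to a 64-bit integer (truncated SHA-1, an assumption)."""
    digest = hashlib.sha1(f"{first}\x00{second}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def document_sketch(text, top_k=5):
    """Return the set of hashes of all ordered pairs of significant words, per sentence."""
    keep = significant_words(text, top_k)
    sketch = set()
    for sentence in split_sentences(text):
        words = []
        for word in tokenize(sentence):
            if word in keep and word not in words:
                words.append(word)
        for first, second in permutations(words, 2):
            sketch.add(pair_hash(first, second))
    return sketch


def sketch_distance(sketch_a, sketch_b):
    """Jaccard distance between two sketches (assumed metric)."""
    if not sketch_a and not sketch_b:
        return 0.0
    return 1.0 - len(sketch_a & sketch_b) / len(sketch_a | sketch_b)


def classify(new_sketch, clusters):
    """Assign a sketch to the cluster label that contains its nearest neighbor."""
    return min(
        clusters,
        key=lambda label: min(sketch_distance(new_sketch, s) for s in clusters[label]),
    )
```

With this in place, `classify(document_sketch(new_text), {"label": [sketch, ...]})` returns the label of the closest cluster. Because only the sketches are compared, they can stand in for the full documents when information is exchanged between repositories, as the claim describes.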
