METHOD AND APPARATUS FOR DOCUMENT CLUSTERING AND DOCUMENT SKETCHING
First Claim
1. An automatic document classification apparatus, comprising:
- computer implemented means for generating document sketches;
computer implemented means for automatically classifying a collection of documents into clusters based on a distance metric, wherein distance between any two documents in a cluster is smaller than the distance between documents across clusters; and
computer implemented means for classifying a new document into an appropriate document cluster.
7 Assignments
0 Petitions
Accused Products
Abstract
A first embodiment of the invention provides a system that automatically classifies documents in a collection into clusters based on the similarities between documents, that automatically classifies new documents into the right clusters, and that may change the number or parameters of clusters under various circumstances. A second embodiment of the invention provides a technique for comparing two documents, in which a fingerprint or sketch of each document is computed. In particular, this embodiment of the invention uses a specific algorithm to compute the document'"'"'s fingerprint, One embodiment uses a sentence in the document as a logical delimiter or window from which significant words are extracted and, thereafter, a hash is computed of all pair-wise permutations. Words are extracted based on their weight in the document, which can be computed using measures such as term frequency and the inverse document frequency.
-
Citations
22 Claims
-
1. An automatic document classification apparatus, comprising:
-
computer implemented means for generating document sketches;
computer implemented means for automatically classifying a collection of documents into clusters based on a distance metric, wherein distance between any two documents in a cluster is smaller than the distance between documents across clusters; and
computer implemented means for classifying a new document into an appropriate document cluster. - View Dependent Claims (2, 3, 4)
-
-
5. A method for computing hierarchical clustering of documents, comprising the steps of:
-
computing a sketch for every document in a collection;
using said sketch to compute similarity between all document pairs in said collection;
storing a result of said computation in a distance matrix;
generating a metric based on nearest neighbors of each entry in said matrix;
computing similarity as a function of a symmetric difference between sets of neighbors listed for any two documents in said collection;
approximating a metric by a tree metric - View Dependent Claims (6, 7, 8, 9, 22)
-
-
10. A method for computing the sketch for a document, comprising the steps of:
-
using a sentence in a document as a logical delimiter or window from which significant words are extracted;
extracting said words based on their weight in the document;
lexicographically sorting words in a phrase before computing a sketch;
computing a hash of all pair-wise permutations for said significant words;
sorting said computed hashes; and
choosing the top-m hashes to represent the document. - View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
-
Specification