DOCUMENT SIMILARITY DETECTION AND CLASSIFICATION SYSTEM

  • US 20050060643A1
  • Filed: 08/12/2004
  • Published: 03/17/2005
  • Est. Priority Date: 08/25/2003
  • Status: Abandoned Application
First Claim

1. A method for automatically classifying unclassified documents, comprising the steps of:

    a. processing, on a first processing system, a plurality of sample documents to identify a plurality of sample document feature sets of potentially duplicated and significant sample document features, whereby each sample feature set is associated with one of said plurality of sample documents;

    b. electronically associating with each of said plurality of sample document feature sets a set of at least one manually selected document annotation values, whereby said document annotation values each represent a subjective classification of one of said plurality of sample documents with which said document annotation values are individually associated;

    c. electronically associating with each of said plurality of sample document feature sets a set of at least one manually selected document feature annotation values, whereby said document feature annotation values each represent a subjective classification of one of a plurality of sample document features with which said document feature annotation values are individually associated;

    d. processing, on a second processing system, an unclassified document to identify a set of potentially duplicated and significant unclassified document features;

    e. comparing, on said second processing system, said set of potentially duplicated and significant unclassified document features to each of said sample document feature sets, inclusive of said document annotation values and said document feature annotation values associated with each of said sample document feature sets;

    f. determining which of said plurality of sample document feature sets shares in common with any of the features comprising an unclassified document feature set a largest weighted quantity of features subjectively classified and annotated as significant, whereby a most significantly resembling sample document may be determined; and

    g. outputting a significant similarity measurement value and a classification value for said unclassified document according to a weighted ratio of matching significant features of said most significantly resembling sample document as compared to all of said significant features of said most significantly resembling sample document.
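
Steps e through g can be illustrated with a minimal sketch. It is not the application's implementation: the claim does not say how features are extracted or how weights are assigned, so the sketch assumes that features are plain strings and that each document feature annotation value is a non-negative numeric weight, with 0 meaning "not significant". The names Sample and classify are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One annotated sample document (hypothetical structure)."""
    name: str
    label: str                 # document annotation value (step b)
    weights: dict[str, float]  # feature -> significance weight (step c), assumed >= 0


def classify(features: set[str], samples: list[Sample]) -> tuple[str, float]:
    """Return (classification, similarity) for an unclassified feature set.

    Step f: choose the sample sharing the largest weighted quantity of
    significant features with the unclassified document.
    Step g: similarity is the weight of the matching significant features
    divided by the weight of all significant features of that sample.
    """
    if not samples:
        raise ValueError("at least one sample document is required")

    best = None
    best_shared = -1.0
    for sample in samples:
        # Weighted count of this sample's features that also occur in the
        # unclassified document; zero-weight features contribute nothing.
        shared = sum(w for f, w in sample.weights.items() if f in features)
        if shared > best_shared:
            best, best_shared = sample, shared

    total = sum(best.weights.values())
    similarity = best_shared / total if total else 0.0
    return best.label, similarity
```

For example, given a "spam" sample whose significant features carry a total weight of 4.0, an unclassified document matching features worth 3.0 would be reported as "spam" with a similarity of 0.75. Ties between samples are resolved here in favour of the first sample encountered, which is a choice made for the sketch rather than something the claim specifies.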
