DOCUMENT SIMILARITY DETECTION AND CLASSIFICATION SYSTEM
Abstract
A document similarity detection and classification system is presented. The system employs a case-based method of classifying electronically distributed documents in which content chunks of an unclassified document are compared to the sets of content chunks comprising each of a set of previously classified sample documents in order to determine a highest level of resemblance between an unclassified document and any of a set of previously classified documents. The sample documents have been manually reviewed and annotated to distinguish document classifications and to distinguish significant content chunks from insignificant content chunks. These annotations are used in the similarity comparison process. If a significant resemblance level exceeding a predetermined threshold is detected, the classification of the most significantly resembling sample document is assigned to the unclassified document. Sample documents may be acquired to build and maintain a repository of sample documents by detecting unclassified documents that are similar to other unclassified documents and subjecting at least some similar documents to a manual review and classification process. In a preferred embodiment the invention may be used to classify email messages in support of a message filtering or classification objective.
42 Claims
1. A method for automatically classifying unclassified documents, comprising the steps of:
a. processing, on a first processing system, a plurality of sample documents to identify a plurality of sample document feature sets of potentially duplicated and significant sample document features, whereby each sample feature set is associated with one of said plurality of sample documents;
b. electronically associating with each of said plurality of sample document feature sets a set of at least one manually selected document annotation values, whereby said document annotation values each represent a subjective classification of one of said plurality of sample documents with which said document annotation values are individually associated;
c. electronically associating with each of said plurality of sample document feature sets a set of at least one manually selected document feature annotation values, whereby said document feature annotation values each represent a subjective classification of one of a plurality of sample document features with which said document feature annotation values are individually associated;
d. processing, on a second processing system, an unclassified document to identify a set of potentially duplicated and significant unclassified document features;
e. comparing, on said second processing system, said set of potentially duplicated and significant unclassified document features to each of said sample document feature sets, inclusive of said document annotation values and said document feature annotation values associated with each of said sample document feature sets;
f. determining which of said plurality of sample document feature sets shares in common with any of the features comprising an unclassified document feature set a largest weighted quantity of features subjectively classified and annotated as significant, whereby a most significantly resembling sample document may be determined; and
g. outputting a significant similarity measurement value and a classification value for said unclassified document according to a weighted ratio of matching significant features of said most significantly resembling sample document as compared to all of said significant features of said most significantly resembling sample document.
Dependent claims: 2, 3, 4, 5, 39, 40, 41, 42
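Steps f and g of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `SampleDocument` type, the numeric weights, and the convention that a weight of 0.0 marks a feature annotated as insignificant are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleDocument:
    """Hypothetical record for one manually annotated sample document."""
    label: str             # document annotation value (the manual classification)
    feature_weights: dict  # feature text -> weight; 0.0 means "insignificant"


def weighted_overlap(sample, unclassified_features):
    """Total weight of the sample's significant features present in the unclassified document."""
    return sum(
        weight
        for feature, weight in sample.feature_weights.items()
        if weight > 0 and feature in unclassified_features
    )


def classify(unclassified_features, samples):
    """Return (similarity, label) for the most significantly resembling sample.

    Returns (0.0, None) when no sample shares a significant feature.
    """
    # Step f: pick the sample sharing the largest weighted quantity of significant features.
    best = max(samples, key=lambda s: weighted_overlap(s, unclassified_features), default=None)
    if best is None:
        return 0.0, None
    matched = weighted_overlap(best, unclassified_features)
    total = sum(w for w in best.feature_weights.values() if w > 0)
    if matched == 0 or total == 0:
        return 0.0, None
    # Step g: weighted ratio of matching significant features to all significant features.
    return matched / total, best.label


samples = [
    SampleDocument("spam", {"buy now": 2.0, "unsubscribe": 0.0, "cheap pills": 2.0}),
    SampleDocument("newsletter", {"weekly digest": 1.0, "unsubscribe": 0.0}),
]
result = classify({"buy now", "unsubscribe", "hello"}, samples)
```

In this example the shared "unsubscribe" feature contributes nothing because it carries zero weight, so only "buy now" counts toward the match with the "spam" sample.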
6. A method for automatically classifying unclassified documents, comprising the steps of:
a. registering, on a first processing system, each of said plurality of sample documents representative of at least one of a plurality of document classifications;
b. parsing each of said plurality of sample documents into at least one of a plurality of partial document content features according to a set of document parsing rules;
c. selectively decoding, removing and discarding from each of said sample documents, according to a set of document content decoding and removal rules, at least one of a plurality of said partial document content features, or portions of partial document content features, whereby any of said partial document content features that are considered insignificant for document classification purposes or are considered to be obfuscating content that exists to subvert said document classification process may be removed;
d. determining and recording, by a manual document review and electronic annotation process, at least one of a plurality of subjective classifications of each of said plurality of sample documents, whereby at least one of a plurality of subjective classification labels are associated with each of said sample documents;
e. determining and recording, by a manual document review and electronic annotation process, at least one of a plurality of subjective classifications of each of said plurality of partial document content features of each of said sample documents, whereby at least one of a plurality of subjective classification labels are associated with each of said sample document's partial content features;
f. storing for each annotated sample document, on said first processing system, an annotated sample document record, inclusive of said sample document's content, said set of partial document content features, a set of unique digests of each partial content feature, at least one of said document annotation values, at least one of said plurality of said document feature annotation values, and other document attribute data;
g. storing, on said second processing system, a copy of each of said annotated sample document records;
h. parsing, on a second processing system, an unclassified document into at least one of said plurality of partial document content features and selectively removing and discarding portions of said unclassified document's content in a manner consistent with steps 6b and 6c above;
i. querying said second processing system using said unclassified document's residual partial document content features or unique digests thereof and returning a list of all partially resembling sample documents which share in common at least one of a plurality of matching partial document content features with said unclassified document, subject to a requirement that any of said partial document content features that match are also subjectively classified and annotated as significant in any of said sample documents;
j. calculating a set of ratios of characters comprising said unclassified document's partial document content features that match said significant partial document content features contained in each of said partially resembling sample documents in said set of partially matching sample documents, as compared to a count of total characters comprising said significant partial document content features found in said partially resembling sample documents, resulting in a set of significant partial document content feature similarity scores;
k. comparing the highest of said scores to a predetermined document similarity threshold value; and
l. assigning said unclassified document said document similarity score and a classification value matching said subjective classification of said most closely resembling sample document if said document similarity score exceeds said predetermined threshold value, otherwise assigning said unclassified document a null or non-matching classification.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27
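Steps i through l of claim 6 (digest lookup, character-ratio scoring, and threshold comparison) might look like the sketch below. The `SampleIndex` class, its method names, and the use of SHA-256 as the "unique digest" are assumptions made for illustration; the claim does not name a digest function or a storage layout.

```python
import hashlib
from collections import defaultdict


def digest(feature: str) -> str:
    """Digest of one content feature (SHA-256 is an assumed choice)."""
    return hashlib.sha256(feature.encode("utf-8")).hexdigest()


class SampleIndex:
    """Hypothetical in-memory store of annotated sample document records."""

    def __init__(self):
        self._by_digest = defaultdict(set)  # digest -> ids of samples where it is significant
        self._significant = {}              # sample id -> {digest: character count}
        self._labels = {}                   # sample id -> manual classification

    def add(self, sample_id, label, features):
        """features: iterable of (text, is_significant) pairs from manual annotation."""
        significant = {digest(t): len(t) for t, keep in features if keep and t}
        self._significant[sample_id] = significant
        self._labels[sample_id] = label
        for d in significant:
            self._by_digest[d].add(sample_id)

    def classify(self, features, threshold):
        """Return (best score, label); label is None unless the score exceeds threshold."""
        digests = {digest(t) for t in features}
        # Step i: samples sharing at least one feature that is annotated as significant.
        candidates = set()
        for d in digests:
            candidates |= self._by_digest.get(d, set())
        best_score, best_id = 0.0, None
        for sample_id in candidates:
            significant = self._significant[sample_id]
            # Step j: matched significant characters over all significant characters.
            matched = sum(n for d, n in significant.items() if d in digests)
            score = matched / sum(significant.values())
            if score > best_score:
                best_score, best_id = score, sample_id
        # Steps k and l: compare the highest score to the threshold.
        if best_id is not None and best_score > threshold:
            return best_score, self._labels[best_id]
        return best_score, None


index = SampleIndex()
index.add("s1", "spam", [
    ("buy cheap pills today", True),       # 21 characters, significant
    ("click here to unsubscribe", False),  # annotated as insignificant
    ("limited offer", True),               # 13 characters, significant
])
score, label = index.classify(["buy cheap pills today", "hello friend"], threshold=0.5)
```

Indexing only the significant digests means a document that matches a sample solely on insignificant content never becomes a candidate, which mirrors the requirement at the end of step i.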
28. A method for automatically identifying in a document a set of potentially duplicated and significant document features, comprising the steps of:
a. parsing said document into at least one of a plurality of said partial document content features according to a set of document parsing rules;
b. selectively removing and discarding from said sample document, according to a set of document content removal rules, at least one of a plurality of said partial document content features, or portions of said partial document content features, that are considered insignificant for document classification purposes or are considered to be obfuscating content that exists to subvert said document classification process, whereby any remaining content may be considered potentially duplicated and significant.
Dependent claims: 29
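The two steps of claim 28 can be illustrated with a small sketch. The claim leaves the parsing rules and removal rules open, so everything concrete here is an assumption: splitting on blank lines, stripping markup tags and punctuation as "obfuscating" content, and discarding chunks shorter than the hypothetical `MIN_CHUNK_CHARS`.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")            # assumed removal rule: markup tags
NON_ALNUM_RE = re.compile(r"[^a-z0-9]+")   # assumed removal rule: punctuation and spacing noise
MIN_CHUNK_CHARS = 8                        # assumed cutoff for "insignificant" chunks


def extract_features(document: str) -> list:
    """Parse a document into content chunks and drop insignificant or obfuscating parts."""
    features = []
    # Step a: parse into partial content features (here, blank-line-separated paragraphs).
    for raw in re.split(r"\n\s*\n", document):
        # Step b: remove portions of each feature, then discard features that are too short.
        text = TAG_RE.sub(" ", raw).lower()
        text = NON_ALNUM_RE.sub(" ", text).strip()
        if len(text) >= MIN_CHUNK_CHARS:
            features.append(text)
    return features


message = "<b>Buy   CHEAP pills</b> today!\n\nhi\n\nClick   here --- to unsubscribe!!!"
features = extract_features(message)
```

Normalizing before comparison is what lets two copies of a message with different markup or padding produce identical features.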
30. A method of excluding from consideration in a document similarity measurement process semantically insignificant or obfuscating partial document content features contained within sample documents, comprising the steps of:
a. selecting and recording, by a manual document review and electronic annotation process, at least one of a plurality of subjective classification values of each of said plurality of partial document content features of said sample documents, wherein at least one of said plurality of subjective classification values are bound to a record of each of said sample documents' partial content features;
b. assigning a numerical weight of zero to any of said partial document content features which are labeled with a classification value indicating that said partial document content features are of a semantically insignificant or obfuscating content classification; and
c. including said zero-weighted classification values in said similarity measurement process steps that apply said weights to be assigned to each of said partial document content features comprising said sample documents.
31. A method of preventing the submission of a new sample document to a manual document review and annotation processing system when said new sample document is an exact or significantly partial duplicate of a previously submitted, reviewed and retained sample document, comprising the steps of:
a. parsing said new sample document into at least one of said plurality of partial document content features according to said set of document parsing rules;
b. selectively removing and discarding from said new sample document, according to said set of document content removal rules, at least one of a plurality of said partial document content features, or portions of said partial document content features, that are considered insignificant for document classification purposes or are considered to be obfuscating content that exists to subvert said document classification process;
c. querying said first processing system using said new sample document's residual partial document content features or unique digests thereof and returning a list of all partially resembling existing sample documents which share in common at least one of a plurality of matching partial document content features with said new sample document, subject to said requirement that any of said partial document content features that match are also subjectively classified and annotated as significant in any said existing sample documents;
d. calculating a set of said ratios of characters comprising said new sample document's partial document content features that match said significant partial document content features contained in each of said partially resembling existing sample documents, as compared to said total characters comprising said significant partial document content features found in said partially resembling sample documents, resulting in a set of significant partial document content feature similarity scores;
e. comparing the highest of said scores to said predetermined document similarity threshold value; and
f. accepting submission of said new sample document if said similarity score falls below a predetermined similarity score threshold value; and
g. discarding said new sample document if said similarity score equals or exceeds said predetermined similarity score threshold value, whereby said new sample document is excluded from said manual document review process due to its significant measured similarity to one of said plurality of existing sample documents.
32. A method of calculating a measure of similarity between two sets of partial document content features that adjusts for differences in relative length of partial document content features, comprising the steps of:
a. determining which of said set of partial document content features of a first document match any of said set of partial document content features of a second document, wherein said partial document content features are extracted from each of said documents according to the same method;
b. calculating a similarity score, wherein a similarity score is a ratio of said number of characters contained in matching partial document content features divided by said total number of characters in all of said partial document content features comprising said first document.
Dependent claims: 33, 34
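The length-adjusted score of claim 32 reduces to a short function. The sketch below assumes features are plain strings and that a match means exact string equality; the function name `char_similarity` is introduced here for illustration.

```python
def char_similarity(first_features, second_features):
    """Characters in the first document's matching features over all of its feature characters."""
    second = set(second_features)
    total = sum(len(f) for f in first_features)
    if total == 0:
        return 0.0
    matched = sum(len(f) for f in first_features if f in second)
    return matched / total


# One long feature matches and one short feature does not.
first = ["x" * 90, "y" * 10]
second = ["x" * 90]
by_chars = char_similarity(first, second)
by_count = len([f for f in first if f in set(second)]) / len(first)
```

Here `by_chars` is 0.9 while the plain feature-count ratio `by_count` is 0.5, which shows how weighting by characters keeps a short unmatched feature from dominating the score.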
35. A method of automatically determining the topical classification of a document, comprising the steps of:
a. determining that at least a minimum quantity of partial document content features of an unclassified document match any of a set of said partial document content features of a previously classified document;
b. determining that at least a minimum weighted relative quantity of said matching partial document content features of said previously classified document are individually classified as being indicative of said previously classified document's topical classification;
c. assigning a topical classification of said previously classified document to said unclassified document.
Dependent claims: 36, 37
38. A method of selecting and collecting unclassified documents distributed in a network that may serve as samples of similar documents to be classified, comprising the steps of:
a. storing, for each unclassified or non-specifically classified document distributed in a network, profiles comprised of each document's partial document content features;
b. deriving, for a first new document distributed within a network, a profile comprised of said first new document's partial document content features;
c. calculating a measure of similarity of said first new document's profile relative to each of said existing unclassified or non-specifically classified document profiles;
d. classifying as partially duplicated said first new document for which at least a predetermined minimum measure of similarity is calculated with respect to its profile as compared to any of said existing unclassified or non-specifically classified document profiles;
e. retaining as a candidate new sample document said first partially duplicated document copy and its profile.
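The collection loop of claim 38 can be sketched as below. The claim does not specify the similarity measure, so Jaccard overlap of feature sets is used purely as a stand-in; the `CandidateCollector` class and its attribute names are likewise hypothetical.

```python
def jaccard(a, b):
    """Assumed similarity measure: shared features over all distinct features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class CandidateCollector:
    """Hypothetical tracker for unclassified documents seen in a network."""

    def __init__(self, min_similarity):
        self.min_similarity = min_similarity
        self.profiles = []    # step a: profiles of documents seen so far
        self.candidates = []  # step e: profiles retained as candidate samples

    def observe(self, features):
        """Record a new document and report whether it is a partial duplicate."""
        profile = frozenset(features)  # step b: derive the new document's profile
        # Steps c and d: compare against every stored profile and apply the minimum.
        is_partial_duplicate = any(
            jaccard(profile, existing) >= self.min_similarity
            for existing in self.profiles
        )
        if is_partial_duplicate:
            self.candidates.append(profile)
        self.profiles.append(profile)
        return is_partial_duplicate


collector = CandidateCollector(min_similarity=0.5)
seen = [
    collector.observe(["a", "b", "c"]),
    collector.observe(["a", "b", "d"]),
    collector.observe(["x", "y"]),
]
```

Only the second document is retained, since it shares two of four distinct features with the first; retained candidates would then go on to manual review.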
Specification