Document classification using multiscale text fingerprints
Abstract
Described systems and methods allow a classification of electronic documents such as email messages and HTML documents, according to a document-specific text fingerprint. The text fingerprint is calculated for a text block of each target document, and comprises a sequence of characters determined according to a plurality of text tokens of the respective text block. In some embodiments, the length of the text fingerprint is forced within a pre-determined range of lengths (e.g. between 129 and 256 characters) irrespective of the length of the text block, by zooming in for short text blocks, and zooming out for long ones. Classification may include, for instance, determining whether an electronic document represents unsolicited communication (spam) or online fraud such as phishing.
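The zooming behaviour described above can be illustrated with a minimal sketch. It is not the patented implementation: the tokenizer, the pruning rule (keep the first tokens), the threshold `MAX_TOKENS`, the fragment-size rule `UPPER // count`, and the use of SHAKE-256 as the hash are all assumptions chosen for illustration. Only the example bounds 129 and 256 come from the abstract.

```python
import hashlib
import re

LOWER, UPPER = 129, 256   # example bounds quoted in the abstract
MAX_TOKENS = UPPER        # assumed pruning threshold (hypothetical choice)


def select_tokens(text):
    """Assumed tokenizer: lowercase word tokens, pruned to MAX_TOKENS."""
    tokens = re.findall(r"\w+", text.lower())
    # Assumed pruning rule: keep only the first MAX_TOKENS tokens.
    return tokens[:MAX_TOKENS]


def fragment_size(count):
    """Assumed rule: large fragments for few tokens, small ones for many."""
    return UPPER // count


def fragment(token, size):
    """First `size` hex characters of a SHAKE-256 digest of the token."""
    digest = hashlib.shake_256(token.encode("utf-8")).hexdigest((size + 1) // 2)
    return digest[:size]


def fingerprint(text):
    """Concatenate one fixed-size hash fragment per selected token."""
    tokens = select_tokens(text)
    if not tokens:
        return ""  # nothing to fingerprint
    size = fragment_size(len(tokens))
    return "".join(fragment(t, size) for t in tokens)


short = fingerprint("win a free prize now")                      # 5 tokens
long_ = fingerprint(" ".join(f"word{i}" for i in range(1000)))   # pruned to 256
print(len(short), len(long_))  # prints "255 256"
```

Under these assumptions, a token count n of at most 256 yields a fingerprint of n * (256 // n) characters, which is always greater than 128 and at most 256, so both the five-word message and the thousand-word message land in the example range.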
12 Citations
22 Claims
1. A client computer system comprising at least one processor configured to determine a text fingerprint of a target electronic document so that a length of the text fingerprint is constrained between a lower bound and an upper bound, wherein the lower and upper bounds are predetermined, and wherein determining the text fingerprint comprises:
- selecting a plurality of text tokens of the target electronic document, wherein selecting the plurality of text tokens comprises:
  - selecting a preliminary plurality of text tokens of the target electronic document,
  - determining a count of the preliminary plurality of text tokens, and
  - in response, when the count of the preliminary plurality of text tokens exceeds a predetermined threshold, prune the preliminary plurality of text tokens to form the selected plurality of text tokens so that a count of the selected plurality of tokens does not exceed the predetermined threshold;
- in response to selecting the plurality of text tokens, determining a fingerprint fragment size according to the upper and lower bounds, and according to the count of the selected plurality of text tokens;
- determining a plurality of fingerprint fragments, each fingerprint fragment of the plurality of fingerprint fragments determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint fragment consisting of a sequence of characters, a length of the sequence chosen to equal the fingerprint fragment size; and
- concatenating the plurality of fingerprint fragments to form the text fingerprint.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A server computer system comprising at least one processor configured to perform transactions with a plurality of client systems, wherein a transaction comprises:
- receiving a text fingerprint from a client system of the plurality of client systems, the text fingerprint determined for a target electronic document so that a length of the text fingerprint is constrained between a lower bound and an upper bound, wherein the lower and upper bounds are predetermined; and
- sending to the client system a target label indicative of a category of documents that the target electronic document belongs to,
wherein determining the text fingerprint comprises:
- selecting a plurality of text tokens of the target electronic document, wherein selecting the plurality of text tokens comprises:
  - selecting a preliminary plurality of text tokens of the target electronic document,
  - determining a count of the preliminary plurality of text tokens, and
  - in response, when the count of the preliminary plurality of text tokens exceeds a predetermined threshold, prune the preliminary plurality of text tokens to form the selected plurality of text tokens so that a count of the selected plurality of tokens does not exceed the predetermined threshold;
- in response to selecting the plurality of text tokens, determining a fingerprint fragment size according to the upper and lower bounds, and according to the count of the selected plurality of text tokens;
- determining a plurality of fingerprint fragments, each fingerprint fragment of the plurality of fingerprint fragments determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint fragment consisting of a sequence of characters, a length of the sequence chosen to equal the fingerprint fragment size; and
- concatenating the plurality of fingerprint fragments to form the text fingerprint,
and wherein determining the target label comprises:
- retrieving a reference fingerprint from a database of reference fingerprints, the reference fingerprint determined for a reference electronic document belonging to the category, the reference fingerprint selected according to a length of the reference fingerprint so that the length of the reference fingerprint is between the upper and lower bounds; and
- determining whether the target electronic document belongs to the category according to a result of comparing the text fingerprint to the reference fingerprint.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
20. A method comprising employing at least one processor of a client computer system to determine a text fingerprint of a target electronic document so that a length of the text fingerprint is constrained between a lower bound and an upper bound, wherein the lower and upper bounds are predetermined, and wherein determining the text fingerprint comprises:
- selecting a plurality of text tokens of the target electronic document, wherein selecting the plurality of text tokens comprises:
  - selecting a preliminary plurality of text tokens of the target electronic document,
  - determining a count of the preliminary plurality of text tokens, and
  - in response, when the count of the preliminary plurality of text tokens exceeds a predetermined threshold, prune the preliminary plurality of text tokens to form the selected plurality of text tokens so that a count of the selected plurality of tokens does not exceed the predetermined threshold;
- in response to selecting the plurality of text tokens, determining a fingerprint fragment size according to the upper and lower bounds, and according to the count of the selected plurality of text tokens;
- determining a plurality of fingerprint fragments, each fingerprint fragment of the plurality of fingerprint fragments determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint fragment consisting of a sequence of characters, a length of the sequence chosen to equal the fingerprint fragment size; and
- concatenating the plurality of fingerprint fragments to form the text fingerprint.
Dependent claims: 21
22. A method comprising employing at least one processor of a server computer system configured to perform transactions with a plurality of client systems, to:
- receive a text fingerprint from a client system of the plurality of client systems, the text fingerprint determined for a target electronic document so that a length of the text fingerprint is constrained between a lower bound and an upper bound, wherein the lower and upper bounds are predetermined; and
- send to the client system a target label determined for the target electronic document, the target label indicating a category of documents that the target electronic document belongs to,
wherein determining the text fingerprint comprises:
- selecting a plurality of text tokens of the target electronic document, wherein selecting the plurality of text tokens comprises:
  - selecting a preliminary plurality of text tokens of the target electronic document,
  - determining a count of the preliminary plurality of text tokens, and
  - in response, when the count of the preliminary plurality of text tokens exceeds a predetermined threshold, prune the preliminary plurality of text tokens to form the selected plurality of text tokens so that a count of the selected plurality of tokens does not exceed the predetermined threshold;
- in response to selecting the plurality of text tokens, determining a fingerprint fragment size according to the upper and lower bounds, and according to the count of the selected plurality of text tokens;
- determining a plurality of fingerprint fragments, each fingerprint fragment of the plurality of fingerprint fragments determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint fragment consisting of a sequence of characters, a length of the sequence chosen to equal the fingerprint fragment size; and
- concatenating the plurality of fingerprint fragments to form the text fingerprint,
and wherein determining the target label comprises:
- retrieving a reference fingerprint from a database of reference fingerprints, the reference fingerprint determined for a reference electronic document belonging to the category, the reference fingerprint selected according to a length of the reference fingerprint so that the length of the reference fingerprint is between the upper and lower bounds; and
- determining whether the target electronic document belongs to the category according to a result of comparing the text fingerprint to the reference fingerprint.
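The server-side label determination recited in claims 11 and 22 could be sketched as follows. The claims do not say how fingerprints are compared, so the similarity measure (`difflib.SequenceMatcher`), the threshold `SIMILARITY_THRESHOLD`, the in-memory list standing in for the reference database, and the `"unknown"` fallback label are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

LOWER, UPPER = 129, 256       # example bounds quoted in the abstract
SIMILARITY_THRESHOLD = 0.75   # assumed cut-off, not taken from the claims


def classify(target_fp, reference_db):
    """Return the label of the most similar in-range reference fingerprint.

    reference_db is a list of (reference_fingerprint, label) pairs; it
    stands in for the database of reference fingerprints.
    """
    best_label, best_score = "unknown", 0.0
    for ref_fp, label in reference_db:
        # Select references by length: only those within the bounds.
        if not LOWER <= len(ref_fp) <= UPPER:
            continue
        # Assumed comparison: ratio of matching characters, 0.0 to 1.0.
        score = SequenceMatcher(None, target_fp, ref_fp, autojunk=False).ratio()
        if score >= SIMILARITY_THRESHOLD and score > best_score:
            best_label, best_score = label, score
    return best_label


references = [("ab" * 100, "spam"), ("ab" * 5, "too-short")]
near_duplicate = "ab" * 99 + "cd"
print(classify(near_duplicate, references))  # prints "spam"
print(classify("xy" * 100, references))      # prints "unknown"
```

Because every fingerprint falls within the same length range regardless of document size, a comparison of this kind operates on strings of comparable length, which is what makes a single similarity threshold plausible in this sketch.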
Specification