
Document classification using multiscale text fingerprints

  • US 8,935,783 B2
  • Filed: 03/08/2013
  • Issued: 01/13/2015
  • Est. Priority Date: 03/08/2013
  • Status: Active Grant
First Claim

1. A client computer system comprising at least one processor configured to determine a text fingerprint of a target electronic document so that a length of the text fingerprint is constrained between a lower bound and an upper bound, wherein the lower and upper bounds are predetermined, and wherein determining the text fingerprint comprises:

  • selecting a plurality of text tokens of the target electronic document, wherein selecting the plurality of text tokens comprises:

    selecting a preliminary plurality of text tokens of the target electronic document,
    determining a count of the preliminary plurality of text tokens, and
    in response, when the count of the preliminary plurality of text tokens exceeds a predetermined threshold, prune the preliminary plurality of text tokens to form the selected plurality of text tokens so that a count of the selected plurality of tokens does not exceed the predetermined threshold;

  • in response to selecting the plurality of text tokens, determining a fingerprint fragment size according to the upper and lower bounds, and according to the count of the selected plurality of text tokens;

  • determining a plurality of fingerprint fragments, each fingerprint fragment of the plurality of fingerprint fragments determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint fragment consisting of a sequence of characters, a length of the sequence chosen to equal the fingerprint fragment size; and

  • concatenating the plurality of fingerprint fragments to form the text fingerprint.
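
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not specify the tokenizer, the pruning rule, the hash function, the numeric bounds, or the formula for the fragment size, so everything below is an assumption made for illustration: the names LOWER_BOUND, UPPER_BOUND, MAX_TOKENS, select_tokens, fragment_size and text_fingerprint are hypothetical, tokens are taken to be lowercase alphanumeric words, pruning keeps the first tokens encountered, fragments are prefixes of SHA-256 hex digests, and the fragment size is the upper bound divided by the token count, rounded down.

```python
import hashlib
import re

# Hypothetical values; the claim only says the bounds and threshold are predetermined.
LOWER_BOUND = 16   # minimum fingerprint length, in characters
UPPER_BOUND = 32   # maximum fingerprint length, in characters
MAX_TOKENS = 32    # pruning threshold on the token count


def select_tokens(text, max_tokens=MAX_TOKENS):
    """Select distinct tokens, then prune if the count exceeds the threshold."""
    # Preliminary selection (assumed tokenizer): lowercase alphanumeric words,
    # de-duplicated while preserving first-seen order.
    preliminary = list(dict.fromkeys(re.findall(r"[a-z0-9]+", text.lower())))
    # Pruning (assumed rule): keep only the first max_tokens tokens.
    if len(preliminary) > max_tokens:
        return preliminary[:max_tokens]
    return preliminary


def fragment_size(count, lower=LOWER_BOUND, upper=UPPER_BOUND):
    """Pick a per-token fragment length so that count * size lies in [lower, upper]."""
    if count < 1:
        raise ValueError("no tokens selected")
    size = upper // count  # assumed formula: largest size that fits under the upper bound
    if size < 1 or count * size < lower:
        raise ValueError("bounds cannot be met for this token count")
    return size


def text_fingerprint(text):
    """Concatenate fixed-size hash fragments, one per selected token."""
    tokens = select_tokens(text)
    size = fragment_size(len(tokens))
    fragments = []
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()  # 64 hex characters
        fragments.append(digest[:size])  # fragment = first `size` characters of the hash
    return "".join(fragments)
```

With these assumed values, a two-token document gets two 16-character fragments, while a document with more than 32 distinct tokens is pruned to 32 tokens of one character each; in both cases the fingerprint is 32 characters long, so the length stays between the bounds regardless of document size.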
