Document Classification Using Multiscale Text Fingerprints

  • US 20140259157A1
  • Filed: 03/08/2013
  • Published: 09/11/2014
  • Est. Priority Date: 03/08/2013
  • Status: Active Grant
First Claim

1. A client computer system comprising at least one processor configured to determine a text fingerprint of a target electronic document so that a length of the text fingerprint is constrained between a lower bound and an upper bound, wherein the lower and upper bounds are predetermined, and wherein determining the text fingerprint comprises:

  • selecting a plurality of text tokens of the target electronic document;

  • in response to selecting the plurality of text tokens, determining a fingerprint fragment size according to the upper and lower bounds, and according to a count of the selected plurality of text tokens;

  • determining a plurality of fingerprint fragments, each fingerprint fragment of the plurality of fingerprint fragments determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint fragment consisting of a sequence of characters, a length of the sequence chosen to equal the fingerprint fragment size; and

  • concatenating the plurality of fingerprint fragments to form the text fingerprint.
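
The four steps recited in the claim can be illustrated with a minimal Python sketch. It is only one possible reading, not the patented implementation. The claim does not name a tokenizer, a hash function, or a formula for the fragment size, so everything concrete here is an assumption: the function name text_fingerprint, the word-based tokenizer, the SHAKE-128 hash, the default bounds of 16 and 64, the rule size = upper // count, and the choice to keep at most `upper` tokens.

```python
import hashlib
import re


def text_fingerprint(text, lower=16, upper=64):
    """Hypothetical sketch of a length-bounded text fingerprint."""
    if not 0 < lower <= upper:
        raise ValueError("bounds must satisfy 0 < lower <= upper")

    # Step 1: select tokens. Assumption: distinct lowercase words,
    # kept in order of first appearance.
    tokens = list(dict.fromkeys(re.findall(r"\w+", text.lower())))
    if not tokens:
        raise ValueError("document contains no tokens")
    # Assumption: keep at most `upper` tokens so each fragment can
    # contribute at least one character.
    tokens = tokens[:upper]

    # Step 2: fragment size from the bounds and the token count.
    # Assumption: the largest size that keeps the total within `upper`.
    size = upper // len(tokens)
    if size * len(tokens) < lower:
        raise ValueError("equal-size fragments cannot meet the lower bound")

    # Step 3: one fragment per distinct token, taken from a hash of it.
    # SHAKE-128 is used only because it yields digests of any length.
    fragments = []
    for token in tokens:
        digest = hashlib.shake_128(token.encode("utf-8")).hexdigest((size + 1) // 2)
        fragments.append(digest[:size])

    # Step 4: concatenate the fragments into the fingerprint.
    return "".join(fragments)
```

Under these assumptions, a document with few distinct tokens gets long fragments and a document with many gets short ones, so the fingerprint length stays between the two bounds regardless of document size.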
