METHODS AND SYSTEMS TO EFFICIENTLY FIND SIMILAR AND NEAR-DUPLICATE EMAILS AND FILES

  • US 20120209853A1
  • Filed: 02/16/2011
  • Published: 08/16/2012
  • Est. Priority Date: 01/23/2006
  • Status: Active Grant
First Claim

1. A method for generating document signatures, the method comprising:

    receiving, at one or more computer systems, a plurality of documents, each document in the plurality of documents having a plurality of terms;

    generating, with one or more processors associated with the one or more computer systems, a first set of trigrams for each document in the plurality of documents, each trigram in the first set of trigrams for a given document in the plurality of documents being a sequence in the given document of three terms in the plurality of terms of the given document;

    determining, with one or more processors associated with the one or more computer systems, a second set of trigrams for each document in the plurality of documents based on the first set of trigrams for the document and first filter criteria, the second set of trigrams for a given document in the plurality of documents being a subset of the first set of trigrams for the given document and having one or more trigrams that satisfy the first filter criteria; and

    storing, in a storage device associated with the one or more computer systems, the second set of trigrams for each document in the plurality of documents as offsets into the document for each term of each trigram in the second set of trigrams.
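The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The claim does not say what the "first filter criteria" are, so the stop-word filter (`STOP_WORDS`, `no_stop_words`) is a hypothetical stand-in, and the function names, the regex tokenizer, and the use of character offsets are likewise assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple

Term = Tuple[str, int]            # (term text, character offset in the document)
Trigram = Tuple[Term, Term, Term]
OffsetTrigram = Tuple[int, ...]   # offsets of the three terms of one trigram

# Hypothetical filter data: the claim does not define the filter criteria.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def tokenize(text: str) -> List[Term]:
    """Split a document into lowercase terms, remembering each term's offset."""
    return [(m.group().lower(), m.start()) for m in re.finditer(r"\w+", text)]


def first_trigram_set(terms: List[Term]) -> List[Trigram]:
    """Every run of three consecutive terms (the 'first set of trigrams')."""
    return [(terms[i], terms[i + 1], terms[i + 2]) for i in range(len(terms) - 2)]


def no_stop_words(trigram: Trigram) -> bool:
    """Assumed filter: keep a trigram only if none of its terms is a stop word."""
    return all(term not in STOP_WORDS for term, _ in trigram)


def document_signature(
    text: str, keep: Callable[[Trigram], bool] = no_stop_words
) -> List[OffsetTrigram]:
    """Build the filtered 'second set' and represent it purely as term offsets."""
    first = first_trigram_set(tokenize(text))
    second = [t for t in first if keep(t)]  # subset satisfying the filter
    return [tuple(offset for _, offset in t) for t in second]


def store_signatures(documents: List[str]) -> List[List[OffsetTrigram]]:
    """In-memory stand-in for the storage step: one offset list per document."""
    return [document_signature(doc) for doc in documents]


signatures = store_signatures(["the quick brown fox jumps over the lazy dog"])
```

Storing offsets rather than the trigram text keeps each signature compact, and the terms can still be recovered by reading the document at those positions.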

  • 8 Assignments