Comparing similarity between documents for filtering unwanted documents

  • US 8,874,663 B2
  • Filed: 08/28/2009
  • Issued: 10/28/2014
  • Est. Priority Date: 08/28/2009
  • Status: Active Grant
First Claim

1. A computer-implemented method comprising:

  • segmenting at least a portion of a reference document into a plurality of reference shingles, each reference shingle comprising a contiguous portion of the reference document that is of a predetermined length and shorter than the reference document, the plurality of reference shingles comprising a first series of reference shingles and a second series of reference shingles, the first series of reference shingles overlapping and having shifted starting locations relative to the second series of reference shingles;

  • storing the plurality of reference shingles in a trie structure and counts indicating a number of times each of the plurality of reference shingles comprising the first series of reference shingles and the second series of reference shingles appear in the plurality of reference shingles, a number of levels in the trie structure corresponding to a number of characters in each reference shingle;

  • segmenting at least a portion of a candidate document into a plurality of candidate shingles comprising a contiguous portion of the candidate document that is of the predetermined length and shorter than the candidate document;

  • determining a degree of matching between the stored reference shingles and the candidate shingles; and

  • computing a similarity index representing similarity between the reference document and the candidate document by an equation
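
The steps recited above can be illustrated with a minimal Python sketch. It is one possible reading of the claim language, not the patented implementation. The names `TrieNode`, `series`, `build_trie`, `lookup`, `trie_depth`, and `matching_fraction` are hypothetical, as are the shingle length of 4 and the shift of 2. The sketch assumes each series steps through the document in strides of one shingle length, with the second series starting `shift` characters later so that it overlaps the first. The claimed similarity equation is not reproduced in this excerpt, so the sketch stops at a simple degree of matching (the fraction of candidate shingles found in the trie) as a stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    count: int = 0  # times a shingle ending at this node was stored


def series(text, length, offset):
    """Fixed-length shingles starting at `offset`, one shingle length apart."""
    return [text[i:i + length]
            for i in range(offset, len(text) - length + 1, length)]


def build_trie(shingle_list):
    """Store shingles one character per level; count repeats at the last node."""
    root = TrieNode()
    for shingle in shingle_list:
        node = root
        for ch in shingle:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    return root


def lookup(root, shingle):
    """Return the stored count for `shingle`, or 0 if it is absent."""
    node = root
    for ch in shingle:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


def trie_depth(node):
    """Number of levels below `node` (equals the shingle length here)."""
    if not node.children:
        return 0
    return 1 + max(trie_depth(child) for child in node.children.values())


def matching_fraction(reference, candidate, length=4, shift=2):
    """Illustrative degree of matching, NOT the claimed similarity equation."""
    # Two overlapping reference series with shifted starting locations.
    ref_shingles = series(reference, length, 0) + series(reference, length, shift)
    root = build_trie(ref_shingles)
    cand_shingles = series(candidate, length, 0)
    if not cand_shingles:
        return 0.0
    hits = sum(1 for s in cand_shingles if lookup(root, s) > 0)
    return hits / len(cand_shingles)
```

Under these assumptions, the shifted second series lets a candidate whose text is offset by two characters relative to the reference still line up with stored shingles, which appears to be the purpose of storing both series.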
