Comparing similarity between documents for filtering unwanted documents
First Claim
Patent Images
1. A computer-implemented method comprising:
- segmenting at least a portion of a reference document into a plurality of reference shingles, each reference shingle comprising a contiguous portion of the reference document that is of a predetermined length and shorter than the reference document, the plurality of reference shingles comprising a first series of reference shingles and a second series of reference shingles, the first series of reference shingles overlapping and having shifted starting locations relative to the second series of reference shingles;
storing the plurality of reference shingles in a trie structure and counts indicating a number of times each of the plurality of reference shingles comprising the first series of reference shingles and the second series of reference shingles appear in the plurality of reference shingles, a number of levels in the trie structure corresponding to a number of characters in each reference shingle;
segmenting at least a portion of a candidate document into a plurality of candidate shingles comprising a contiguous portion of the candidate document that is of the predetermined length and shorter than the candidate document;
determining a degree of matching between the stored reference shingles and the candidate shingles; and
computing a similarity index representing similarity between the reference document and the candidate document by an equation
2 Assignments
0 Petitions
Accused Products
Abstract
A mechanism for efficiently determining similarity between documents. A set of reference data items is generated by processing a reference document. A similarity index representing similarity between a candidate document and the reference documents is obtained by counting segments of the candidate document matching the reference data items. The candidate document is a message transmitted in a communication system where the message is compared against one or more reference documents representing unwanted messages to filter and block unwanted messages from being transmittal or propagated.
33 Citations
19 Claims
-
1. A computer-implemented method comprising:
-
segmenting at least a portion of a reference document into a plurality of reference shingles, each reference shingle comprising a contiguous portion of the reference document that is of a predetermined length and shorter than the reference document, the plurality of reference shingles comprising a first series of reference shingles and a second series of reference shingles, the first series of reference shingles overlapping and having shifted starting locations relative to the second series of reference shingles; storing the plurality of reference shingles in a trie structure and counts indicating a number of times each of the plurality of reference shingles comprising the first series of reference shingles and the second series of reference shingles appear in the plurality of reference shingles, a number of levels in the trie structure corresponding to a number of characters in each reference shingle; segmenting at least a portion of a candidate document into a plurality of candidate shingles comprising a contiguous portion of the candidate document that is of the predetermined length and shorter than the candidate document; determining a degree of matching between the stored reference shingles and the candidate shingles; and computing a similarity index representing similarity between the reference document and the candidate document by an equation - View Dependent Claims (2, 3, 4, 5)
-
-
6. A computer-implemented method comprising:
-
receiving a first reference document identified as unwanted; receiving a communication transmitted through a communication channel; segmenting at least a portion of the first reference document into a plurality of first reference shingles, each of the first reference shingles a contiguous portion of the first reference document that is of a predetermined length and shorter than the first reference document, the plurality of first reference shingles comprising a first series of reference shingles and a second series of reference shingles, the first series of reference shingles overlapping and having shifted starting locations relative to the second series of reference shingles; storing the plurality of reference shingles in a trie structure and counts indicating a number of times each of the plurality of reference shingles comprising the first series of reference shingles and the second series of reference shingles appear in the plurality of reference shingles, a number of levels in the trie structure corresponding to a number of characters in each reference shingle; segmenting at least a portion of the communication into a plurality of candidate shingles comprising a contiguous portion of the communication that is of the predetermined length and shorter than the communication; determining a degree of matching between the first reference shingles and the candidate shingles; computing a first similarity index by an equation - View Dependent Claims (7, 8, 9, 10, 11)
-
-
12. A communication system comprising:
a communication processor configured to compare a communication with a plurality of first reference shingles associated with a first reference document, the first reference document identified as unwanted and generate a first similarity index representing similarity between the communication and the first reference document, the communication processor configured to determine whether the communication matches the first reference document based on the first similarity index, the communication processor configured to designate the communication to be blocked from further transmission through a communication channel responsive to determining that the communication matches the first reference document based on the first similarity index, the communication processor further configured to; segment at least a portion of the first reference document into a plurality of reference shingles, each of the first reference shingles comprising a contiguous portion of the first reference document that is of a predetermined length and shorter than the first reference document, the plurality of first reference shingles comprising a first series of reference shingles and a second series of reference shingles, the first series of reference shingles overlapping and having shifted starting locations relative to the second series of reference shingles; store the plurality of reference shingles in a trie structure and counts indicating a number of times each of the plurality of reference shingles comprising the first series of reference shingles and the second series of reference shingles appear in the plurality of reference shingles, a number of levels in the trie structure corresponding to a number of characters in each reference shingle; segment at least a portion of the communication into a plurality of candidate shingles comprising a contiguous portion of the communication that is of the predetermined length and shorter than the communication; determine a degree of matching between the first reference shingles and the candidate shingles; and compute the first similarity index by an equation - View Dependent Claims (13, 14, 15, 16, 17)
-
18. A non-transitory computer readable storage medium storing instructions, the instructions when executed by a processor, cause the processor to:
-
receive a first reference document identified as unwanted; receive a communication to be routed or published in a communication channel controlled by the processor; segment at least a portion of the first reference document into a plurality of first reference shingles, each of the first reference shingles comprising a contiguous portion of the first reference document that is of a predetermined length and shorter than the first reference document, the plurality of first reference shingles comprising a first series of reference shingles and a second series of reference shingles, the first series of reference shingles overlapping and having shifted starting locations relative to the second series of reference shingles; store the plurality of reference shingles in a trie structure and counts indicating a number of times each of the plurality of reference shingles comprising the first series of reference shingles and the second series of reference shingles appear in the plurality of reference shingles, a number of levels in the trie structure corresponding to a number of characters in each reference shingle; segment at least a portion of the communication into a plurality of candidate shingles comprising a contiguous portion of the communication that is of the predetermined length and shorter than the communication; compute a first similarity index by an equation
-
-
19. A computer-implemented method comprising:
-
segmenting at least a portion of a reference document into a plurality of reference shingles, each reference shingle comprising a contiguous portion of the reference document that is of a predetermined length and shorter than the reference document; storing the plurality of reference shingles in a trie structure and counts indicating a number of times each of the plurality of reference shingles comprising the first series of reference shingles and the second series of reference shingles appear in the plurality of reference shingles, a number of levels in the trie structure corresponding to a number of characters in each reference shingle; segmenting at least a portion of a candidate document into a plurality of candidate shingles, each candidate shingle comprising a contiguous portion of the candidate document that is of the predetermined length and shorter than the candidate document, the plurality of candidate shingles comprising a first series of candidate shingles and a second series of candidate shingles, the first series of candidate shingles overlapping with and having shifted starting locations relative to the second series of candidate shingles; determining a degree of matching between the reference shingles and the candidate shingles; and computing a similarity index representing similarity between the reference document and the candidate document by an equation
-
Specification