Matching engine with signature generation
Abstract
A system and a method generate at least one signature associated with a document. In one embodiment, a document comprising text is received and parsed to generate a token set. The token set includes a plurality of tokens. Each token corresponds to text in the document that is separated by a predefined character characteristic. A score is calculated for each token in the token set based on a frequency and distribution of the text in the document. Each token is then ranked based on the calculated score. A subset of the ranked tokens is selected, and a signature is generated for each occurrence of the selected tokens. The selected list of signatures is then output.
10 Claims
1. A method for generating a plurality of signatures associated with a document, the method comprising:
receiving a document comprising text;
parsing the document to generate a token set comprising a plurality of tokens, each token corresponding to the text in the document separated by a predefined character characteristic;
calculating a score for each token in the token set based on a frequency and distribution of occurrences of the text in the document;
ranking each token in the token set based on the calculated score;
selecting, from the ranked tokens, a predetermined number of top ranked tokens to limit a number of signatures generated for said document; and
generating a signature for each occurrence of the selected tokens,
wherein said score calculation is such that a more even distribution of the occurrences of the text in the document results in a higher score, and
wherein said score for each token is proportional to a first quantity divided by a second quantity, further wherein the first quantity comprises a position of a last occurrence of the text in the document minus a position of a first occurrence of the text in the document, and further wherein the second quantity comprises a square root of a sum of squares of differences in positions between adjacent occurrences of the text in the document.
Dependent claims: 2, 3, 4, 5
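The score recited in claim 1 can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the claim: p_1 < p_2 < ... < p_n are the positions of the n occurrences of a token t in the document, and S(t) is its score.

```latex
\[
S(t) \;\propto\; \frac{p_n - p_1}{\sqrt{\sum_{i=1}^{n-1} \left(p_{i+1} - p_i\right)^2}}
\]
```

For example, occurrences at positions 0, 10, 20, 30 give 30 / sqrt(100 + 100 + 100) ≈ 1.73, while occurrences at 0, 1, 2, 30 give 30 / sqrt(1 + 1 + 784) ≈ 1.07, so the evenly spaced token scores higher. By the Cauchy-Schwarz inequality the ratio never exceeds sqrt(n - 1) and reaches that bound only when all gaps are equal, which is consistent with the claim's statement that a more even distribution results in a higher score. The claim does not say how a token with a single occurrence (n = 1, where the ratio is 0/0) is handled.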
6. A computer readable storage medium storing instructions executable by a processor, the instructions when executed causing a processor to:
receive a document comprising text;
parse the document to generate a token set comprising a plurality of tokens, each token corresponding to the text in the document separated by a predefined character characteristic;
calculate a score for each token in the token set based on a frequency and distribution of the text in the document;
rank each token in the token set based on the calculated score;
select, from the ranked tokens a predetermined number of top ranked tokens to limit a number of signatures generated for said document; and
generate a signature for each occurrence of the selected tokens,
wherein said score calculation is such that a more even distribution of the occurrences of the text in the document results in a higher score, and
wherein said score for each token is proportional to a first quantity divided by a second quantity, further wherein the first quantity comprises a position of a last occurrence of the text in the document minus a position of a first occurrence of the text in the document, and further wherein the second quantity comprises a square root of a sum of squares of differences in positions between adjacent occurrences of the text in the document.
Dependent claims: 7, 8, 9, 10
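The steps recited in the independent claims (parse, score, rank, select, generate) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `tokenize`, `score`, and `generate_signatures`, the parameters `top_n` and `window`, the use of character offsets as positions, the choice of non-alphanumeric characters as the separator, the zero score for single-occurrence tokens, and the SHA-1 hash of a fixed-length text window as the signature are all hypothetical choices that the claims do not specify.

```python
import hashlib
import math
import re
from collections import defaultdict


def tokenize(text):
    """Return (token, character offset) pairs.

    Assumption: tokens are runs of alphanumeric characters, lowercased;
    the claims only require some predefined separator characteristic.
    """
    return [(m.group().lower(), m.start())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def score(positions):
    """(last - first) / sqrt(sum of squared gaps between adjacent occurrences).

    Assumption: a token with fewer than two occurrences scores 0.0,
    because the ratio is undefined in that case.
    """
    if len(positions) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return (positions[-1] - positions[0]) / math.sqrt(sum(g * g for g in gaps))


def generate_signatures(text, top_n=2, window=20):
    """Return (token, position, signature) for each occurrence of the top tokens."""
    # Parse: record every position at which each token occurs, in order.
    occurrences = defaultdict(list)
    for token, pos in tokenize(text):
        occurrences[token].append(pos)

    # Rank by score (highest first) and keep a fixed number of tokens,
    # which bounds how many signatures a document can produce.
    ranked = sorted(occurrences, key=lambda t: score(occurrences[t]), reverse=True)
    selected = ranked[:top_n]

    # Generate one signature per occurrence of each selected token.
    # Assumption: the signature is a hash of the text starting at the occurrence.
    signatures = []
    for token in selected:
        for pos in occurrences[token]:
            context = text[pos:pos + window]
            digest = hashlib.sha1(context.encode("utf-8")).hexdigest()
            signatures.append((token, pos, digest))
    return signatures
```

Calling `generate_signatures("cat dog cat emu cat dog cat", top_n=1)` selects `cat`, whose four evenly spaced occurrences outrank the two occurrences of `dog`, and returns one signature for each of those four occurrences.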
Specification