Generating signatures over a document
Abstract
A document accessible over a network can be registered. A registered document, and the content contained therein, cannot be transmitted undetected over and off of the network. In one embodiment, the invention includes maintaining a plurality of stored signatures over a registered document. In one embodiment, the plurality of stored signatures are generated by extracting content from the document, normalizing the extracted content, and generating the plurality of signatures using the normalized content.
27 Claims
1. A method for generating a plurality of signatures over a document, the method comprising:
extracting content from the document;
normalizing the extracted content;
generating the plurality of signatures using the normalized content; and
tokenizing the normalized content prior to generating the plurality of signatures;
wherein tokenizing the normalized content comprises creating an ordered list of tokens by converting each delimited item in the normalized content into a token;
wherein generating the plurality of signatures comprises:
i) selecting M consecutive tokens from the ordered list of tokens, M being a positive integer;
ii) selecting N special tokens from the M consecutive tokens, N being a positive integer not greater than M;
iii) generating a signature by calculating a hash over the N special tokens;
iv) skipping ahead P tokens from the first of the M consecutive tokens, P being a positive integer; and
v) repeating i), ii), iii), and iv) until the plurality of signatures have been generated.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
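Steps i) through v) of claim 1 can be read as a sliding window over the token list. The following is a minimal sketch under stated assumptions, not the patented implementation. The claim does not define what makes a token "special" or which hash is used, so the hypothetical helper `pick_special` simply keeps the N longest tokens in the window, and SHA-1 from Python's `hashlib` stands in for the hash. Stopping once fewer than M tokens remain is likewise only one possible reading of "until the plurality of signatures have been generated".

```python
import hashlib


def pick_special(window, n):
    """Hypothetical selection rule (an assumption, not from the claim):
    keep the n longest tokens, preserving their order in the window."""
    ranked = sorted(range(len(window)), key=lambda i: (-len(window[i]), i))[:n]
    return [window[i] for i in sorted(ranked)]


def generate_signatures(tokens, m, n, p):
    """Sketch of steps i)-v): window of m tokens, n special tokens, skip p."""
    if not (1 <= n <= m) or p < 1:
        raise ValueError("require 1 <= n <= m and p >= 1")
    signatures = []
    start = 0
    # Assumed stopping rule: stop when a full window of m tokens no longer fits.
    while start + m <= len(tokens):
        window = tokens[start:start + m]            # step i)
        special = pick_special(window, n)           # step ii)
        data = "\x1f".join(special).encode("utf-8")
        signatures.append(hashlib.sha1(data).hexdigest())  # step iii)
        start += p                                  # step iv), then repeat (step v)
    return signatures
```

With nine tokens and M = 4, N = 2, P = 2, the windows start at positions 0, 2, and 4, so this sketch yields three signatures. Choosing P smaller than M makes consecutive windows overlap.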
12. An apparatus comprising:
an extractor/decoder to extract content from a document;
a normalizer to normalize the extracted content;
a signature generator to generate a plurality of signatures using the normalized content; and
a tokenizer configured to create an ordered list of tokens by converting each delimited item in the normalized content into a token;
wherein the signature generator is configured to generate the plurality of signatures by:
i) selecting M consecutive tokens from the ordered list of tokens, M being a positive integer;
ii) selecting N special tokens from the M consecutive tokens, N being a positive integer not greater than M;
iii) generating a signature by calculating a hash over the N special tokens;
iv) skipping ahead P tokens from the first of the M consecutive tokens, P being a positive integer; and
v) repeating i), ii), iii), and iv) until the plurality of signatures have been generated.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
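The tokenizer element of claim 12 turns each delimited item into a token while preserving order. A minimal sketch follows. It assumes whitespace as the default delimiter, which the claim does not specify, and the function name `tokenize` and its `delimiters` parameter are illustrative only.

```python
import re


def tokenize(normalized, delimiters=r"\s+"):
    """Split normalized content on a delimiter pattern (whitespace is an
    assumption) and return the non-empty items as an ordered token list."""
    return [item for item in re.split(delimiters, normalized) if item]
```

A different delimiter pattern can be passed in, for example `tokenize("a,b;c", r"[,;]")` for comma- or semicolon-separated content.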
21. A tangible computer readable medium having stored thereon data representing instructions that, when executed by a processor, cause the processor to perform operations comprising:
extracting content from the document;
normalizing the extracted content; and
generating a plurality of signatures using the normalized content;
wherein the instructions further cause the processor to tokenize the normalized content prior to generating the plurality of signatures;
wherein tokenizing the normalized content comprises creating an ordered list of tokens by converting each delimited item in the normalized content into a token; and
wherein generating the plurality of signatures comprises:
i) selecting M consecutive tokens from the ordered list of tokens, M being a positive integer;
ii) selecting N special tokens from the M consecutive tokens, N being a positive integer not greater than M;
iii) generating a signature by calculating a hash over the N special tokens;
iv) skipping ahead P tokens from the first of the M consecutive tokens, P being a positive integer; and
v) repeating i), ii), iii), and iv) until the plurality of signatures have been generated.
Dependent claims: 22, 25, 26
23. The tangible computer readable medium of claim 21, wherein extracting the content from the document further comprises decoding the content contained in the document.
24. The tangible computer readable medium of claim 21, wherein normalizing the extracted content comprises eliminating irrelevant text from the extracted content.
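Claim 24 leaves open what counts as "irrelevant text". As an illustration only, the sketch below assumes that letter case, punctuation, repeated whitespace, and a small hypothetical stop-word list are irrelevant. The name `normalize` and the contents of `STOP_WORDS` are assumptions, not taken from the patent.

```python
import re
import unicodedata

# Hypothetical stop-word list; the claim does not enumerate irrelevant text.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and"})


def normalize(text):
    """Fold Unicode variants and case, replace punctuation with spaces,
    drop assumed stop words, and collapse whitespace to single spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)
```

Under these assumptions, two copies of a passage that differ only in formatting or punctuation normalize to the same string, so they would tokenize identically.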
27. The tangible computer readable medium of claim 21, wherein the instructions further cause the processor to compare the plurality of signatures with signatures stored in a signature database, and to observe registered content contained in the document based on the comparison.
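The comparison in claim 27 can be pictured as a lookup of each generated signature in a store of registered signatures. The sketch below is an assumption-laden illustration. It models the signature database as an in-memory dict mapping each signature to a registered document identifier, and the function name `match_registered` and the `min_hits` threshold are hypothetical.

```python
from collections import Counter


def match_registered(signatures, signature_db, min_hits=1):
    """Count, per registered document, how many of the given signatures
    appear in signature_db (assumed: dict of signature -> document id),
    and keep documents with at least min_hits matches."""
    hits = Counter(signature_db[s] for s in signatures if s in signature_db)
    return {doc_id: count for doc_id, count in hits.items() if count >= min_hits}
```

Raising `min_hits` trades sensitivity for fewer false positives. The patent text shown here does not state a threshold.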
Specification