METHODS FOR DOCUMENT-TO-TEMPLATE MATCHING FOR DATA-LEAK PREVENTION
First Claim
1. A method for document-to-template matching for data-leak prevention (DLP), the method comprising the steps of:
(a) providing a document as a stream of characters;
(b) splitting said stream into a plurality of serialized data lines;
(c) calculating a hash value for each said serialized data line;
(d) checking for each said hash value in a hash map of a template set;
(e) determining a similarity match to a particular template based on a predefined threshold of template hash values, of said template set, being found in said stream; and
(f) based on said similarity match, executing a DLP security policy for said document.
Abstract
The present invention discloses methods for document-to-template matching for data-leak prevention (DLP), the methods including the steps of: providing a document as a stream of characters; splitting the stream into a plurality of serialized data lines; calculating a hash value for each serialized data line; checking for each hash value in a hash map of a template set; determining a similarity match to a particular template based on a predefined threshold of template hash values, of the template set, being found in the stream; and based on the similarity match, executing a DLP security policy for the document. Preferably, the template set is extracted from documents manually prepared by a security administrator. Preferably, each template in the template set is deduced automatically from a plurality of documents.
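The matching steps summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the normalization used for "serialized" lines (whitespace collapsing and lowercasing), the choice of SHA-256, the interpretation of the threshold as a fraction of a template's distinct line hashes, and the names `serialize_lines`, `line_hash`, `build_hash_map`, `match_template`, and `apply_policy` are all assumptions introduced here.

```python
import hashlib


def serialize_lines(stream):
    """Split a character stream into normalized, non-empty lines.

    Assumption: "serialized" is taken to mean whitespace-collapsed and
    lowercased; the source does not define the normalization.
    """
    lines = []
    for raw in stream.splitlines():
        line = " ".join(raw.split()).lower()
        if line:
            lines.append(line)
    return lines


def line_hash(line):
    """Hash one serialized line (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(line.encode("utf-8")).hexdigest()


def build_hash_map(templates):
    """Return (hash_map, sizes) for a dict of template name -> template text.

    hash_map maps a line hash to the set of template names containing it;
    sizes maps a template name to its number of distinct line hashes.
    """
    hash_map, sizes = {}, {}
    for name, text in templates.items():
        hashes = {line_hash(line) for line in serialize_lines(text)}
        sizes[name] = len(hashes)
        for h in hashes:
            hash_map.setdefault(h, set()).add(name)
    return hash_map, sizes


def match_template(stream, hash_map, sizes, threshold=0.8):
    """Return (name, ratio) for the best match at or above threshold, else None."""
    found = {}  # template name -> distinct template hashes seen in the stream
    for line in serialize_lines(stream):
        h = line_hash(line)
        for name in hash_map.get(h, ()):
            found.setdefault(name, set()).add(h)
    best = None
    for name, hashes in found.items():
        ratio = len(hashes) / sizes[name]
        if ratio >= threshold and (best is None or ratio > best[1]):
            best = (name, ratio)
    return best


def apply_policy(match):
    """Hypothetical DLP policy: block matched documents, allow the rest."""
    return "block" if match is not None else "allow"
```

Because each line is looked up by hash, the cost of checking a document grows with the number of lines in the document rather than with the number of templates in the set, which is presumably the motivation for the hash map.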
14 Claims
1. A method for document-to-template matching for data-leak prevention (DLP), the method comprising the steps of:
(a) providing a document as a stream of characters;
(b) splitting said stream into a plurality of serialized data lines;
(c) calculating a hash value for each said serialized data line;
(d) checking for each said hash value in a hash map of a template set;
(e) determining a similarity match to a particular template based on a predefined threshold of template hash values, of said template set, being found in said stream; and
(f) based on said similarity match, executing a DLP security policy for said document.
Dependent claims: 2, 3, 4

5. A method for document-to-template matching by designating multiple documents for use as a template for data-leak prevention (DLP), the method comprising the steps of:
(a) providing a plurality of documents as a stream of characters;
(b) splitting said stream into a plurality of serialized data lines;
(c) inserting said plurality of serialized data lines into a list;
(d) grouping duplicate serialized data lines in said list with an indication of a frequency of occurrence for each said serialized data line in said stream;
(e) eliminating serialized data lines having a threshold frequency below a predefined threshold from said list;
(f) grouping remaining serialized data lines to represent the template;
(g) calculating a hash value for each said serialized data line in the template;
(h) inserting each said hash value into a hash map of a template set;
(i) checking for hash values of a new document in said hash map;
(j) determining a similarity match to a particular template based on a predefined threshold of template hash values, of said template set, being found in said new document; and
(k) based on said similarity match, executing a DLP security policy for said new document.
Dependent claims: 6, 7

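Steps (a) through (h) of claim 5, which deduce a template from several documents, might look something like the sketch below. It is illustrative only and rests on assumptions not stated in the source: lines are normalized by collapsing whitespace and lowercasing, frequency is counted over the combined stream of all documents, the cutoff is an absolute count (`min_frequency`), and the names `serialize_lines`, `deduce_template`, and `add_template` are hypothetical.

```python
import hashlib
from collections import Counter


def serialize_lines(stream):
    """Split a character stream into normalized, non-empty lines.

    Assumption: normalization means collapsing whitespace and lowercasing.
    """
    lines = []
    for raw in stream.splitlines():
        line = " ".join(raw.split()).lower()
        if line:
            lines.append(line)
    return lines


def deduce_template(documents, min_frequency):
    """Keep the lines that occur at least min_frequency times across all documents."""
    all_lines = []  # step (c): every serialized line goes into one list
    for doc in documents:
        all_lines.extend(serialize_lines(doc))
    counts = Counter(all_lines)  # step (d): group duplicates with their frequency
    # steps (e)-(f): drop infrequent lines; the remaining lines form the template
    return [line for line, n in counts.items() if n >= min_frequency]


def add_template(hash_map, name, template_lines):
    """Steps (g)-(h): hash each template line and record it under the template name."""
    for line in template_lines:
        h = hashlib.sha256(line.encode("utf-8")).hexdigest()
        hash_map.setdefault(h, set()).add(name)
```

With three invoices that differ only in the customer line, a `min_frequency` of 3 would keep the shared boilerplate lines and discard the per-customer data, so the resulting template captures the form rather than any one filled-in instance.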
8. A computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code comprising:
(a) program code for providing a document as a stream of characters;
(b) program code for splitting said stream into a plurality of serialized data lines;
(c) program code for calculating a hash value for each said serialized data line;
(d) program code for checking for each said hash value in a hash map of a template set;
(e) program code for determining a similarity match to a particular template based on a predefined threshold of template hash values, of said template set, being found in said stream; and
(f) program code for, based on said similarity match, executing a security policy for said document.
Dependent claims: 9, 10, 11

12. A computer-readable storage medium having computer-readable code embodied on the computer-readable storage medium, the computer-readable code comprising:
(a) program code for providing a plurality of documents as a stream of characters;
(b) program code for splitting said stream into a plurality of serialized data lines;
(c) program code for inserting said plurality of serialized data lines into a list;
(d) program code for grouping duplicate serialized data lines in said list with an indication of a frequency of occurrence for each said serialized data line in said stream;
(e) program code for eliminating serialized data lines having a threshold frequency below a predefined threshold from said list;
(f) program code for grouping remaining serialized data lines to represent the template;
(g) program code for calculating a hash value for each said serialized data line in the template;
(h) program code for inserting each said hash value into a hash map of a template set;
(i) program code for checking for hash values of a new document in said hash map;
(j) program code for determining a similarity match to a particular template based on a predefined threshold of template hash values, of said template set, being found in said new document; and
(k) program code for, based on said similarity match, executing a security policy for said new document.
Dependent claims: 13, 14

Specification