Document fingerprinting with asymmetric selection of anchor points
Abstract
One embodiment relates to a computer-implemented process for generating document fingerprints. A document is normalized to create a normalized text string. A first hash function with a sliding hash window is applied to the normalized text string to generate an array of hash values. Candidate anchoring points are selected by applying a first filter to the array of hash values. The anchoring points are chosen by applying a second filter to the candidate anchoring points. Finally, a second hash function is applied to substrings located at the chosen anchoring points to generate hash values for use as fingerprints for the document. Other embodiments and aspects are also disclosed.
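The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The abstract does not specify the normalization rule, the hash functions, or the filter criteria, so every concrete choice here is hypothetical: the polynomial rolling hash, the modulus test used as the first filter, the minimum-spacing rule used as the second filter, SHA-1 as the second hash function, and all constants (`WINDOW`, `MODULUS`, `MIN_GAP`, `SUBSTR_LEN`, `BASE`, `PRIME`).

```python
import hashlib
import re

# All constants below are assumed values chosen for illustration only.
WINDOW = 8        # length of the sliding hash window
MODULUS = 16      # first filter keeps positions whose hash % MODULUS == 0
MIN_GAP = 32      # second filter keeps candidates at least this far apart
SUBSTR_LEN = 40   # length of the substring hashed at each anchoring point
BASE, PRIME = 257, (1 << 61) - 1


def normalize(document: str) -> str:
    """Lowercase the text and keep only letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", document.lower())


def rolling_hashes(text: str, window: int = WINDOW) -> list[int]:
    """First hash function: polynomial hash of every window-length substring."""
    if len(text) < window:
        return []
    top = pow(BASE, window - 1, PRIME)
    h = 0
    for ch in text[:window]:
        h = (h * BASE + ord(ch)) % PRIME
    hashes = [h]
    for i in range(window, len(text)):
        # Drop the outgoing character, shift, and add the incoming one.
        h = ((h - ord(text[i - window]) * top) * BASE + ord(text[i])) % PRIME
        hashes.append(h)
    return hashes


def first_filter(hashes: list[int], modulus: int = MODULUS) -> list[int]:
    """Select candidate anchoring points (assumed criterion: hash % modulus == 0)."""
    return [i for i, h in enumerate(hashes) if h % modulus == 0]


def second_filter(candidates: list[int], min_gap: int = MIN_GAP) -> list[int]:
    """Select anchoring points by greedily enforcing a minimum spacing (assumed)."""
    anchors: list[int] = []
    for pos in candidates:
        if not anchors or pos - anchors[-1] >= min_gap:
            anchors.append(pos)
    return anchors


def fingerprints(document: str) -> set[str]:
    """Second hash function (SHA-1 here) over the substring at each anchor."""
    text = normalize(document)
    anchors = second_filter(first_filter(rolling_hashes(text)))
    return {
        hashlib.sha1(text[p:p + SUBSTR_LEN].encode()).hexdigest()
        for p in anchors
        if p + SUBSTR_LEN <= len(text)
    }
```

Because normalization in this sketch removes case and punctuation, `fingerprints(doc)` and `fingerprints(doc.upper())` return the same set.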
38 Citations
21 Claims
1. A computer-implemented process for generating document fingerprints, the method being performed using a computer including at least a processor, data storage, and computer-readable instructions, and the method comprising:
- normalizing a document to create a normalized text string;
- applying a first hash function with a sliding hash window to the normalized text string to generate an array of hash values;
- applying a first filter to the array of hash values to select candidate anchoring points;
- applying a second filter to the candidate anchoring points to select anchoring points; and
- applying a second hash function to substrings located at the selected anchoring points to generate hash values for use as fingerprints of the document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
10. A computer-implemented method for document leakage prevention, the method comprising:
- applying a first procedure to sensitive documents to be protected, the first procedure using a first filter and a second filter to choose anchoring points and generating fingerprints based on the chosen anchoring points; and
- applying a second procedure to a target document to be inspected for sensitive text, the second procedure using the first filter, but not the second filter, to select anchoring points and generating target fingerprints based on the selected anchoring points.
Dependent claims: 11, 12.
13. A computer apparatus configured to generate document fingerprints, the apparatus comprising:
- data storage configured to store computer-readable instruction code and data;
- a processor configured to access the data storage and to execute said computer-readable instruction code;
- computer-readable instruction code configured to normalize a computer-readable document to create a normalized text string;
- computer-readable instruction code configured to apply a first hash function with a sliding hash window to the normalized text string to generate an array of hash values;
- computer-readable instruction code configured to apply a first filter to the array of hash values to select candidate anchoring points;
- computer-readable instruction code configured to apply a second filter to the candidate anchoring points to select anchoring points; and
- computer-readable instruction code configured to apply a second hash function to substrings located at the selected anchoring points to generate hash values for use as fingerprints of the document.
Dependent claims: 14, 15, 16, 17, 18, 19.
20. A data leakage prevention apparatus implemented on a distributed network of computer apparatus, the data leakage prevention apparatus comprising:
- a server computer configured to apply a first procedure to sensitive documents to be protected, the first procedure using a first filter and a second filter to choose anchoring points and generating fingerprints based on the chosen anchoring points; and
- a plurality of end-point computers configured to each apply a second procedure to a target document to be inspected for sensitive text, the second procedure using the first filter, but not the second filter, to select anchoring points and generating target fingerprints based on the selected anchoring points.
Dependent claims: 21.
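The asymmetric use of the two filters recited in claims 10 and 20 can be sketched as follows. This is a hedged illustration, not the claimed implementation: the claims do not define either filter, so the CRC32 modulus test (first filter), the greedy minimum-spacing rule (second filter), SHA-256 for the fingerprints, the function names, and the constants are all assumptions. The input text is assumed to be already normalized.

```python
import hashlib
import zlib

# Assumed illustrative parameters; none of these values come from the claims.
WINDOW, MODULUS, MIN_GAP, SUBSTR_LEN = 8, 4, 16, 24


def candidates(text: str) -> list[int]:
    """First filter (assumed): window starts whose CRC32 is divisible by MODULUS."""
    return [
        i for i in range(len(text) - WINDOW + 1)
        if zlib.crc32(text[i:i + WINDOW].encode()) % MODULUS == 0
    ]


def thin(positions: list[int]) -> list[int]:
    """Second filter (assumed): greedily keep candidates at least MIN_GAP apart."""
    kept: list[int] = []
    for pos in positions:
        if not kept or pos - kept[-1] >= MIN_GAP:
            kept.append(pos)
    return kept


def fingerprint_set(text: str, use_second_filter: bool) -> set[str]:
    """Hash the substring at each anchoring point into a fingerprint."""
    anchors = candidates(text)
    if use_second_filter:
        anchors = thin(anchors)
    return {
        hashlib.sha256(text[p:p + SUBSTR_LEN].encode()).hexdigest()
        for p in anchors
        if p + SUBSTR_LEN <= len(text)
    }


def contains_sensitive_text(protected: set[str], target_text: str) -> bool:
    """Second procedure: fingerprint the target with the first filter only."""
    return bool(protected & fingerprint_set(target_text, use_second_filter=False))


# First procedure (server side): both filters, so fewer fingerprints are stored.
secret = "thequarterlyrevenueforecastisconfidentialuntilthepressrelease" * 3
protected = fingerprint_set(secret, use_second_filter=True)

# Second procedure (end point): first filter only, applied to the target.
target = "hello" + secret + "regards"
leaked = contains_sensitive_text(protected, target)
```

In this sketch the first filter depends only on the contents of each window, so a passage copied into a target document yields the same candidates there. The second filter depends on neighbouring candidates, so skipping it on the target side keeps every anchor the protected side might have chosen. This is one reading of why the filters are applied asymmetrically, offered as an assumption rather than taken from the claims.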
Specification