Partial document content matching using sectional analysis
First Claim
1. A method comprising:
computing a set of reference sectional fingerprints corresponding to a reference document having a classification selected from a group comprising at least a public classification and a private classification, wherein successful access to the reference document by discovery agent software results in the reference document having the public classification and otherwise having a classification other than the public classification, and at least one of the reference sectional fingerprints is based at least in part on two or more tokens of the reference document, the tokens being selected from remaining words of the reference document after exclusion of a set of language-dependent words of the reference document;
associating the reference sectional fingerprints with the classification of the reference document;
after the associating, monitoring network traffic via a network interface;
computing a set of traffic sectional fingerprints corresponding to the monitored network traffic, wherein at least one of the traffic sectional fingerprints is based at least in part on two or more tokens of the monitored network traffic;
determining that at least one of the traffic sectional fingerprints matches at least one of the reference sectional fingerprints;
for each respective traffic sectional fingerprint matching at least one of the reference sectional fingerprints associated with the public classification, classifying the respective traffic sectional fingerprint as the public classification;
for each respective traffic sectional fingerprint matching none of the reference sectional fingerprints associated with the public classification and matching at least one of the reference sectional fingerprints associated with the private classification, classifying the respective traffic sectional fingerprint as the private classification;
wherein the act of associating, the acts of computing, and the act of determining are at least in part via one or more central processing units enabled to execute software;
wherein the reference sectional fingerprints and the traffic sectional fingerprints are sliding sectional fingerprints;
wherein the reference document is interpreted as groups of contiguous token strings and each reference sliding sectional fingerprint corresponds to one of the groups of reference document contiguous token strings; and
wherein the monitored network traffic is interpreted as groups of contiguous token strings and each traffic sliding sectional fingerprint corresponds to one of the groups of monitored traffic contiguous token strings.
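The fingerprinting steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the stop-word set (standing in for the "language-dependent words"), the window of three tokens, the SHA-256 hash, and the names `tokenize` and `sliding_fingerprints` are all assumptions chosen for illustration.

```python
import hashlib
import re

# Hypothetical stop-word set standing in for the "language-dependent words".
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "for"})


def tokenize(text):
    """Lower-case the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def sliding_fingerprints(text, window=3):
    """Return one fingerprint per group of `window` contiguous tokens.

    The window advances one token at a time, so every alignment of the
    text is covered. Texts shorter than the window yield an empty set.
    """
    tokens = tokenize(text)
    prints = set()
    for i in range(len(tokens) - window + 1):
        group = " ".join(tokens[i:i + window])
        prints.add(hashlib.sha256(group.encode("utf-8")).hexdigest())
    return prints


reference = "The quarterly revenue forecast for the northern region exceeded expectations"
traffic = "fwd: quarterly revenue forecast northern region looks good"

# Fingerprints shared by the reference document and the monitored traffic.
shared = sliding_fingerprints(reference) & sliding_fingerprints(traffic)
print(len(shared))
```

Because stop words are removed before the groups are formed, the traffic still matches the reference even though "for the" is missing from it.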
Abstract
Monitored content is classified to determine partial matches with fragments of documents. A set of redundant keys, or sliding sectional fingerprints, is computed for every possible alignment of the documents with respect to the monitored content. The keys are stored in repositories according to the classification of the corresponding documents. Sectional fingerprints are computed for the monitored content, and the repositories are searched. If a match is found in a repository corresponding to public content, then the monitored data section is classified as public. If a match is found only in a repository corresponding to private content, then the data section is classified as private. Otherwise, the data section is classified as unknown. In a related aspect, a set of policies is searched for a first match, in part according to the classifications of the monitored data sections, and a designated action is taken if the first match is found.
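The classification order and the policy search described in the abstract can be sketched as follows. This is an illustrative reading only: the `Classification` enum, the policy list format (a classification paired with an action string), and the action names "block" and "log" are assumptions, not details taken from the patent.

```python
from enum import Enum


class Classification(Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    UNKNOWN = "unknown"


def classify_section(key, public_keys, private_keys):
    """Classify one monitored data section by its fingerprint key.

    The public repository is checked first, so a key present in both
    repositories is classified as public.
    """
    if key in public_keys:
        return Classification.PUBLIC
    if key in private_keys:
        return Classification.PRIVATE
    return Classification.UNKNOWN


# Hypothetical ordered policies: (classification to look for, action to take).
POLICIES = [
    (Classification.PRIVATE, "block"),
    (Classification.UNKNOWN, "log"),
]


def first_matching_action(classifications, policies=POLICIES):
    """Return the action of the first policy that matches, or None."""
    for wanted, action in policies:
        if wanted in classifications:
            return action
    return None


public_keys = {"k1"}
private_keys = {"k1", "k2"}
results = [classify_section(k, public_keys, private_keys) for k in ("k1", "k2", "k3")]
print(results, first_matching_action(results))
```

Checking the public repository before the private one means content that has already been published does not trigger a private-data action, even when the same text also appears in a private document.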
67 Claims
1. A method as recited under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
26. A method comprising:
computing document sliding sectional fingerprints corresponding to a first and a second document, wherein successful access to the first document by discovery agent software results in the first document having a public classification and unsuccessful access to the second document by the discovery agent software results in the second document having a private classification, and at least one of the document sliding sectional fingerprints is based at least in part on two or more tokens of at least one of the documents, the tokens being selected from remaining words of the at least one document after exclusion of a set of language-dependent words of the at least one document;
associating the document sliding sectional fingerprints with the classification of the corresponding document;
after the associating, monitoring network traffic via a network interface;
computing a set of traffic sliding sectional fingerprints corresponding to the monitored network traffic, wherein at least one of the traffic sliding sectional fingerprints is based at least in part on two or more tokens of the monitored network traffic;
for each of the traffic sliding sectional fingerprints classifying the traffic sliding sectional fingerprint as the public classification in response to the traffic sliding sectional fingerprint matching at least one of the document sliding sectional fingerprints associated with the public classification, and classifying the traffic sliding sectional fingerprint as the private classification in response to the traffic sliding sectional fingerprint matching none of the document sliding sectional fingerprints associated with the public classification and matching at least one of the document sliding sectional fingerprints associated with the private classification;
wherein the acts of computing, the act of associating, and the acts of classifying are at least in part via one or more central processing units enabled to execute software;
wherein each document is interpreted as groups of contiguous token sections and each document sliding sectional fingerprint corresponds to one of the groups of document contiguous token sections; and
wherein the monitored network traffic is interpreted as groups of contiguous token sections and each traffic sliding sectional fingerprint corresponds to one of the groups of monitored traffic contiguous token sections.
Dependent claims: 27, 28, 29, 30, 31, 32
33. A system comprising:
a sliding sectional fingerprint unit implemented at least in part via a hardware accelerator and enabled to compute a set of document sliding sectional fingerprints corresponding to a document having a classification selected from a group comprising at least a public classification and a private classification, wherein successful access to the document by discovery agent software results in the document having the public classification and otherwise having a classification other than the public classification, the sliding sectional fingerprint unit being further enabled to compute at least one of the document sliding sectional fingerprints based at least in part on two or more tokens of the document, the tokens being selected from remaining words of the document after exclusion of a set of language-dependent words of the document, and the sliding sectional fingerprint unit being further enabled to associate the document sliding sectional fingerprints with the classification of the document;
a network traffic monitoring and analyzing unit implemented at least in part via the hardware accelerator, coupled to the sliding sectional fingerprint unit, and enabled to compute a set of traffic sliding sectional fingerprints corresponding to monitored network traffic, at least one of the traffic sliding sectional fingerprints being based at least in part on two or more tokens of the monitored network traffic;
wherein the network traffic monitoring and analyzing unit is further enabled to determine that at least one of the traffic sliding sectional fingerprints matches at least one of the document sliding sectional fingerprints;
wherein for each respective traffic sliding sectional fingerprint, the network traffic monitoring and analyzing unit is further enabled to classify the respective traffic sliding sectional fingerprint as the public classification in response to the respective traffic sliding sectional fingerprint matching at least one of the document sliding sectional fingerprints associated with the public classification, and as the private classification in response to the respective traffic sliding sectional fingerprint matching none of the document sliding sectional fingerprints associated with the public classification and matching at least one of the document sliding sectional fingerprints associated with the private classification;
wherein the document is interpreted as groups of contiguous token sections and each document sliding sectional fingerprint corresponds to one of the groups of document contiguous token sections; and
wherein the monitored network traffic is interpreted as groups of contiguous token sections and each traffic sliding sectional fingerprint corresponds to one of the groups of monitored traffic contiguous token sections.
Dependent claims: 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54
55. A system comprising:
a content appliance comprising a repository having a first portion and a second portion corresponding respectively to a public classification and a private classification, a processor coupled to the repository and enabled to execute content appliance software, and a network interface coupled to the processor;
a computer coupled to the content appliance, the computer enabled to execute computer software;
wherein the content appliance software comprises functions enabling receiving keys and corresponding file classification from the computer and storing the keys in a portion of the repository selected according to the file classification, sampling network traffic via the network interface, computing traffic sliding sectional fingerprints based on the sampled network traffic, at least one of the traffic sliding sectional fingerprints being based at least in part on two or more tokens of the sampled network traffic, and classifying each traffic sliding sectional fingerprint computed, as the public classification in response to the traffic sliding sectional fingerprint matching any of the keys in the first portion of the repository, as the private classification in response to the traffic sliding sectional fingerprint not matching any of the keys in the first portion of the repository and matching any of the keys in the second portion of the repository, and as a third classification otherwise;
wherein the computer software comprises functions enabling receiving a file and computing a corresponding set of file sliding sectional fingerprints, at least one of the file sliding sectional fingerprints being computed based at least in part on two or more tokens of the file, the tokens being selected from remaining words of the file after exclusion of a set of language-dependent words of the file, and providing the file sliding sectional fingerprints and a classification of the file to the content appliance as a set of keys, the classification being the public classification as a result of the file being successfully accessed by discovery agent software and being a classification other than the public classification as a result of the file not being successfully accessed by the discovery agent software;
wherein the file is interpreted as groups of contiguous token sections and each file sliding sectional fingerprint corresponds to one of the groups of file contiguous token sections; and
wherein the sampled network traffic is interpreted as groups of contiguous token sections and each traffic sliding sectional fingerprint corresponds to one of the groups of sampled traffic contiguous token sections.
Dependent claims: 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67
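The key-provisioning side of claim 55, in which the computer hands keys and a file classification to the content appliance, might look roughly like the sketch below. The `Repository` class, its `receive_keys` method, and the `file_classification` helper are hypothetical names. Treating every file the discovery agent cannot reach as private is also an assumption, since the claim only requires a classification other than public.

```python
class Repository:
    """Key store with one portion per classification (illustrative only)."""

    def __init__(self):
        # First portion holds public keys, second portion holds private keys.
        self.portions = {"public": set(), "private": set()}

    def receive_keys(self, keys, classification):
        """Store the keys in the portion selected by the file classification."""
        self.portions[classification].update(keys)


def file_classification(discovery_agent_succeeded):
    """Map the discovery agent's access result to a file classification.

    Assumption: files the agent could not access are treated as private.
    """
    return "public" if discovery_agent_succeeded else "private"


repo = Repository()
# Keys of a file the discovery agent could access go to the public portion.
repo.receive_keys({"k1", "k2"}, file_classification(True))
# Keys of a file the agent could not access go to the private portion.
repo.receive_keys({"k2", "k3"}, file_classification(False))
print(sorted(repo.portions["public"]), sorted(repo.portions["private"]))
```

A key such as `k2` can legitimately sit in both portions. The claim resolves that overlap at lookup time by checking the first (public) portion before the second.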
Specification