Systems and methods for classifying documents for data loss prevention
Abstract
A computer-implemented method for classifying documents for data loss prevention may include 1) identifying training documents for a machine learning classifier configured for data loss prevention, 2) performing a semantic analysis on the training documents to identify topics within the set of training documents, 3) applying a similarity metric to the topics to identify at least one unrelated topic with a similarity to the other topics within the plurality of topics, as determined by the similarity metric, that falls below a similarity threshold, 4) identifying, based on the semantic analysis, at least one irrelevant training document within the set of training documents in which a predominance of the unrelated topic is above a predominance threshold, and 5) excluding the irrelevant training document from the set of training documents based on the predominance of the unrelated topic within the irrelevant training document. Various other methods, systems, and computer-readable media are also disclosed.
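Steps 3 through 5 of the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the semantic analysis of step 2 (for example, a topic model such as LSA or LDA) has already produced a term-weight vector per topic and a topic-weight mapping per document. All names, numbers, and thresholds below are hypothetical. Cosine similarity and the "mean similarity to the other topics" rule are assumptions chosen for illustration, since the text only requires some similarity metric and a threshold.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_unrelated_topics(topic_vectors, similarity_threshold):
    """Step 3: flag topics whose mean similarity to all other topics
    falls below the threshold (the averaging rule is an assumption)."""
    unrelated = set()
    for name, vec in topic_vectors.items():
        sims = [cosine(vec, other)
                for other_name, other in topic_vectors.items()
                if other_name != name]
        if sims and sum(sims) / len(sims) < similarity_threshold:
            unrelated.add(name)
    return unrelated


def filter_training_set(doc_topic_weights, unrelated, predominance_threshold):
    """Steps 4 and 5: exclude documents in which an unrelated topic
    predominates. Predominance is read from the unrelated topic's own
    weight only, not from the weights of any related topic."""
    kept, excluded = [], []
    for doc_id, weights in doc_topic_weights.items():
        predominance = max((weights.get(t, 0.0) for t in unrelated),
                           default=0.0)
        if predominance > predominance_threshold:
            excluded.append(doc_id)
        else:
            kept.append(doc_id)
    return kept, excluded


# Hypothetical output of the semantic analysis. Vocabulary order:
# merger, acquisition, valuation, picnic, parking
topic_vectors = {
    "deal_terms":    [0.6, 0.3, 0.1, 0.0, 0.0],
    "financials":    [0.2, 0.3, 0.5, 0.0, 0.0],
    "office_events": [0.0, 0.0, 0.05, 0.6, 0.35],
}
# Hypothetical per-document topic weights.
doc_topic_weights = {
    "memo_17":       {"deal_terms": 0.7, "financials": 0.3},
    "forecast_q3":   {"deal_terms": 0.2, "financials": 0.7, "office_events": 0.1},
    "picnic_invite": {"deal_terms": 0.05, "office_events": 0.95},
}

unrelated = find_unrelated_topics(topic_vectors, similarity_threshold=0.2)
kept, excluded = filter_training_set(doc_topic_weights, unrelated,
                                     predominance_threshold=0.5)
print(unrelated)  # {'office_events'}
print(kept)       # ['memo_17', 'forecast_q3']
print(excluded)   # ['picnic_invite']
```

In this toy data the "office_events" topic shares almost no vocabulary with the two business topics, so its mean similarity is far below 0.2 and it is treated as unrelated. The picnic invitation is dominated by that topic and is dropped from the prospective training set, while the forecast, which only touches on it, is kept.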
20 Claims
1. A computer-implemented method for classifying documents for data loss prevention, at least a portion of the method being performed by a computing device comprising at least one processor, the method comprising:
identifying a set of prospective training documents for a machine learning classifier that is configured to provide input for data loss prevention determinations, the set of prospective training documents comprising documents regarded as sensitive;
performing a semantic analysis on the set of prospective training documents to identify a plurality of topics within the set of prospective training documents;
applying a similarity metric to the plurality of topics to identify at least one unrelated topic within the plurality of topics with a similarity to the other topics within the plurality of topics, as determined by the similarity metric, that falls below a predetermined similarity threshold, and to thereby determine that the unrelated topic is unrelated to the other topics within the plurality of topics;
identifying, based at least in part on the semantic analysis, at least one irrelevant prospective training document within the set of prospective training documents in which a predominance of the unrelated topic is above a predetermined predominance threshold by determining the predominance of the unrelated topic by the presence of the unrelated topic within the irrelevant prospective training document as identified in the semantic analysis and not by the presence within the irrelevant prospective training document of any related topic within the plurality of topics;
excluding the irrelevant prospective training document from the set of prospective training documents based at least in part on the predominance of the unrelated topic within the irrelevant prospective training document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for classifying documents for data loss prevention, the system comprising:
an identification module programmed to identify a set of prospective training documents for a machine learning classifier that is configured to provide input for data loss prevention determinations, the set of prospective training documents comprising documents regarded as sensitive;
an analysis module programmed to perform a semantic analysis on the set of prospective training documents to identify a plurality of topics within the set of prospective training documents;
a similarity module programmed to apply a similarity metric to the plurality of topics to identify at least one unrelated topic within the plurality of topics with a similarity to the other topics within the plurality of topics, as determined by the similarity metric, that falls below a predetermined similarity threshold, and to thereby determine that the unrelated topic is unrelated to the other topics within the plurality of topics;
a predominance module programmed to identify, based at least in part on the semantic analysis, at least one irrelevant prospective training document within the set of prospective training documents in which a predominance of the unrelated topic is above a predetermined predominance threshold by determining the predominance of the unrelated topic by the presence of the unrelated topic within the irrelevant prospective training document as identified in the semantic analysis and not by the presence within the irrelevant prospective training document of any related topic within the plurality of topics;
an exclusion module programmed to exclude the irrelevant prospective training document from the set of prospective training documents based at least in part on the predominance of the unrelated topic within the irrelevant prospective training document;
at least one processor configured to execute the identification module, the analysis module, the similarity module, the predominance module, and the exclusion module.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
20. A non-transitory computer-readable-storage medium comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to:
identify a set of prospective training documents for a machine learning classifier that is configured to provide input for data loss prevention determinations, the set of prospective training documents comprising documents regarded as sensitive;
perform a semantic analysis on the set of prospective training documents to identify a plurality of topics within the set of prospective training documents;
apply a similarity metric to the plurality of topics to identify at least one unrelated topic within the plurality of topics with a similarity to the other topics within the plurality of topics, as determined by the similarity metric, that falls below a predetermined similarity threshold, and to thereby determine that the unrelated topic is unrelated to the other topics within the plurality of topics;
identify, based at least in part on the semantic analysis, at least one irrelevant prospective training document within the set of prospective training documents in which a predominance of the unrelated topic is above a predetermined predominance threshold by determining the predominance of the unrelated topic by the presence of the unrelated topic within the irrelevant prospective training document as identified in the semantic analysis and not by the presence within the irrelevant prospective training document of any related topic within the plurality of topics;
exclude the irrelevant prospective training document from the set of prospective training documents based at least in part on the predominance of the unrelated topic within the irrelevant prospective training document.
Specification