Data quality assessment for vector machine learning
Abstract
A computing device receives a training data set that comprises a plurality of sensitive documents and a plurality of non-sensitive documents. The computing device determines a quality of the training data set. The quality may be determined using k-fold cross validation and/or latent semantic indexing. In response to determining that the training data set has a satisfactory quality, the computing device then analyzes the training data set using machine learning to train a machine learning-based detection (MLD) profile, the MLD profile to be used by a data loss prevention (DLP) system to classify new documents as sensitive documents or as non-sensitive documents.
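The quality gate described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the bag-of-words centroid classifier, the function names (`k_fold_accuracy`, `train_if_good`), the 0.8 accuracy threshold, and the dictionary used as an "MLD profile" are all assumptions made for the example. Only the k-fold cross validation path is shown; latent semantic indexing is not.

```python
from collections import Counter
import math


def centroid(docs):
    """Sum the token counts of all documents in one class."""
    total = Counter()
    for text in docs:
        total.update(text.lower().split())
    return total


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(text, pos_centroid, neg_centroid):
    """Return True (sensitive) if the text is at least as close to the
    sensitive centroid as to the non-sensitive one (ties go to sensitive)."""
    vec = Counter(text.lower().split())
    return cosine(vec, pos_centroid) >= cosine(vec, neg_centroid)


def k_fold_accuracy(sensitive, non_sensitive, k=3):
    """Hold out each of k folds in turn and measure held-out accuracy."""
    labeled = [(d, True) for d in sensitive] + [(d, False) for d in non_sensitive]
    correct = 0
    for fold in range(k):
        # Striding by k spreads both classes across the folds (a simplification).
        test = labeled[fold::k]
        train = [x for i, x in enumerate(labeled) if i % k != fold]
        pos = centroid([d for d, y in train if y])
        neg = centroid([d for d, y in train if not y])
        correct += sum(predict(d, pos, neg) == y for d, y in test)
    return correct / len(labeled)


def train_if_good(sensitive, non_sensitive, threshold=0.8, k=3):
    """Train a profile only when cross-validated accuracy meets the threshold."""
    accuracy = k_fold_accuracy(sensitive, non_sensitive, k)
    if accuracy < threshold:
        return None, accuracy  # quality not satisfactory: no profile generated
    profile = {"sensitive": centroid(sensitive),
               "non_sensitive": centroid(non_sensitive)}
    return profile, accuracy


sensitive_docs = ["salary payroll confidential report",
                  "confidential payroll bonus figures",
                  "employee salary confidential memo"]
other_docs = ["lunch menu pizza friday",
              "team picnic friday games",
              "office lunch pizza party"]
profile, accuracy = train_if_good(sensitive_docs, other_docs)
```

Here the cross-validated accuracy stands in for the "quality" of the training data set; a production system would likely use a stronger classifier and a larger, stratified split.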
18 Claims
1. A method, implemented by a computing device, comprising:
receiving a training data set that comprises a plurality of sensitive documents and a plurality of non-sensitive documents;
determining, by the computing device, a quality of the training data set, wherein determining the quality of the training data set comprises performing at least one of k-fold cross validation or latent semantic indexing using the training data set;
in response to determining that the training data set has a satisfactory quality, analyzing, by the computing device, the training data set using machine learning to generate a machine learning-based detection (MLD) profile, the MLD profile to be used by a data loss prevention (DLP) system to classify new documents as sensitive documents or as non-sensitive documents; and
in response to determining that the training data set does not have satisfactory quality, identifying at least one document from the training data set that caused the quality of the training data set to be reduced.
Dependent claims: 2, 3, 4, 5, 6.
7. A non-transitory computer readable storage medium including instructions that, when executed by a processing device, cause the processing device to perform operations comprising:
receiving a training data set that comprises a plurality of sensitive documents and a plurality of non-sensitive documents;
determining, by the processing device, a quality of the training data set, wherein determining the quality of the training data set comprises performing at least one of k-fold cross validation or latent semantic indexing using the training data set; and
in response to determining that the training data set has a satisfactory quality, analyzing, by the processing device, the training data set using machine learning to generate a machine learning-based detection (MLD) profile, the MLD profile to be used by a data loss prevention (DLP) system to classify new documents as sensitive documents or as non-sensitive documents.
Dependent claims: 8, 9, 10, 11, 12, 13.
14. A computing device comprising:
a memory to store instructions for performing machine learning; and
a processing device, coupled to the memory, to execute the instructions, wherein the processing device is to:
receive a training data set that comprises a plurality of sensitive documents and a plurality of non-sensitive documents;
determine a quality of the training data set, wherein determining the quality of the training data set comprises performing at least one of k-fold cross validation or latent semantic indexing using the training data set;
in response to determining that the training data set has a satisfactory quality, analyze the training data set using machine learning to generate a machine learning-based detection (MLD) profile, the MLD profile to be used by a data loss prevention (DLP) system to classify new documents as sensitive documents or as non-sensitive documents; and
in response to determining that the training data set does not have satisfactory quality, identify at least one document from the training data set that caused the quality of the training data set to be reduced.
Dependent claims: 15, 16, 17, 18.
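Claims 1 and 14 recite identifying at least one document that reduced the quality of the training data set, but the claim language here does not say how. One plausible approach, shown purely as an assumption, is to hold out each document in turn (k-fold cross validation with k equal to the number of documents) and flag any document that matches the opposite class more strongly than its own. The function names and the token-frequency score below are hypothetical.

```python
from collections import Counter


def class_counts(docs):
    """Token counts over all documents of one class."""
    counts = Counter()
    for text in docs:
        counts.update(text.lower().split())
    return counts


def score(text, counts):
    """Share of a class's token mass accounted for by the tokens of the text."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[t] for t in text.lower().split()) / total


def find_suspect_documents(sensitive, non_sensitive):
    """Return documents that, when held out, fit the other class strictly better."""
    labeled = [(d, True) for d in sensitive] + [(d, False) for d in non_sensitive]
    suspects = []
    for i, (text, is_sensitive) in enumerate(labeled):
        rest = labeled[:i] + labeled[i + 1:]  # every document except the held-out one
        pos = class_counts([d for d, y in rest if y])
        neg = class_counts([d for d, y in rest if not y])
        pos_score, neg_score = score(text, pos), score(text, neg)
        # Flag only when the opposite class is a strictly better match.
        wrong = neg_score > pos_score if is_sensitive else pos_score > neg_score
        if wrong:
            suspects.append(text)
    return suspects


# "friday lunch pizza menu" is deliberately mislabeled as sensitive.
sensitive_docs = ["salary payroll confidential report",
                  "confidential payroll bonus figures",
                  "employee salary confidential memo",
                  "friday lunch pizza menu"]
other_docs = ["lunch menu pizza friday",
              "team picnic friday games",
              "office lunch pizza party"]
suspects = find_suspect_documents(sensitive_docs, other_docs)
```

Flagged documents could then be shown to an administrator for relabeling or removal before the quality check is run again.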
Specification