Data quality assessment for vector machine learning

US 9,015,082 B1
Filed: 12/14/2011
Issued: 04/21/2015
Est. Priority Date: 12/14/2010
Status: Active Grant

Abstract

A computing device receives a training data set that comprises a plurality of sensitive documents and a plurality of non-sensitive documents. The computing device determines a quality of the training data set. The quality may be determined using k-fold cross validation and/or latent semantic indexing. In response to determining that the training data set has a satisfactory quality, the computing device then analyzes the training data set using machine learning to train a machine learning-based detection (MLD) profile, the MLD profile to be used by a data loss prevention (DLP) system to classify new documents as sensitive documents or as non-sensitive documents.
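The quality gate described in the abstract can be illustrated with a minimal sketch of k-fold cross validation. Everything here is an assumption made for illustration and is not taken from the patent: the bag-of-words nearest-centroid classifier stands in for whatever learner is actually used, and the names `k_fold_accuracy` and `has_satisfactory_quality`, the fold assignment, and the 0.9 threshold are hypothetical.

```python
from collections import Counter
import math


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def centroid(docs):
    # Summed term counts of all documents in one class.
    total = Counter()
    for doc in docs:
        total.update(tokenize(doc))
    return total


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(doc, centroids):
    # Assign the label whose class centroid is most similar to the document.
    vec = Counter(tokenize(doc))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def k_fold_accuracy(docs, labels, k=3):
    # Hold out every k-th document in turn, train on the rest,
    # and count how many held-out documents get their given label back.
    n = len(docs)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        by_label = {}
        for i in range(n):
            if i not in test_idx:
                by_label.setdefault(labels[i], []).append(docs[i])
        centroids = {lab: centroid(ds) for lab, ds in by_label.items()}
        correct += sum(predict(docs[i], centroids) == labels[i] for i in test_idx)
    return correct / n


def has_satisfactory_quality(docs, labels, k=3, threshold=0.9):
    # Assumed quality rule: cross-validated accuracy must reach the threshold.
    return k_fold_accuracy(docs, labels, k) >= threshold
```

Under this sketch, a training set whose labels agree with its content passes the gate, while one containing mislabeled documents scores lower and would be held back from profile generation.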

18 Claims

1. A method, implemented by a computing device, comprising:
- receiving a training data set that comprises a plurality of sensitive documents and a plurality of non-sensitive documents;
  
  determining, by the computing device, a quality of the training data set, wherein determining the quality of the training data set comprises performing at least one of k-fold cross validation or latent semantic indexing using the training data set;
  
  in response to determining that the training data set has a satisfactory quality, analyzing, by the computing device, the training data set using machine learning to generate a machine learning-based detection (MLD) profile, the MLD profile to be used by a data loss prevention (DLP) system to classify new documents as sensitive documents or as non-sensitive documents; and
  
  in response to determining that the training data set does not have satisfactory quality, identifying at least one document from the training data set that caused the quality of the training data set to be reduced.
- Dependent claims (2, 3, 4, 5, 6)
- - 2. The method of claim 1, further comprising:
    - for each document in the training data set, determining whether the document is a sensitive document or a non-sensitive document based on performing local weighted latent semantic indexing.
  - 3. The method of claim 1, further comprising:
    - receiving a user selection of a memory allocation via a user interface before analyzing the training data set; and
      
      determining whether a memory utilization for the MLD profile complies with the memory allocation.
  - 4. The method of claim 1, wherein the received training data set is a single data set that does not distinguish between the plurality of sensitive documents or the plurality of non-sensitive documents, the method further comprising:
    - using local weighted latent semantic indexing (LSI) to divide the training data set into a plurality of distinct sets of documents;
      
      identifying a first distinct set of documents as containing the plurality of sensitive documents and a second distinct set of documents as containing the plurality of non-sensitive documents; and
      
      using machine learning with the first distinct set of documents and the second distinct set of documents to generate the machine learning-based detection (MLD) profile.
  - 5. The method of claim 4, wherein the first distinct set of documents is identified as containing the plurality of sensitive documents and the second distinct set of documents is identified as containing the plurality of non-sensitive documents based on user input.
  - 6. The method of claim 1, further comprising:
    - identifying at least one of a document moving through a data loss vector or a request to move the document through the data loss vector; and
      
      determining whether the document is a sensitive document or a non-sensitive document based on application of the MLD profile.
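The last step of claim 1, identifying documents that reduced the quality of the training data set, could be approximated as follows. This is only a sketch under stated assumptions, not the claimed method: it uses leave-one-out validation (k-fold with k equal to the number of documents) and a nearest-neighbour rule over Jaccard word overlap, and the function names `jaccard` and `suspect_documents` are hypothetical.

```python
def jaccard(a, b):
    # Word-set overlap between two documents, in the range [0, 1].
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def suspect_documents(docs, labels):
    # Hold each document out in turn and find its most similar neighbour.
    # A document whose nearest neighbour carries a different label is
    # flagged as one that may be lowering the training set's quality.
    suspects = []
    for i, doc in enumerate(docs):
        others = [j for j in range(len(docs)) if j != i]
        nearest = max(others, key=lambda j: jaccard(doc, docs[j]))
        if labels[nearest] != labels[i]:
            suspects.append(i)
    return suspects
```

The returned indices could then be shown to a user for relabeling or removal before the quality check is repeated.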

7. A non-transitory computer readable storage medium including instructions that, when executed by a processing device, cause the processing device to perform operations comprising:
- receiving a training data set that comprises a plurality of sensitive documents and a plurality of non-sensitive documents;
  
  determining, by the processing device, a quality of the training data set, wherein determining the quality of the training data set comprises performing at least one of k-fold cross validation or latent semantic indexing using the training data set; and
  
  in response to determining that the training data set has a satisfactory quality, analyzing, by the processing device, the training data set using machine learning to generate a machine learning-based detection (MLD) profile, the MLD profile to be used by a data loss prevention (DLP) system to classify new documents as sensitive documents or as non-sensitive documents.
- Dependent claims (8, 9, 10, 11, 12, 13)
- - 8. The non-transitory computer readable storage medium of claim 7, the operations further comprising:
    - in response to determining that the training data set does not have satisfactory quality, identifying at least one document from the training data set that caused the quality of the training data set to be reduced.
  - 9. The non-transitory computer readable storage medium of claim 7, the operations further comprising:
    - for each document in the training data set, determining whether the document is a sensitive document or a non-sensitive document based on performing local weighted latent semantic indexing.
  - 10. The non-transitory computer readable storage medium of claim 7, the operations further comprising:
    - receiving a user selection of a memory allocation via a user interface before analyzing the training data set; and
      
      determining whether a memory utilization for the MLD profile complies with the memory allocation.
  - 11. The non-transitory computer readable storage medium of claim 7, wherein the received training data set is a single data set that does not distinguish between the plurality of sensitive documents or the plurality of non-sensitive documents, the operations further comprising:
    - using local weighted latent semantic indexing (LSI) to divide the training data set into a plurality of distinct sets of documents;
      
      identifying a first distinct set of documents as containing the plurality of sensitive documents and a second distinct set of documents as containing the plurality of non-sensitive documents; and
      
      using machine learning with the first distinct set of documents and the second distinct set of documents to generate the machine learning-based detection (MLD) profile.
  - 12. The non-transitory computer readable storage medium of claim 11, wherein the first distinct set of documents is identified as containing the plurality of sensitive documents and the second distinct set of documents is identified as containing the plurality of non-sensitive documents based on user input.
  - 13. The non-transitory computer readable storage medium of claim 7, the operations further comprising:
    - identifying at least one of a document moving through a data loss vector or a request to move the document through the data loss vector; and
      
      determining whether the document is a sensitive document or a non-sensitive document based on application of the MLD profile.

14. A computing device comprising:
- a memory to store instructions for performing machine learning; and
  
  a processing device, coupled to the memory, to execute the instructions, wherein the processing device is to:
  
  receive a training data set that comprises a plurality of sensitive documents and a plurality of non-sensitive documents;
  
  determine a quality of the training data set, wherein determining the quality of the training data set comprises performing at least one of k-fold cross validation or latent semantic indexing using the training data set;
  
  in response to determining that the training data set has a satisfactory quality, analyze the training data set using machine learning to generate a machine learning-based detection (MLD) profile, the MLD profile to be used by a data loss prevention (DLP) system to classify new documents as sensitive documents or as non-sensitive documents; and
  
  in response to determining that the training data set does not have satisfactory quality, identify at least one document from the training data set that caused the quality of the training data set to be reduced.
- Dependent claims (15, 16, 17, 18)
- - 15. The computing device of claim 14, wherein the processing device is further configured to:
    - for each document in the training data set, determine whether the document is a sensitive document or a non-sensitive document based on performing local weighted latent semantic indexing.
  - 16. The computing device of claim 14, wherein the received training data set is a single data set that does not distinguish between the plurality of sensitive documents or the plurality of non-sensitive documents, wherein the processing device is further configured to:
    - use local weighted latent semantic indexing (LSI) to divide the training data set into a plurality of distinct sets of documents;
      
      identify a first distinct set of documents as containing the plurality of sensitive documents and a second distinct set of documents as containing the plurality of non-sensitive documents; and
      
      use machine learning with the first distinct set of documents and the second distinct set of documents to generate the machine learning-based detection (MLD) profile.
  - 17. The computing device of claim 16, wherein the first distinct set of documents is identified as containing the plurality of sensitive documents and the second distinct set of documents is identified as containing the plurality of non-sensitive documents based on user input.
  - 18. The computing device of claim 14, wherein the processing device is further to:
    - identify at least one of a document moving through a data loss vector or a request to move the document through the data loss vector; and
      
      determine whether the document is a sensitive document or a non-sensitive document based on application of the MLD profile.

Current Assignee: CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee: Symantec Corporation (NortonLifeLock Inc.)
Inventors: Jaiswal, Sumesh; DiCorpo, Phillip; Sawant, Shitalkumar S.; Kauffman, Sally; Galindez, Alan Dale; Aggarwal, Ashish
Primary Examiner(s): Gaffin, Jeffrey A
Assistant Examiner(s): Olude Afolabi, Olatoyosi
Application Number: US13/326,198
Time in Patent Office: 1,224 Days
Field of Search: 706/12
US Class Current: 706/12

CPC Class Codes

G06F 21/6209   to a single file or object,...

G06F 21/6227   where protection concerns t...

G06F 2221/2101   Auditing as a secondary aspect

G06F 2221/2107   File encryption

G06F 2221/2141   Access rights, e.g. capabil...

G06F 2221/2147   Locking files

G06N 20/00   Machine learning

G06Q 10/0631   Resource planning, allocati...

H04L 63/20   for managing network securi...