Systems and methods for generating machine learning-based classifiers for detecting specific categories of sensitive information

US 8,688,601 B2
Filed: 07/26/2011
Issued: 04/01/2014
Est. Priority Date: 05/23/2011
Status: Active Grant

Abstract

A computer-implemented method may include (1) identifying a plurality of specific categories of sensitive information to be protected by a DLP system, (2) obtaining a training data set for each specific category of sensitive information that includes a plurality of positive and a plurality of negative examples of the specific category of sensitive information, (3) using machine learning to train, based on an analysis of the training data sets, at least one machine learning-based classifier that is capable of detecting items of data that contain one or more of the plurality of specific categories of sensitive information, and then (4) deploying the machine learning-based classifier within the DLP system to enable the DLP system to detect and protect items of data that contain one or more of the plurality of specific categories of sensitive information in accordance with at least one DLP policy of the DLP system.
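
The four steps of the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation: the tokenizer, the smoothed log-odds model, the category names, and the example documents are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter

# (1) Hypothetical categories and (2) one training set per category.
TRAINING_SETS = {
    "financial": {
        "positive": ["quarterly revenue forecast and profit margin",
                     "invoice total and bank account balance"],
        "negative": ["patient diagnosis and prescribed medication",
                     "team lunch is scheduled for friday"],
    },
    "medical": {
        "positive": ["patient diagnosis and prescribed medication",
                     "blood test results for the patient"],
        "negative": ["quarterly revenue forecast and profit margin",
                     "team lunch is scheduled for friday"],
    },
}

def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()

def train_classifier(positives, negatives):
    """(3) Train one category model: per-token log-odds with add-one smoothing."""
    pos_counts = Counter(t for doc in positives for t in tokenize(doc))
    neg_counts = Counter(t for doc in negatives for t in tokenize(doc))
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + len(vocab)
    neg_total = sum(neg_counts.values()) + len(vocab)
    weights = {
        t: math.log((pos_counts[t] + 1) / pos_total)
        - math.log((neg_counts[t] + 1) / neg_total)
        for t in vocab
    }
    prior = math.log(len(positives) / len(negatives))
    return {"weights": weights, "prior": prior}

def train_all(training_sets):
    """Train one model per category from its own training set."""
    return {cat: train_classifier(ds["positive"], ds["negative"])
            for cat, ds in training_sets.items()}

def score(model, text):
    """Sum of log-odds for known tokens; unknown tokens contribute nothing."""
    return model["prior"] + sum(model["weights"].get(t, 0.0) for t in tokenize(text))

def detect_categories(models, text):
    """(4) Deployed check: every category whose model scores the item above zero."""
    return sorted(cat for cat, m in models.items() if score(m, text) > 0)

MODELS = train_all(TRAINING_SETS)
print(detect_categories(MODELS, "revenue forecast and bank balance"))  # ['financial']
print(detect_categories(MODELS, "patient invoice total"))  # ['financial', 'medical']
```

Training one model per category, rather than a single multi-class model, lets an item be flagged for more than one category at once, which matches the "one or more" wording of the abstract.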

20 Claims

1. A computer-implemented method for generating machine learning-based classifiers for detecting specific categories of sensitive information, at least a portion of the method being performed by a computing device comprising at least one processor, the method comprising:
- identifying a plurality of specific categories of sensitive information to be protected by a data loss prevention (DLP) system;
  
  obtaining a training data set customized for each specific category of sensitive information that comprises a plurality of positive examples of data that fall within the specific category of sensitive information and a plurality of negative examples of data that do not fall within the specific category of sensitive information;
  
  using machine learning to train, based on an analysis of the training data sets, at least one machine learning-based classifier to detect items of data that contain one or more of the plurality of specific categories of sensitive information;
  
  deploying the machine learning-based classifier within the DLP system to enable the DLP system to detect and protect, using the machine learning-based classifier, items of data that contain one or more of the plurality of specific categories of sensitive information by performing at least one DLP action specified by at least one DLP policy of the DLP system, wherein the DLP action is selected based at least in part on whether the item of data comprises a percentage of one or more of the plurality of specific categories of sensitive information that exceeds a predetermined percentage threshold.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The computer-implemented method of claim 1, wherein, for each training data set, the negative examples within the training data set comprise the positive examples from all other training data sets.
  - 3. The computer-implemented method of claim 1, wherein using machine learning to train the machine learning-based classifier comprises, for each training data set:
    - extracting a feature set from the training data set that comprises statistically significant features of the positive examples within the training data set and statistically significant features of the negative examples within the training data set;
      
      building a machine learning-based classification model from the feature set that is capable of indicating whether or not items of data contain the specific category of sensitive information associated with the training data set.
  - 4. The computer-implemented method of claim 1, wherein the machine learning-based classifier detects items of data that contain more than one of the plurality of specific categories of sensitive information.
  - 5. The computer-implemented method of claim 4, wherein, for each item of data that contains more than one of the plurality of specific categories of sensitive information, the machine learning-based classifier is configured to identify at least one of:
    - the specific categories of sensitive information that the item of data contains;
      
      for each specific category of sensitive information that the item of data contains, the percentage of the item of data that comprises that specific category of sensitive information;
      
      for each specific category of sensitive information that the item of data contains, the specific portion of the item of data that contains that specific category of sensitive information.
  - 6. The computer-implemented method of claim 1, wherein the DLP action comprises, for each specific category of sensitive information contained within the item of data, at least one of:
    - restricting access to the item of data to entities that are authorized to access the specific category of sensitive information;
      
      restricting access to the portion of the item of data that contains the specific category of sensitive information to entities that are authorized to access the specific category of sensitive information;
      
      automatically appending a custom disclaimer to the item of data that applies to the category of sensitive information.
  - 7. The computer-implemented method of claim 6, wherein restricting access to the portion of the item of data that contains the specific category of sensitive information to entities that are authorized to access the specific category of sensitive information comprises, prior to allowing an entity to access the item of data, redacting portions from the item of data that contain specific categories of sensitive information that the entity is not authorized to access.
  - 8. The computer-implemented method of claim 7, further comprising replacing the redacted information with a notification that indicates that an entity does not have access rights to view the redacted information.
  - 9. The system of claim 8, wherein the training module is configured to perform feature extraction multiple times, each time using a different feature-extraction algorithm.
  - 10. The computer-implemented method of claim 1, wherein deploying the machine learning-based classifier within the DLP system comprises providing the machine learning-based classifier as part of the DLP policy to at least one of:
    - a DLP agent installed on at least one client device;
      
      a DLP engine installed on at least one server configured to monitor a plurality of client devices.
  - 11. The computer-implemented method of claim 1, further comprising, upon deploying the machine learning-based classifier within the DLP system:
    - identifying an attempt to access at least one item of data via a data-loss vector;
      
      determining, using the machine learning-based classifier, that the item of data comprises a percentage of one or more of the plurality of specific categories of sensitive information that exceeds the predetermined percentage threshold;
      
      protecting at least a portion of the item of data by performing the DLP action specified by the DLP policy of the DLP system.
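
The percentage threshold of claims 1 and 11, the per-category percentages of claim 5, and the redaction with notification of claims 7 and 8 might be sketched as follows. Everything here is a hypothetical illustration: the sentence-level segmentation, the character-based percentage, the keyword stand-in for a trained classifier, and the mapping from threshold result to action are assumptions, since the claims leave those choices to the DLP policy.

```python
import re

# Hypothetical stand-in for a trained classifier: keyword lookup per category.
KEYWORDS = {"financial": {"revenue", "invoice"}, "medical": {"patient", "diagnosis"}}

def keyword_classifier(segment):
    """Return the categories whose keywords appear in the segment."""
    words = set(re.findall(r"[a-z]+", segment.lower()))
    return [cat for cat, kws in KEYWORDS.items() if words & kws]

def split_segments(text):
    """Split an item into sentence-like segments (a simplifying assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def category_percentages(segments, classify_segment):
    """Percent of the item's characters lying in segments of each category."""
    total = sum(len(s) for s in segments)
    if total == 0:
        return {}
    pct = {}
    for seg in segments:
        for cat in classify_segment(seg):
            pct[cat] = pct.get(cat, 0.0) + 100.0 * len(seg) / total
    return pct

def select_action(percentages, threshold_pct):
    """Pick an action from an assumed policy based on the percentage threshold."""
    over = sorted(c for c, p in percentages.items() if p > threshold_pct)
    if over:
        return ("restrict", over)
    if percentages:
        return ("append_disclaimer", sorted(percentages))
    return ("allow", [])

def redact(segments, classify_segment, authorized):
    """Replace segments the entity may not view with a notification."""
    out = []
    for seg in segments:
        blocked = set(classify_segment(seg)) - set(authorized)
        out.append("[REDACTED: no access rights]" if blocked else seg)
    return " ".join(out)

ITEM = "Lunch is at noon. The patient diagnosis is attached. Revenue fell."
SEGMENTS = split_segments(ITEM)
PCT = category_percentages(SEGMENTS, keyword_classifier)
print(select_action(PCT, 50.0))  # ('restrict', ['medical'])
print(redact(SEGMENTS, keyword_classifier, {"financial"}))
```

In this example the medical segment accounts for 34 of the item's 64 segment characters (about 53%), so a 50% threshold selects the stricter action, while a reader authorized only for financial data sees the medical sentence replaced by the notification.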

12. A system for generating machine learning-based classifiers for use in detecting specific categories of sensitive information, the system comprising:
- an identification module programmed to identify a plurality of specific categories of sensitive information to be protected by a data loss prevention (DLP) system;
  
  a training module programmed to:
  
  obtain a training data set customized for each specific category of sensitive information that comprises a plurality of positive examples of data that fall within the specific category of sensitive information and a plurality of negative examples of data that do not fall within the specific category of sensitive information;
  
  use machine learning to train, based on an analysis of the training data sets, at least one machine learning-based classifier to detect items of data that contain one or more of the plurality of specific categories of sensitive information;
  
  a deployment module programmed to deploy the machine learning-based classifier within the DLP system to enable the DLP system to detect and protect, using the machine learning-based classifier, items of data that contain one or more of the plurality of specific categories of sensitive information by performing at least one DLP action specified by at least one DLP policy of the DLP system, wherein the DLP action is selected based at least in part on whether the item of data comprises a percentage of one or more of the plurality of specific categories of sensitive information that exceeds a predetermined threshold;
  
  at least one hardware processor configured to execute at least one of the identification module, the training module, and the deployment module.
- Dependent claims (13, 14, 15, 16, 17, 18, 19)
  - 13. The system of claim 12, wherein, for each training data set, the negative examples within the training data set comprise the positive examples from all other training data sets.
  - 14. The system of claim 12, wherein the machine learning-based classifier is configured to detect items of data that contain more than one of the plurality of specific categories of sensitive information and, for each item of data that contains more than one of the plurality of specific categories of sensitive information, the machine learning-based classifier is configured to identify at least one of:
    - the specific categories of sensitive information that the item of data contains;
      
      for each specific category of sensitive information that the item of data contains, the percentage of the item of data that comprises that specific category of sensitive information;
      
      for each specific category of sensitive information that the item of data contains, the specific portion of the item of data that contains that specific category of sensitive information.
  - 15. The system of claim 12, wherein the DLP action comprises, for each specific category of sensitive information contained within the item of data, at least one of:
    - restricting access to the item of data to entities that are authorized to access the specific category of sensitive information;
      
      restricting access to the portion of the item of data that contains the specific category of sensitive information to entities that are authorized to access the specific category of sensitive information;
      
      automatically appending a custom disclaimer to the item of data that applies to the category of sensitive information.
  - 16. The system of claim 12, wherein the deployment module deploys the machine learning-based classifier within the DLP system by providing the machine learning-based classifier as part of the DLP policy to at least one of:
    - a DLP agent installed on at least one client device;
      
      a DLP engine installed on at least one server configured to monitor a plurality of client devices.
  - 17. The system of claim 12, further comprising a DLP module programmed to:
    - identify an attempt to access at least one item of data via a data-loss vector;
      
      determine, using the machine learning-based classifier, that the item of data contains one or more of the plurality of specific categories of sensitive information;
      
      protect at least a portion of the item of data in accordance with the DLP policy of the DLP system.
  - 18. The system of claim 12, wherein the training module is configured to use machine learning to train the machine learning-based classifier by, for each training data set:
    - extracting a feature set from the training data set that comprises statistically significant features of the positive examples within the training data set and statistically significant features of the negative examples within the training data set;
      
      building a machine learning-based classification model from the feature set that is capable of indicating whether or not items of data contain the specific category of sensitive information associated with the training data set.
  - 19. The system of claim 18, further comprising at least one of:
    - selecting the features within each feature set using a feature-extraction algorithm;
      
      weighting the features within each feature set using a feature-weighting algorithm.
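
Claims 13, 18, and 19 describe how the training data and feature sets are assembled. A minimal sketch under stated assumptions: negatives for each category are taken from the positives of every other category, features are selected by a document-frequency gap (a crude stand-in for a statistical significance test, which the claims do not specify), and each selected feature is weighted by smoothed log-odds. The function names, the `min_gap` parameter, and the example documents are hypothetical.

```python
import math
from collections import Counter

def build_training_sets(positives_by_category):
    """Negatives for a category are the positives of all other categories."""
    sets = {}
    for cat, pos in positives_by_category.items():
        neg = [doc
               for other, docs in positives_by_category.items()
               if other != cat
               for doc in docs]
        sets[cat] = {"positive": list(pos), "negative": neg}
    return sets

def doc_frequency(docs):
    """Count how many documents contain each token."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def extract_features(positive, negative, min_gap=0.75):
    """Select and weight features; assumes both example lists are non-empty.

    Selection: keep tokens whose document-frequency rate differs between
    positive and negative examples by at least min_gap.
    Weighting: smoothed log-odds, positive for tokens typical of positive
    examples and negative for tokens typical of negative examples.
    """
    pos_df, neg_df = doc_frequency(positive), doc_frequency(negative)
    features = {}
    for tok in set(pos_df) | set(neg_df):
        pos_rate = pos_df[tok] / len(positive)
        neg_rate = neg_df[tok] / len(negative)
        if abs(pos_rate - neg_rate) >= min_gap:
            features[tok] = (
                math.log((pos_df[tok] + 1) / (len(positive) + 2))
                - math.log((neg_df[tok] + 1) / (len(negative) + 2))
            )
    return features

POSITIVES = {
    "financial": ["invoice total due", "invoice for quarterly revenue"],
    "medical": ["patient diagnosis report", "patient blood test report"],
}
SETS = build_training_sets(POSITIVES)
FIN = extract_features(SETS["financial"]["positive"], SETS["financial"]["negative"])
print(sorted(FIN))  # ['invoice', 'patient', 'report']
```

Keeping features that are typical of the negative examples, not only the positive ones, follows the wording of claim 18, which calls for statistically significant features from both sides of each training data set.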

20. A non-transitory computer-readable-storage medium comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to:
- identify a plurality of specific categories of sensitive information to be protected by a data loss prevention (DLP) system;
  
  obtain a training data set customized for each specific category of sensitive information that comprises a plurality of positive examples of data that fall within the specific category of sensitive information and a plurality of negative examples of data that do not fall within the specific category of sensitive information;
  
  use machine learning to train, based on an analysis of the training data sets, at least one machine learning-based classifier to detect items of data that contain one or more of the plurality of specific categories of sensitive information;
  
  deploy the machine learning-based classifier within the DLP system to enable the DLP system to detect and protect, using the machine learning-based classifier, items of data that contain one or more of the plurality of specific categories of sensitive information by performing at least one DLP action specified by at least one policy of the DLP system, wherein the DLP action is selected based at least in part on whether the item of data comprises a specific percentage of one or more of the plurality of specific categories of sensitive information that exceeds a predetermined percentage threshold.

Current Assignee: CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee: Symantec Corporation (NortonLifeLock Inc.)
Inventors: Jaiswal, Sumesh
Primary Examiner(s): CHANG, LI WU
Application Number: US13/191,018
Publication Number: US 20120303558A1
Time in Patent Office: 980 Days
Field of Search: None
US Class Current: 706/12
CPC Class Codes: G06N 20/00 Machine learning
