Systems and methods for classifying documents for data loss prevention

US 9,043,247 B1
Filed: 02/25/2012
Issued: 05/26/2015
Est. Priority Date: 02/25/2012
Status: Active Grant

Abstract

A computer-implemented method for classifying documents for data loss prevention may include 1) identifying training documents for a machine learning classifier configured for data loss prevention, 2) performing a semantic analysis on the training documents to identify topics within the set of training documents, 3) applying a similarity metric to the topics to identify at least one unrelated topic with a similarity to the other topics within the plurality of topics, as determined by the similarity metric, that falls below a similarity threshold, 4) identifying, based on the semantic analysis, at least one irrelevant training document within the set of training documents in which a predominance of the unrelated topic is above a predominance threshold, and 5) excluding the irrelevant training document from the set of training documents based on the predominance of the unrelated topic within the irrelevant training document. Various other methods, systems, and computer-readable media are also disclosed.
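
Steps 3 to 5 of the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes a topic model has already produced a term-weight vector per topic and a topic-share vector per document, and it assumes cosine similarity averaged over the other topics as the similarity metric, which the abstract does not specify. The function names, file names, and threshold values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def unrelated_topics(topic_term_weights, similarity_threshold):
    """Return indices of topics whose mean similarity to the others is below the threshold.

    Assumption: "similarity to the other topics" is taken as the mean pairwise cosine.
    """
    n = len(topic_term_weights)
    flagged = []
    for i in range(n):
        others = [
            cosine(topic_term_weights[i], topic_term_weights[j])
            for j in range(n)
            if j != i
        ]
        if others and sum(others) / len(others) < similarity_threshold:
            flagged.append(i)
    return flagged


def split_training_set(doc_topic_shares, unrelated, predominance_threshold):
    """Partition document ids into (kept, excluded).

    Predominance is read only from the shares of the unrelated topics,
    never from the shares of related topics.
    """
    kept, excluded = [], []
    for doc_id, shares in doc_topic_shares.items():
        predominance = max((shares[t] for t in unrelated), default=0.0)
        if predominance > predominance_threshold:
            excluded.append(doc_id)
        else:
            kept.append(doc_id)
    return kept, excluded


# Hypothetical topic model output: three topics over a four-term vocabulary.
topic_term_weights = [
    [0.5, 0.4, 0.1, 0.0],  # topic 0
    [0.4, 0.5, 0.1, 0.0],  # topic 1, close to topic 0
    [0.0, 0.0, 0.1, 0.9],  # topic 2, shares almost no vocabulary with the others
]
doc_topic_shares = {
    "contract.txt": [0.60, 0.35, 0.05],
    "merger_memo.txt": [0.30, 0.60, 0.10],
    "lunch_menu.txt": [0.05, 0.05, 0.90],
}

flagged = unrelated_topics(topic_term_weights, similarity_threshold=0.3)
kept, excluded = split_training_set(doc_topic_shares, flagged, predominance_threshold=0.5)
print(flagged, kept, excluded)
```

With these made-up inputs, only topic 2 falls below the similarity threshold, so only the document dominated by topic 2 is dropped from the training set.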

20 Claims

1. A computer-implemented method for classifying documents for data loss prevention, at least a portion of the method being performed by a computing device comprising at least one processor, the method comprising:
- identifying a set of prospective training documents for a machine learning classifier that is configured to provide input for data loss prevention determinations, the set of prospective training documents comprising documents regarded as sensitive;
  
  performing a semantic analysis on the set of prospective training documents to identify a plurality of topics within the set of prospective training documents;
  
  applying a similarity metric to the plurality of topics to identify at least one unrelated topic within the plurality of topics with a similarity to the other topics within the plurality of topics, as determined by the similarity metric, that falls below a predetermined similarity threshold, and to thereby determine that the unrelated topic is unrelated to the other topics within the plurality of topics;
  
  identifying, based at least in part on the semantic analysis, at least one irrelevant prospective training document within the set of prospective training documents in which a predominance of the unrelated topic is above a predetermined predominance threshold by determining the predominance of the unrelated topic by the presence of the unrelated topic within the irrelevant prospective training document as identified in the semantic analysis and not by the presence within the irrelevant prospective training document of any related topic within the plurality of topics;
  
  excluding the irrelevant prospective training document from the set of prospective training documents based at least in part on the predominance of the unrelated topic within the irrelevant prospective training document.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The computer-implemented method of claim 1, wherein the semantic analysis comprises a latent Dirichlet allocation applied to the set of prospective training documents.
  - 3. The computer-implemented method of claim 1, wherein excluding the irrelevant prospective training document further comprises:
    - presenting the irrelevant prospective training document to a user;
      
      receiving input from the user to exclude the irrelevant prospective training document from the set of prospective training documents.
  - 4. The computer-implemented method of claim 3, wherein:
    - identifying the irrelevant prospective training document comprises identifying a subset of prospective training documents within the set of prospective training documents in each of which a corresponding predominance of the unrelated topic is above the predetermined predominance threshold;
      
      presenting the irrelevant prospective training document to the user comprises selecting the irrelevant prospective training document as representative of the subset of prospective training documents based at least in part on a representativeness metric applied to the irrelevant prospective training document.
  - 5. The computer-implemented method of claim 1, wherein:
    - identifying the unrelated topic comprises identifying a list of most common terms for each topic within the set of topics;
      
      the similarity metric comprises a number of overlapping terms between lists of most common terms for respective topics within the set of topics.
  - 6. The computer-implemented method of claim 1, further comprising adding a fingerprint of the irrelevant prospective training document to a list of potentially sensitive documents for data loss prevention determinations.
  - 7. The computer-implemented method of claim 1, further comprising performing a keyword extraction on the irrelevant prospective training document for use in data loss prevention determinations.
  - 8. The computer-implemented method of claim 1, further comprising:
    - applying the similarity metric to identify an additional unrelated topic within the plurality of topics;
      
      identifying an additional irrelevant prospective training document based at least in part on the additional unrelated topic;
      
      excluding the additional irrelevant prospective training document from the set of prospective training documents.
  - 9. The computer-implemented method of claim 1, further comprising, after excluding the irrelevant prospective training document from the set of prospective training documents, training the machine learning classifier with the set of prospective training documents.
  - 10. The computer-implemented method of claim 9, further comprising:
    - performing a data loss prevention analysis on a sensitive document based at least in part on the machine learning classifier;
      
      performing a data loss prevention action on the sensitive document based on the data loss prevention analysis.
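
The similarity metric of claim 5 (a count of overlapping terms between the topics' lists of most common terms) can be sketched as follows. This is only an illustration under stated assumptions: the per-topic term weights are taken as given (for example from the latent Dirichlet allocation of claim 2), a topic is scored by its best overlap with any other topic, and the names `top_terms`, `best_overlaps`, `k`, and `min_overlap` are hypothetical.

```python
def top_terms(term_weights, k):
    """Return the k highest-weighted terms of one topic as a set."""
    ranked = sorted(term_weights.items(), key=lambda item: item[1], reverse=True)
    return {term for term, _ in ranked[:k]}


def best_overlaps(topics, k):
    """Map each topic id to its largest top-k term overlap with any other topic."""
    tops = {topic_id: top_terms(weights, k) for topic_id, weights in topics.items()}
    scores = {}
    for topic_id, terms in tops.items():
        overlaps = [
            len(terms & other_terms)
            for other_id, other_terms in tops.items()
            if other_id != topic_id
        ]
        scores[topic_id] = max(overlaps, default=0)
    return scores


def unrelated_by_overlap(topics, k, min_overlap):
    """Flag topics whose best overlap count falls below min_overlap."""
    scores = best_overlaps(topics, k)
    return [topic_id for topic_id, score in scores.items() if score < min_overlap]


# Hypothetical term weights for three topics.
topics = {
    "t0": {"revenue": 0.30, "merger": 0.25, "forecast": 0.20, "board": 0.10},
    "t1": {"merger": 0.30, "board": 0.25, "revenue": 0.20, "audit": 0.10},
    "t2": {"pizza": 0.40, "lunch": 0.30, "salad": 0.20, "board": 0.05},
}

scores = best_overlaps(topics, k=3)
unrelated = unrelated_by_overlap(topics, k=3, min_overlap=1)
print(scores, unrelated)
```

Taking the best overlap rather than an average is a design choice made for this sketch: a topic is treated as unrelated only when it shares too few common terms with every other topic.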

11. A system for classifying documents for data loss prevention, the system comprising:
- an identification module programmed to identify a set of prospective training documents for a machine learning classifier that is configured to provide input for data loss prevention determinations, the set of prospective training documents comprising documents regarded as sensitive;
  
  an analysis module programmed to perform a semantic analysis on the set of prospective training documents to identify a plurality of topics within the set of prospective training documents;
  
  a similarity module programmed to apply a similarity metric to the plurality of topics to identify at least one unrelated topic within the plurality of topics with a similarity to the other topics within the plurality of topics, as determined by the similarity metric, that falls below a predetermined similarity threshold, and to thereby determine that the unrelated topic is unrelated to the other topics within the plurality of topics;
  
  a predominance module programmed to identify, based at least in part on the semantic analysis, at least one irrelevant prospective training document within the set of prospective training documents in which a predominance of the unrelated topic is above a predetermined predominance threshold by determining the predominance of the unrelated topic by the presence of the unrelated topic within the irrelevant prospective training document as identified in the semantic analysis and not by the presence within the irrelevant prospective training document of any related topic within the plurality of topics;
  
  an exclusion module programmed to exclude the irrelevant prospective training document from the set of prospective training documents based at least in part on the predominance of the unrelated topic within the irrelevant prospective training document;
  
  at least one processor configured to execute the identification module, the analysis module, the similarity module, the predominance module, and the exclusion module.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19):
  - 12. The system of claim 11, wherein the semantic analysis comprises a latent Dirichlet allocation applied to the set of prospective training documents.
  - 13. The system of claim 11, wherein the exclusion module is further programmed to exclude the irrelevant prospective training document by:
    - presenting the irrelevant prospective training document to a user;
      
      receiving input from the user to exclude the irrelevant prospective training document from the set of prospective training documents.
  - 14. The system of claim 13, wherein:
    - the predominance module is programmed to identify the irrelevant prospective training document by identifying a subset of prospective training documents within the set of prospective training documents in each of which a corresponding predominance of the unrelated topic is above the predetermined predominance threshold;
      
      the exclusion module is programmed to present the irrelevant prospective training document to the user by selecting the irrelevant prospective training document as representative of the subset of prospective training documents based at least in part on a representativeness metric applied to the irrelevant prospective training document.
  - 15. The system of claim 11, wherein:
    - the similarity module is programmed to identify the unrelated topic by identifying a list of most common terms for each topic within the set of topics;
      
      the similarity metric comprises a number of overlapping terms between lists of most common terms for respective topics within the set of topics.
  - 16. The system of claim 11, wherein the exclusion module is further programmed to add a fingerprint of the irrelevant prospective training document to a list of potentially sensitive documents for data loss prevention determinations.
  - 17. The system of claim 11, wherein the exclusion module is further programmed to perform a keyword extraction on the irrelevant prospective training document for use in data loss prevention determinations.
  - 18. The system of claim 11, wherein:
    - the similarity module is further programmed to apply the similarity metric to identify an additional unrelated topic within the plurality of topics;
      
      the predominance module is further programmed to identify an additional irrelevant prospective training document based at least in part on the additional unrelated topic;
      
      the exclusion module is further programmed to exclude the additional irrelevant prospective training document from the set of prospective training documents.
  - 19. The system of claim 11, wherein the exclusion module is further programmed to, after excluding the irrelevant prospective training document from the set of prospective training documents, train the machine learning classifier with the set of prospective training documents.
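
The user-review path of claims 13 and 14 and the fingerprinting of claim 16 can be pictured with the sketch below. Several details are assumptions, since the claims leave them open: the representativeness metric is taken to be the highest share of the unrelated topic, a SHA-256 hash of whitespace-normalized, lowercased text stands in for the fingerprint, and approval of the representative is taken to exclude the whole subset. All function and parameter names are hypothetical.

```python
import hashlib


def pick_representative(doc_topic_shares, unrelated_topic, predominance_threshold):
    """Return (representative_id, subset_ids) for documents dominated by the unrelated topic."""
    subset = [
        doc_id
        for doc_id, shares in doc_topic_shares.items()
        if shares[unrelated_topic] > predominance_threshold
    ]
    if not subset:
        return None, []
    # Assumed representativeness metric: the strongest share of the unrelated topic.
    representative = max(subset, key=lambda d: doc_topic_shares[d][unrelated_topic])
    return representative, subset


def fingerprint(text):
    """Assumed fingerprint: SHA-256 over lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def review_and_exclude(doc_topic_shares, texts, unrelated_topic, threshold, user_confirms):
    """Show one representative to the user; on approval, exclude the subset.

    Returns (kept_ids, fingerprints_of_excluded_documents).
    """
    representative, subset = pick_representative(doc_topic_shares, unrelated_topic, threshold)
    if representative is None or not user_confirms(representative):
        return list(doc_topic_shares), []
    kept = [doc_id for doc_id in doc_topic_shares if doc_id not in subset]
    return kept, [fingerprint(texts[doc_id]) for doc_id in subset]


# Hypothetical data: topic 1 is the unrelated topic.
shares = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.1, 0.9]}
texts = {"a": "Quarterly forecast", "b": "Lunch menu", "c": "Pizza  order FORM"}

kept, prints = review_and_exclude(shares, texts, 1, 0.5, user_confirms=lambda doc_id: True)
print(kept, len(prints))
```

The `user_confirms` callback stands in for whatever interface presents the document and collects the user's decision. Keeping the fingerprints means the excluded documents can still be matched exactly in later data loss prevention determinations even though they no longer train the classifier.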

20. A non-transitory computer-readable-storage medium comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to:
- identify a set of prospective training documents for a machine learning classifier that is configured to provide input for data loss prevention determinations, the set of prospective training documents comprising documents regarded as sensitive;
  
  perform a semantic analysis on the set of prospective training documents to identify a plurality of topics within the set of prospective training documents;
  
  apply a similarity metric to the plurality of topics to identify at least one unrelated topic within the plurality of topics with a similarity to the other topics within the plurality of topics, as determined by the similarity metric, that falls below a predetermined similarity threshold, and to thereby determine that the unrelated topic is unrelated to the other topics within the plurality of topics;
  
  identify, based at least in part on the semantic analysis, at least one irrelevant prospective training document within the set of prospective training documents in which a predominance of the unrelated topic is above a predetermined predominance threshold by determining the predominance of the unrelated topic by the presence of the unrelated topic within the irrelevant prospective training document as identified in the semantic analysis and not by the presence within the irrelevant prospective training document of any related topic within the plurality of topics;
  
  exclude the irrelevant prospective training document from the set of prospective training documents based at least in part on the predominance of the unrelated topic within the irrelevant prospective training document.

Current Assignee: CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee: Symantec Corporation (NortonLifeLock Inc.)
Inventors: Hart, Michael; DiCorpo, Phillip; Tayal, Kushal
Primary Examiner(s): Chaki, Kakali
Assistant Examiner(s): Nilsson, Eric
Application Number: US13/405,293
Time in Patent Office: 1,186 Days
Field of Search: None
US Class Current: 706/12
CPC Class Codes:
- G06F 16/00   Information retrieval; Data...
- G06F 16/35   Clustering; Classification
- G06N 20/00   Machine learning
- G06N 5/02   Knowledge representation; S...
- G06N 7/01   Probabilistic graphical mod...
- G06V 30/40   Document-oriented image-bas...
- G06V 30/418   Document matching, e.g. of ...
