Filtering training data for machine learning

US 7,690,037 B1
Filed: 07/13/2005
Issued: 03/30/2010
Est. Priority Date: 07/13/2005
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Data center activity traces form a corpus used for machine learning. The data in the corpus are putatively normal but may be tainted with latent anomalies. There is a statistical likelihood that the corpus represents predominately legitimate activity, and this likelihood is exploited to allow for a targeted examination of only the data representing possible anomalous activity. The corpus is separated into clusters having members with like features. The clusters having the fewest members are identified, as these clusters represent potential anomalous activities. These clusters are evaluated to determine whether they represent actual anomalous activities. The data from the clusters representing actual anomalous activities are excluded from the corpus. As a result, the machine learning is more effective and the trained system provides better performance, since latent anomalies are not mistaken for normal activity.
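
The separation step the abstract describes can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_clusters`, the `feature_key` callback, the `min_members` threshold, and the sample records are all assumptions made for the example. Records are grouped by one feature (here the source of the data), the groups are ranked by member count, and a threshold splits them into likely legitimate and possibly anomalous clusters. Evaluating the small clusters is left out.

```python
from collections import defaultdict


def split_clusters(records, feature_key, min_members):
    """Cluster records by a feature, rank by size, and split on a threshold."""
    clusters = defaultdict(list)
    for record in records:
        clusters[feature_key(record)].append(record)

    # Rank clusters from most members to fewest (sorted() is stable).
    ranked = sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)

    # Hypothetical threshold: clusters below min_members are flagged.
    likely_legitimate = [(k, m) for k, m in ranked if len(m) >= min_members]
    possibly_anomalous = [(k, m) for k, m in ranked if len(m) < min_members]
    return likely_legitimate, possibly_anomalous


# Made-up (source, query) activity records, for illustration only.
corpus = [
    ("app-server", "SELECT name FROM users WHERE id = 7"),
    ("app-server", "SELECT name FROM users WHERE id = 9"),
    ("app-server", "SELECT name FROM users WHERE id = 12"),
    ("batch-job", "DELETE FROM sessions WHERE expired = 1"),
    ("batch-job", "DELETE FROM sessions WHERE expired = 1"),
    ("unknown-host", "SELECT * FROM users"),
]

legit, suspect = split_clusters(corpus, feature_key=lambda r: r[0], min_members=2)
legit_keys = [key for key, _ in legit]
suspect_keys = [key for key, _ in suspect]

# Corpus with the flagged clusters removed.
filtered = [record for _, members in legit for record in members]
print(legit_keys, suspect_keys, len(filtered))
```

Clustering on a single exact-match key keeps the sketch short; the text allows clusters based on several features at once, which would only change what `feature_key` returns.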

116 Citations

10 Claims

1. A method of generating a corpus for training a computerized security system, the security system for monitoring a data center to detect anomalous activity, comprising:
- collecting by a processor a corpus containing data describing data center activities;
- generating clusters from the corpus, each cluster containing data describing data center activities having like features and containing a number of members corresponding to a number of occurrences of the data center activities having like features in the corpus, wherein the clusters are based on one or more features selected from the set consisting of:
  - a source of the data;
  - a date or time of the data;
  - a structure of the data;
  - content of the data; and
  - an output produced by the data center responsive to the data;
- identifying clusters possibly representing anomalous activities, wherein identifying the clusters comprises:
  - ranking the clusters by number of members in the clusters; and
  - applying a threshold to the ranked clusters, the threshold distinguishing between clusters possibly representing anomalous activities and clusters likely to represent legitimate activities;
- removing the data contained in the clusters possibly representing anomalous activities from the corpus; and
- transforming the corpus from which the data contained in the clusters possibly representing anomalous activities were removed into training data for the security system, the training data including a set of query templates to classify incoming queries, wherein the incoming queries are compared with the set of query templates in the security system.

2. The method of claim 1, wherein identifying the clusters further comprises:
- examining the clusters possibly representing anomalous activities to determine whether the clusters actually represent anomalous activities;
- wherein the data contained in the clusters actually representing anomalous activities are removed from the corpus.

3. The method of claim 1, wherein the data center includes a database, and wherein collecting a corpus comprises:
- collecting queries sent to the database.
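
The final step of claim 1, turning the filtered corpus into query templates and comparing incoming queries against them, might look like the sketch below. The claims do not say how a template is derived, so the approach here (masking string and numeric literals with `?` and normalizing case and whitespace) is an assumption, as are the names `to_template`, `build_templates`, and `matches_template` and the sample queries.

```python
import re

# Assumed notion of a "template": a query with its literals masked out.
# Matches single-quoted SQL strings (with '' escapes) or standalone numbers.
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def to_template(query: str) -> str:
    """Collapse whitespace, mask literals with '?', and lowercase the query."""
    collapsed = " ".join(query.split())
    return _LITERAL.sub("?", collapsed).lower()


def build_templates(queries):
    """Transform a filtered corpus of queries into a set of templates."""
    return {to_template(q) for q in queries}


def matches_template(query: str, templates) -> bool:
    """Compare an incoming query with the template set."""
    return to_template(query) in templates


# Made-up queries standing in for the corpus after filtering.
training_queries = [
    "SELECT name FROM users WHERE id = 7",
    "SELECT name FROM users WHERE id = 42",
    "UPDATE users SET email = 'a@example.com' WHERE id = 3",
]
templates = build_templates(training_queries)

known = matches_template("select name  from users where id = 1001", templates)
unknown = matches_template("SELECT name FROM users WHERE id = 1 OR 1 = 1", templates)
print(sorted(templates), known, unknown)
```

A query that differs from the training data only in its literal values maps to an existing template, while one with a different structure (the added `OR 1 = 1` clause) does not.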

4. A system for generating a corpus for training a security system to detect anomalous activity at a data center, comprising:
- a computer-readable storage medium having executable computer program instructions recorded thereon comprising:
  - a data collection module adapted to collect a corpus containing data describing data center activities;
  - a clustering module adapted to generate clusters from the corpus, each cluster containing data describing data center activities having like features and containing a number of members corresponding to a number of occurrences of the data center activities having like features in the corpus, the clustering module adapted to cluster based on one or more features selected from the set consisting of:
    - a source of the data;
    - a date or time of the data;
    - a structure of the data;
    - content of the data; and
    - an output produced by the data center responsive to the data;
  - a filtering module adapted to rank the clusters by number of members in the clusters, apply a threshold to the ranked clusters, the threshold distinguishing between clusters possibly representing anomalous activities and clusters likely to represent legitimate activities, and remove the data contained in the clusters possibly representing anomalous activities from the corpus; and
  - a transformation module adapted to transform the corpus from which the data contained in the clusters possibly representing anomalous activities were removed into training data for the security system, the training data including a set of query templates to classify incoming queries, wherein the incoming queries are compared with the set of query templates in the security system; and
- a computer processor adapted to execute the computer program instructions recorded on the computer-readable storage medium.

5. The system of claim 4, further comprising:
- an interface module adapted to provide an interface for examining the clusters possibly representing anomalous activities to determine whether the clusters actually represent anomalous activities;
- wherein the filtering module is adapted to remove the data contained in the clusters actually representing anomalous activities from the corpus.

6. The system of claim 4, wherein the data center includes a database, and wherein the data collection module is further adapted to collect queries sent to the database.
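
The examination step in claims 2 and 5 can be pictured as a review hook placed between flagging and removal. The sketch below is illustrative only: `filter_confirmed`, the `confirm` callback (standing in for whatever a reviewer decides through the interface), and the sample clusters are hypothetical names and data, not anything the claims specify.

```python
def filter_confirmed(clusters, suspect_keys, confirm):
    """Remove only those suspect clusters that the reviewer confirms.

    clusters:      dict mapping a cluster key to its list of members
    suspect_keys:  keys of clusters flagged as possibly anomalous
    confirm:       callable(key, members) -> True if actually anomalous
    """
    confirmed = {key for key in suspect_keys if confirm(key, clusters[key])}
    corpus = [
        member
        for key, members in clusters.items()
        if key not in confirmed
        for member in members
    ]
    return corpus, confirmed


# Made-up clusters keyed by source; two small ones were flagged for review.
clusters = {
    "app-server": ["q1", "q2", "q3"],
    "admin-console": ["q4"],
    "unknown-host": ["q5"],
}
flagged = ["admin-console", "unknown-host"]

# Stand-in for a human decision: only the unknown host is judged anomalous.
corpus, removed = filter_confirmed(
    clusters, flagged, confirm=lambda key, members: key == "unknown-host"
)
print(corpus, removed)
```

Here the rare but legitimate `admin-console` activity stays in the corpus, which is the difference between removing every flagged cluster and removing only the confirmed ones.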

7. A computer-readable storage medium having executable computer program instructions recorded thereon for generating a corpus for training a security system to detect anomalous activity at a data center, comprising:
- a data collection module adapted to collect a corpus containing data describing data center activities;
- a clustering module adapted to generate clusters from the corpus, each cluster containing data describing data center activities having like features and containing a number of members corresponding to a number of occurrences of the data center activities having like features in the corpus, the clustering module adapted to cluster the data based on one or more features selected from the set consisting of:
  - a source of the data;
  - a date or time of the data;
  - a structure of the data;
  - content of the data; and
  - an output produced by the data center responsive to the data;
- a filtering module adapted to rank the clusters by number of members in the clusters, apply a threshold to the ranked clusters, the threshold distinguishing between clusters possibly representing anomalous activities and clusters likely to represent legitimate activities, and remove the data contained in the clusters possibly representing anomalous activities from the corpus; and
- a transformation module adapted to transform the corpus from which the data contained in the clusters possibly representing anomalous activities were removed into training data for the security system, the training data including a set of query templates to classify incoming queries, wherein the incoming queries are compared with the set of query templates in the security system.

8. The computer-readable medium of claim 7, further comprising:
- an interface module adapted to provide an interface for examining the clusters possibly representing anomalous activities to determine whether the clusters actually represent anomalous activities;
- wherein the filtering module is adapted to remove the data contained in the clusters actually representing anomalous activities from the corpus.

9. The computer-readable medium of claim 7, wherein the data center includes a database, and wherein the data collection module is further adapted to collect queries sent to the database.

10. A system for generating a corpus for training a security system to detect anomalous activity at a data center, comprising:
- a computer-readable storage medium having executable computer program instructions recorded thereon comprising:
  - means for collecting a corpus containing data describing data center activities;
  - means for generating clusters from the corpus, each cluster containing data describing data center activities having like features and containing a number of members corresponding to a number of occurrences of the data center activities having like features in the corpus, the means for generating clusters comprising means for clustering the data based on one or more features selected from the set consisting of:
    - a source of the data;
    - a date or time of the data;
    - a structure of the data;
    - content of the data; and
    - an output produced by the data center responsive to the data;
  - means for ranking the clusters by number of members in the clusters, applying a threshold to the ranked clusters, the threshold distinguishing between clusters possibly representing anomalous activities and clusters likely to represent legitimate activities, and removing the data contained in the clusters possibly representing anomalous activities from the corpus; and
  - means for transforming the corpus from which the data contained in the clusters possibly representing anomalous activities were removed into training data for the security system, the training data including a set of query templates to classify incoming queries, wherein the incoming queries are compared with the set of query templates in the security system.

Current Assignee: CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee: Symantec Corporation (NortonLifeLock Inc.)
Inventors: Hartmann, Alfred C.
Primary Examiner(s): Moazzami; Nasser
Assistant Examiner(s): Reza; Mohammad W
Application Number: US11/181,221
Time in Patent Office: 1,721 Days
Field of Search: 707/6, 726/22-25
US Class Current: 726/23
CPC Class Codes:
G06F 21/552   involving long-term monitor...
H04L 63/1416   Event detection, e.g. attac...
H04L 63/1433   Vulnerability analysis
