Filtering training data for machine learning
First Claim
1. A method of generating a corpus for training a computerized security system, the security system for monitoring a data center to detect anomalous activity, comprising:
collecting by a processor a corpus containing data describing data center activities;
generating clusters from the corpus, each cluster containing data describing data center activities having like features and containing a number of members corresponding to a number of occurrences of the data center activities having like features in the corpus, wherein the clusters are based on one or more features selected from the set consisting of:
a source of the data;
a date or time of the data;
a structure of the data;
content of the data; and
an output produced by the data center responsive to the data;
identifying clusters possibly representing anomalous activities, wherein identifying the clusters comprises:
ranking the clusters by number of members in the clusters; and
applying a threshold to the ranked clusters, the threshold distinguishing between clusters possibly representing anomalous activities and clusters likely to represent legitimate activities;
removing the data contained in the clusters possibly representing anomalous activities from the corpus; and
transforming the corpus from which the data contained in the clusters possibly representing anomalous activities were removed into training data for the security system, the training data including a set of query templates to classify incoming queries, wherein the incoming queries are compared with the set of query templates in the security system.
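The clustering, ranking, thresholding, and removal steps recited above can be illustrated with a minimal sketch. It assumes each corpus record is a dictionary and that "like features" means an identical feature key; the names `filter_corpus`, `feature_key`, and `min_members` are hypothetical and do not come from the claim, which also does not say whether the threshold is a member count or a rank position.

```python
from collections import defaultdict
from typing import Callable, Hashable


def filter_corpus(records: list, feature_key: Callable[[dict], Hashable],
                  min_members: int) -> tuple:
    """Cluster records by feature key, rank clusters by size, and drop small ones.

    Assumption: the threshold is a minimum member count. Clusters below it are
    treated as possibly anomalous and removed from the corpus.
    """
    # Generate clusters: records with the same feature key share a cluster.
    clusters = defaultdict(list)
    for record in records:
        clusters[feature_key(record)].append(record)

    # Rank the clusters by number of members, largest first.
    ranked = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)

    # Apply the threshold to the ranked clusters.
    kept, removed = [], []
    for _key, members in ranked:
        if len(members) >= min_members:
            kept.extend(members)      # likely legitimate activity
        else:
            removed.extend(members)   # possibly anomalous activity
    return kept, removed
```

For example, clustering on a `(source, structure)` pair with `min_members=2` would keep five identical application queries and remove a one-off query from a different source.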
Abstract
Data center activity traces form a corpus used for machine learning. The data in the corpus are putatively normal but may be tainted with latent anomalies. There is a statistical likelihood that the corpus represents predominately legitimate activity, and this likelihood is exploited to allow for a targeted examination of only the data representing possible anomalous activity. The corpus is separated into clusters having members with like features. The clusters having the fewest members are identified, as these clusters represent potential anomalous activities. These clusters are evaluated to determine whether they represent actual anomalous activities. The data from the clusters representing actual anomalous activities are excluded from the corpus. As a result, the machine learning is more effective and the trained system provides better performance, since latent anomalies are not mistaken for normal activity.
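Unlike the first claim, the abstract describes evaluating the smallest clusters before excluding them. The following sketch illustrates that variant under stated assumptions: the abstract does not say how the evaluation is performed, so a caller-supplied `is_anomalous` callback stands in for a human reviewer or a rule set. The names `exclude_confirmed_anomalies`, `feature_key`, and `max_suspect_size` are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable


def exclude_confirmed_anomalies(records: list,
                                feature_key: Callable[[object], Hashable],
                                max_suspect_size: int,
                                is_anomalous: Callable[[Hashable], bool]) -> tuple:
    """Review only the smallest clusters and exclude those confirmed anomalous."""
    # Count members per cluster, where a cluster is identified by its feature key.
    sizes = Counter(feature_key(r) for r in records)

    # Only clusters with few members are examined at all.
    suspects = {key for key, n in sizes.items() if n <= max_suspect_size}

    # Assumption: the evaluation is delegated to the is_anomalous callback.
    confirmed = {key for key in suspects if is_anomalous(key)}

    # Exclude data from confirmed clusters; cleared suspects stay in the corpus.
    cleaned = [r for r in records if feature_key(r) not in confirmed]
    return cleaned, confirmed
```

Because only the small clusters reach the callback, the review effort scales with the number of rare activity patterns rather than with the size of the corpus.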
10 Claims
4. A system for generating a corpus for training a security system to detect anomalous activity at a data center, comprising:
a computer-readable storage medium having executable computer program instructions recorded thereon comprising:
a data collection module adapted to collect a corpus containing data describing data center activities;
a clustering module adapted to generate clusters from the corpus, each cluster containing data describing data center activities having like features and containing a number of members corresponding to a number of occurrences of the data center activities having like features in the corpus, the clustering module adapted to cluster based on one or more features selected from the set consisting of:
a source of the data;
a date or time of the data;
a structure of the data;
content of the data; and
an output produced by the data center responsive to the data;
a filtering module adapted to rank the clusters by number of members in the clusters, apply a threshold to the ranked clusters, the threshold distinguishing between clusters possibly representing anomalous activities and clusters likely to represent legitimate activities, and remove the data contained in the clusters possibly representing anomalous activities from the corpus; and
a transformation module adapted to transform the corpus from which the data contained in the clusters possibly representing anomalous activities were removed into training data for the security system, the training data including a set of query templates to classify incoming queries, wherein the incoming queries are compared with the set of query templates in the security system; and
a computer processor adapted to execute the computer program instructions recorded on the computer-readable storage medium.
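The transformation step, which turns the filtered corpus into query templates and compares incoming queries against them, might look like the sketch below. The claims do not define what a query template is, so this assumes the queries are SQL-like strings and that a template is the query with its literal values replaced by placeholders. The functions `to_template`, `build_templates`, and `classify` are hypothetical names.

```python
import re

# Assumption: literals are single-quoted strings or plain decimal numbers.
_STRING = re.compile(r"'(?:[^']|'')*'")
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_SPACE = re.compile(r"\s+")


def to_template(query: str) -> str:
    """Reduce a query to its structure by masking literal values."""
    masked = _STRING.sub("?", query)      # mask string literals first
    masked = _NUMBER.sub("?", masked)     # then mask numeric literals
    return _SPACE.sub(" ", masked).strip().lower()


def build_templates(filtered_corpus: list) -> set:
    """Transform the filtered corpus into a set of query templates."""
    return {to_template(q) for q in filtered_corpus}


def classify(query: str, templates: set) -> str:
    """Compare an incoming query with the templates learned from the corpus."""
    return "known" if to_template(query) in templates else "unknown"
```

Under these assumptions, two training queries that differ only in a literal value collapse into one template, and an incoming query with a structure never seen in the filtered corpus is classified as `"unknown"`.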
7. A computer-readable storage medium having executable computer program instructions recorded thereon for generating a corpus for training a security system to detect anomalous activity at a data center, comprising:
a data collection module adapted to collect a corpus containing data describing data center activities;
a clustering module adapted to generate clusters from the corpus, each cluster containing data describing data center activities having like features and containing a number of members corresponding to a number of occurrences of the data center activities having like features in the corpus, the clustering module adapted to cluster the data based on one or more features selected from the set consisting of:
a source of the data;
a date or time of the data;
a structure of the data;
content of the data; and
an output produced by the data center responsive to the data;
a filtering module adapted to rank the clusters by number of members in the clusters, apply a threshold to the ranked clusters, the threshold distinguishing between clusters possibly representing anomalous activities and clusters likely to represent legitimate activities, and remove the data contained in the clusters possibly representing anomalous activities from the corpus; and
a transformation module adapted to transform the corpus from which the data contained in the clusters possibly representing anomalous activities were removed into training data for the security system, the training data including a set of query templates to classify incoming queries, wherein the incoming queries are compared with the set of query templates in the security system.
10. A system for generating a corpus for training a security system to detect anomalous activity at a data center, comprising:
a computer-readable storage medium having executable computer program instructions recorded thereon comprising:
means for collecting a corpus containing data describing data center activities;
means for generating clusters from the corpus, each cluster containing data describing data center activities having like features and containing a number of members corresponding to a number of occurrences of the data center activities having like features in the corpus, the means for generating clusters comprising means for clustering the data based on one or more features selected from the set consisting of:
a source of the data;
a date or time of the data;
a structure of the data;
content of the data; and
an output produced by the data center responsive to the data;
means for ranking the clusters by number of members in the clusters, applying a threshold to the ranked clusters, the threshold distinguishing between clusters possibly representing anomalous activities and clusters likely to represent legitimate activities, and removing the data contained in the clusters possibly representing anomalous activities from the corpus; and
means for transforming the corpus from which the data contained in the clusters possibly representing anomalous activities were removed into training data for the security system, the training data including a set of query templates to classify incoming queries, wherein the incoming queries are compared with the set of query templates in the security system.
Specification