
Filtering training data for machine learning

  • US 7,690,037 B1
  • Filed: 07/13/2005
  • Issued: 03/30/2010
  • Est. Priority Date: 07/13/2005
  • Status: Active Grant
First Claim

1. A method of generating a corpus for training a computerized security system, the security system for monitoring a data center to detect anomalous activity, comprising:

  • collecting by a processor a corpus containing data describing data center activities;

    generating clusters from the corpus, each cluster containing data describing data center activities having like features and containing a number of members corresponding to a number of occurrences of the data center activities having like features in the corpus, wherein the clusters are based on one or more features selected from the set consisting of:

    a source of the data;

    a date or time of the data;

    a structure of the data;

    content of the data; and

    an output produced by the data center responsive to the data;

    identifying clusters possibly representing anomalous activities, wherein identifying the clusters comprises:

    ranking the clusters by number of members in the clusters; and

    applying a threshold to the ranked clusters, the threshold distinguishing between clusters possibly representing anomalous activities and clusters likely to represent legitimate activities;

    removing the data contained in the clusters possibly representing anomalous activities from the corpus; and

    transforming the corpus from which the data contained in the clusters possibly representing anomalous activities were removed into training data for the security system, the training data including a set of query templates to classify incoming queries, wherein the incoming queries are compared with the set of query templates in the security system.
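
The steps recited in claim 1 can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the record layout (a dict with a "query" key), the helper names feature_key, filter_corpus and matches_template, the use of query structure as the single clustering feature, and the minimum-member-count threshold are all hypothetical choices made for the example.

```python
import re
from collections import defaultdict


def feature_key(record):
    """Hypothetical clustering feature: the query's structure with literals masked."""
    q = record["query"]
    q = re.sub(r"'[^']*'", "?", q)   # mask quoted string literals
    q = re.sub(r"\b\d+\b", "?", q)   # mask standalone numeric literals
    return " ".join(q.lower().split())  # normalize case and whitespace


def filter_corpus(corpus, min_members):
    """Cluster, rank by size, apply a threshold, and drop small clusters.

    Returns the filtered corpus and a set of query templates derived from it.
    The threshold here (an assumption) is a minimum number of cluster members.
    """
    # Generate clusters: one cluster per distinct feature key.
    clusters = defaultdict(list)
    for record in corpus:
        clusters[feature_key(record)].append(record)

    # Rank the clusters by number of members, largest first.
    ranked = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)

    # Apply the threshold: small clusters are treated as possibly anomalous
    # and their data is removed from the corpus.
    kept = [(key, members) for key, members in ranked if len(members) >= min_members]

    # Transform what remains into training data: here, a set of query templates.
    templates = {key for key, _ in kept}
    filtered = [record for _, members in kept for record in members]
    return filtered, templates


def matches_template(query, templates):
    """Compare an incoming query against the learned templates."""
    return feature_key({"query": query}) in templates
```

With a corpus in which a query shape such as SELECT name FROM users WHERE id = <number> occurs many times and a tautology-style query occurs once, the rare cluster falls below the threshold and is removed, so only the frequent shape becomes a template. How the threshold is chosen is left open here; a fixed count is just the simplest option to show.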
