Adaptive statistical regression and classification of data strings, with application to the generic detection of computer viruses
Abstract
A data string is a sequence of atomic units of data that represent information. In the context of computer data, examples of data strings include executable programs, data files, and boot records consisting of sequences of bytes, or text files consisting of sequences of bytes or characters. The invention solves the problem of automatically constructing a classifier of data strings, i.e., a classifier which, given a string, determines which of two or more class labels should be assigned to it. From a set of (string, class-label) pairs, the invention provides an automated technique for extracting features of data strings that are relevant to the classification decision, and an automated technique for developing a classifier which uses those features to correctly classify the data strings in the original examples and, with high accuracy, novel data strings not contained in the example set. The classifier is developed using "adaptive" or "learning" techniques from the domain of statistical regression and classification, such as multi-layer neural networks. As an example, the technique can be applied to the task of distinguishing files or boot records that are infected by computer viruses from files or boot records that are not infected.
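To make the classifier's input concrete, the following minimal sketch turns a data string into a vector of feature occurrence counts. It assumes features are short byte sequences and that overlapping occurrences are counted; the names `count_occurrences`, `feature_vector`, and `FEATURES`, and the byte values themselves, are hypothetical illustrations and are not taken from the patent.

```python
def count_occurrences(data: bytes, feature: bytes) -> int:
    """Count occurrences of `feature` in `data`, allowing overlaps."""
    if not feature:
        return 0
    count, start = 0, 0
    while True:
        idx = data.find(feature, start)
        if idx == -1:
            return count
        count += 1
        start = idx + 1  # step by one byte so overlapping matches are found


def feature_vector(data: bytes, features: list) -> list:
    """Occurrence count of each feature, in the order given."""
    return [count_occurrences(data, f) for f in features]


# Hypothetical feature set and input string (illustrative bytes only).
FEATURES = [b"\xcd\x13", b"\xb8\x01\x02"]
sample = b"\x90\xcd\x13\xb8\x01\x02\xcd\x13"
vec = feature_vector(sample, FEATURES)
print(vec)
```

A vector like this, or some function of it, is what a trained classifier such as a small neural network would take as input before producing a class label.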
15 Claims
1. A method of constructing a classifier of data strings, comprising:
defining a set of classes, wherein one of the classes is a default class;
providing a labelled set of exemplars from a plurality of the classes;
defining a set of features, based on the exemplars, that are statistically likely to be relevant to the classification;
developing a classifier that uses the occurrence frequency of the features in an input string to classify that string, augmenting the number of exemplars in the default class with additional exemplars chosen from outside the plurality of classes.
Dependent claims: 2, 3, 4, 5, 6
7. A method of constructing a classifier of data strings, comprising:
providing a labelled set of exemplars from a plurality of classes;
augmenting the number of exemplars in the default class with additional exemplars chosen from outside the plurality of classes, wherein the additional exemplars resemble the default exemplars;
defining a set of features, based on the exemplars, that are statistically likely to be relevant to the classification, the features being contiguous data sequences possessing lengths within a specified range, and which are common in one class and uncommon in other classes, wherein defining a set of features comprises, for each class:
forming a minimal list of all features which appear among the exemplars of that class with a relative frequency greater than a specified threshold; and
eliminating from the list every feature for which the value of a specified monotonically non-decreasing multi-variate function of the relative frequencies of that feature within each of the other classes is greater than a specified threshold;
selecting from the remaining features in the list for each class a set of one or more features such that each exemplar in the class contains at least a number n_cover of the features in the set;
selecting from the remaining features in the list for each class a set of one or more features having orthogonal patterns of occurrence among the exemplars of that class, such that each exemplar in that class contains at least n_cover of the features in the set;
developing a classifier that uses the occurrence frequency of the features in an input string to classify that string, wherein developing the classifier comprises training the classifier on the exemplars to take as input a function of the occurrence frequency of each of the features, and to produce as output a class label.
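The feature-definition steps of claim 7 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented procedure: relative frequency is taken to mean the fraction of a class's exemplars that contain a sequence, `max` stands in for the "monotonically non-decreasing multi-variate function", and a simple greedy pass stands in for the selection step. The "minimal list" and "orthogonal patterns of occurrence" refinements are not implemented, and all function names, thresholds, and sample strings are hypothetical.

```python
from collections import Counter


def ngrams(data: bytes, lo: int, hi: int) -> set:
    """All distinct contiguous byte sequences of length lo..hi in data."""
    return {data[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(data) - n + 1)}


def relative_freq(exemplars, lo, hi):
    """Map each sequence to the fraction of exemplars that contain it."""
    counts = Counter()
    for ex in exemplars:
        counts.update(ngrams(ex, lo, hi))  # one count per exemplar
    return {g: c / len(exemplars) for g, c in counts.items()}


def candidate_features(by_class, target, lo, hi, t_in, t_out, combine=max):
    """Keep sequences common in `target` and uncommon in the other classes."""
    freqs = {label: relative_freq(exs, lo, hi)
             for label, exs in by_class.items()}
    keep = []
    for g, f in freqs[target].items():
        if f <= t_in:
            continue  # not common enough within the target class
        others = [freqs[l].get(g, 0.0) for l in by_class if l != target]
        if others and combine(others) > t_out:
            continue  # too common in at least one other class
        keep.append(g)
    return sorted(keep)


def greedy_cover(exemplars, candidates, n_cover):
    """Greedily pick features until each exemplar contains n_cover of them
    (or no remaining candidate helps). An assumed heuristic."""
    need = [n_cover] * len(exemplars)
    remaining = list(candidates)
    chosen = []

    def gain(g):
        return sum(1 for ex, n in zip(exemplars, need) if n > 0 and g in ex)

    while remaining and any(n > 0 for n in need):
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break
        chosen.append(best)
        remaining.remove(best)
        need = [n - 1 if (n > 0 and best in ex) else n
                for ex, n in zip(exemplars, need)]
    return chosen


# Hypothetical toy data: two "virus" exemplars and two "clean" exemplars.
by_class = {"virus": [b"XXABCD", b"ABCDYY"], "clean": [b"XXYYZZ", b"QQXXRR"]}
cands = candidate_features(by_class, "virus", lo=2, hi=3, t_in=0.5, t_out=0.0)
chosen = greedy_cover(by_class["virus"], cands, n_cover=2)
print(cands, chosen)
```

The greedy pass is only one way to reach the coverage condition; it does not guarantee the smallest possible feature set.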
8. A method of constructing a classifier of data strings, comprising:
providing a labelled set of exemplars from a plurality of classes;
defining a set of features, based on the exemplars, that are statistically likely to be relevant to the classification, comprising, for each class:
forming a list of each feature for which the average of a transformed frequency, defined as a specified transforming function of the occurrence frequency of that feature, among the exemplars of that class exceeds a predetermined threshold;
eliminating from the list every feature for which the value of a specified monotonically non-decreasing multi-variate function of the average of the transformed frequency of that feature in each of the other classes exceeds a predetermined threshold;
developing a classifier that uses the occurrence frequency of the features in an input string to classify that string, wherein developing the classifier comprises training the classifier on the exemplars to take as input a function of the occurrence frequency of each of the features, and to produce as output a class label.
Dependent claims: 9, 10, 11, 12, 13, 14, 15
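Claim 8 replaces the raw relative frequency with the average of a transformed frequency. The sketch below shows one way that test might look. The claim leaves the transforming function unspecified, so `math.log1p` as the default and a saturating `min(count, 1)` in the usage example are assumptions, as is the use of `max` to combine the other classes' averages. `bytes.count` counts non-overlapping occurrences, which is a further simplification. All names and thresholds are hypothetical.

```python
import math


def avg_transformed(exemplars, feature, transform=math.log1p):
    """Average over exemplars of transform(occurrence count of feature)."""
    if not exemplars:
        return 0.0
    total = sum(transform(ex.count(feature)) for ex in exemplars)
    return total / len(exemplars)


def select_by_transformed_freq(by_class, target, candidates, t_in, t_out,
                               transform=math.log1p, combine=max):
    """Keep features whose average transformed frequency is high in
    `target` and whose combined average in other classes stays low."""
    selected = []
    for f in candidates:
        if avg_transformed(by_class[target], f, transform) <= t_in:
            continue  # fails the within-class threshold
        others = [avg_transformed(exs, f, transform)
                  for label, exs in by_class.items() if label != target]
        if others and combine(others) > t_out:
            continue  # fails the other-class threshold
        selected.append(f)
    return selected


# Hypothetical toy data and a saturating transform (presence/absence).
by_class = {"virus": [b"ABAB", b"ABxx"], "clean": [b"xxxx", b"xxAB"]}
saturate = lambda c: min(c, 1)
picked = select_by_transformed_freq(
    by_class, "virus", [b"AB", b"xx"], t_in=0.6, t_out=0.6, transform=saturate)
print(picked)
```

With the saturating transform the average reduces to the fraction of exemplars containing the feature, which is why the same thresholds behave like relative-frequency cutoffs here; a transform such as `log1p` would instead give extra weight to repeated occurrences.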
Specification