Classifying malware by order of network behavior artifacts
First Claim
1. A method of determining whether an executable file is malware by using network behavioral artifacts, the method comprising:
identifying a training corpus comprising plurality of benign executable files and a plurality of malware executable files;
associating, by an electronic hardware processor, each of a plurality of network behavioral artifacts with a respective character set;
assigning, by an electronic hardware processor, each executable file from the training corpus a respective string of character sets, wherein each string of character sets represents temporally ordered network behavior artifacts of a respective executable file from the training corpus, whereby a plurality of strings of character sets is obtained;
obtaining, by an electronic hardware processor, for each of the plurality of strings of character sets and for a fixed n > 1, a respective set of contiguous substrings of length n;
ordering, by an electronic hardware processor, a union of the respective sets of contiguous substrings of length n, whereby an ordered universe of contiguous substrings of length n is obtained;
forming, for each executable file from the training corpus and by an electronic hardware processor, a respective feature vector, wherein each respective feature vector comprises a tally list comprising counts of contiguous substrings of length n in the respective set of contiguous n-grams for the respective executable file from the training corpus, whereby a plurality of feature vectors is obtained;
classifying, by an electronic hardware processor, each respective feature vector of the plurality of feature vectors as associated with either a benign executable file or a malware executable file from the training corpus, whereby a set of classified feature vectors is obtained;
training a machine learning system with the set of classified feature vectors, wherein the machine learning system comprises an electronic hardware processor;
identifying an unknown executable file;
generating, by an electronic hardware processor, a feature vector for the unknown executable file;
submitting the feature vector for the unknown executable file to the machine learning system;
obtaining, by an electronic hardware processor, a classification of the unknown executable file as one of likely benign and likely malware; and
outputting, by an electronic hardware processor, the classification of the unknown executable file.
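The feature-construction steps of claim 1 (substrings of length n, ordered universe, tally list) can be illustrated with a minimal Python sketch. It assumes each network behavioral artifact has already been mapped to a single character, which simplifies the claim's "character set" per artifact. The function names `ngrams`, `build_universe`, and `feature_vector`, the sample strings, and the artifact letters are hypothetical. Lexicographic sorting is only one possible ordering; the claim does not prescribe one.

```python
from collections import Counter

def ngrams(s, n):
    """Return all contiguous substrings of length n, in order of occurrence."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_universe(strings, n):
    """Order the union of every string's n-gram set (here: lexicographically)."""
    union = set()
    for s in strings:
        union.update(ngrams(s, n))
    return sorted(union)

def feature_vector(s, universe, n):
    """Tally how often each n-gram of the universe occurs in s."""
    counts = Counter(ngrams(s, n))
    return [counts[g] for g in universe]

# Hypothetical training strings; each character stands for one artifact,
# e.g. D = DNS query, S = TCP SYN, H = HTTP request (assumed examples).
training = {"benign_1": "DSHDSH", "malware_1": "DDDSH"}
n = 2
universe = build_universe(training.values(), n)
vectors = {name: feature_vector(s, universe, n) for name, s in training.items()}
```

In this sketch, an unknown file's feature vector would be built with the same `universe`, so any n-gram that never appeared in the training corpus is simply not counted.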
Abstract
The present invention generally relates to systems and methods for classifying executable files as likely malware or likely benign. The techniques utilize temporally-ordered network behavioral artifacts together with machine learning techniques to perform the classification. Because they rely on network behavioral artifacts, the disclosed techniques may be applied to executable files with obfuscated code.
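As a hedged illustration of what temporally ordered network behavioral artifacts might look like as data, the sketch below sorts observed events by timestamp and encodes each one as a character. The artifact names, the `ARTIFACT_CODES` table, the `encode_trace` function, and the sample events are all assumptions made for illustration; the abstract does not specify them.

```python
# Hypothetical mapping from artifact type to a one-character code.
ARTIFACT_CODES = {"dns_query": "D", "tcp_syn": "S", "http_get": "H"}

def encode_trace(events):
    """Encode (timestamp, artifact_type) pairs as a string in time order."""
    ordered = sorted(events, key=lambda e: e[0])
    return "".join(ARTIFACT_CODES[kind] for _, kind in ordered)

# Made-up events, listed out of order as a capture tool might log them.
events = [(2.5, "http_get"), (0.1, "dns_query"), (1.0, "tcp_syn")]
trace = encode_trace(events)
```

Nothing in this encoding inspects the file's own code; the string is derived only from observed traffic.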
20 Claims
11. A system for determining whether an executable file is malware by using network behavioral artifacts, the system comprising at least one hardware electronic processor configured to:
identify a training corpus comprising plurality of benign executable files and a plurality of malware executable files;
associate each of a plurality of network behavioral artifacts with a respective character set;
assign each executable file from the training corpus a respective string of character sets, wherein each string of character sets represents temporally ordered network behavior artifacts of a respective executable file from the training corpus, whereby a plurality of strings of character sets is obtained;
obtain for each of the plurality of strings of character sets and for a fixed n > 1, a respective set of contiguous substrings of length n;
order a union of the respective sets of contiguous substrings of length n, whereby an ordered universe of contiguous substrings of length n is obtained;
form, for each executable file from the training corpus, a respective feature vector, wherein each respective feature vector comprises a tally list comprising counts of contiguous substrings of length n in the respective set of contiguous n-grams for the respective executable file from the training corpus, whereby a plurality of feature vectors is obtained;
classify each respective feature vector of the plurality of feature vectors as associated with either a benign executable file or a malware executable file from the training corpus, whereby a set of classified feature vectors is obtained;
train machine learning system with the set of classified feature vectors;
identify an unknown executable file;
generate a feature vector for the unknown executable file;
submit the feature vector for the unknown executable file to the machine learning system;
obtain a classification of the unknown executable file as one of likely benign and likely malware; and
output the classification of the unknown executable file.
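The train, submit, and classify steps can be sketched as follows. The claims do not name a particular learning algorithm, so this example swaps in a nearest-centroid classifier purely as a stand-in for the "machine learning system". The functions `centroid`, `train`, and `classify`, and the count vectors, are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def train(labeled_vectors):
    """Group feature vectors by label and return one centroid per label."""
    by_label = {}
    for vec, label in labeled_vectors:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def classify(model, vec):
    """Return the label whose centroid is nearest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], vec))

# Made-up n-gram count vectors over a four-element ordered universe.
labeled = [
    ([0, 2, 1, 2], "likely benign"),
    ([0, 3, 1, 2], "likely benign"),
    ([2, 1, 0, 1], "likely malware"),
    ([3, 1, 0, 1], "likely malware"),
]
model = train(labeled)

# Feature vector of a hypothetical unknown executable file.
verdict = classify(model, [2, 0, 0, 1])
```

Any classifier that accepts fixed-length count vectors could take the place of the centroid model here; the choice above is only for brevity and uses the standard library alone (`math.dist` requires Python 3.8 or later).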
Specification