Classifying malware by order of network behavior artifacts
Abstract
The present invention generally relates to systems and methods for classifying executable files as likely malware or likely benign. The techniques utilize temporally-ordered network behavioral artifacts together with machine learning techniques to perform the classification. Because they rely on network behavioral artifacts, the disclosed techniques may be applied to executable files with obfuscated code.
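As an illustration of how temporally ordered network behavioral artifacts might be turned into a string, here is a minimal sketch. The event names, the one-character codes, and the `encode_artifacts` helper are all hypothetical; the text above does not specify an encoding, and a real encoding could use multi-character sets per artifact.

```python
# Hypothetical mapping from a network event type to a one-character code.
# None of these names or codes come from the patent text.
EVENT_CODES = {
    "dns_query": "D",
    "tcp_connect": "T",
    "http_get": "G",
    "http_post": "P",
}


def encode_artifacts(events):
    """Return one character per event, ordered by timestamp.

    `events` is an iterable of (timestamp, event_type) pairs.
    Event types missing from EVENT_CODES are encoded as "?".
    """
    ordered = sorted(events, key=lambda event: event[0])
    return "".join(EVENT_CODES.get(kind, "?") for _, kind in ordered)


# Made-up trace, deliberately listed out of time order.
trace = [(2.0, "tcp_connect"), (0.5, "dns_query"), (3.1, "http_post")]
encoded = encode_artifacts(trace)
```

Sorting by timestamp before encoding is what preserves the temporal order that the classification relies on.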
20 Claims
1. A computer-implemented method of determining whether an executable file is malware by using network behavioral artifacts, the method comprising:
generating network behavioral artifacts for each executable file included in a training corpus comprising one or more executable files classified as benign and one or more executable files classified as malware;
assigning, by an electronic hardware processor, for each executable file included in the training corpus, a respective string of character sets to represent the network behavioral artifacts generated for the executable file;
forming, for each executable file included in the training corpus, a respective feature vector based on the respective string of character sets, wherein the respective feature vector indicates, for each contiguous character substring included in a plurality of contiguous character substrings, how many instances of the contiguous character substring appear in the respective string of character sets;
training a machine learning system based on the respective feature vectors;
generating a feature vector for an unknown executable file;
classifying, by the machine learning system, the unknown executable file as one of likely benign and likely malware based on the feature vector for the unknown executable file; and
outputting the classification of the unknown executable file.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
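The steps recited in claim 1 can be sketched end to end as follows. This is only an illustration under stated assumptions: the encoded traces are made up, the substring length of 2 is arbitrary, the vocabulary is taken from the training corpus, and a nearest-centroid rule stands in for the unspecified "machine learning system".

```python
from collections import Counter


def ngram_counts(encoded, n=2):
    """Count every contiguous substring of length n in `encoded`."""
    return Counter(encoded[i:i + n] for i in range(len(encoded) - n + 1))


def to_vector(encoded, vocabulary, n=2):
    """Feature vector: number of occurrences of each vocabulary substring."""
    counts = ngram_counts(encoded, n)
    return [counts[gram] for gram in vocabulary]


def train_centroids(vectors, labels):
    """Stand-in learner (an assumption): average the vectors of each class."""
    sums, sizes = {}, Counter(labels)
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
    return {lab: [s / sizes[lab] for s in acc] for lab, acc in sums.items()}


def classify(vector, centroids):
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


# Made-up training corpus of (encoded trace, label) pairs.
corpus = [
    ("DTGDTG", "benign"),
    ("DTGG", "benign"),
    ("DDDDTP", "malware"),
    ("DDDTPP", "malware"),
]
vocabulary = sorted({g for s, _ in corpus for g in ngram_counts(s)})
vectors = [to_vector(s, vocabulary) for s, _ in corpus]
labels = [label for _, label in corpus]
centroids = train_centroids(vectors, labels)

# Classify an unknown (also made-up) trace and output the result.
prediction = classify(to_vector("DDDTP", vocabulary), centroids)
print(prediction)
```

Any supervised classifier that accepts fixed-length numeric vectors could replace the centroid rule; the part that mirrors the claim is the count of each contiguous substring.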
12. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of:
generating network behavioral artifacts for each executable file included in a training corpus comprising one or more executable files classified as benign and one or more executable files classified as malware;
assigning, for each executable file included in the training corpus, a respective string of character sets to represent the network behavioral artifacts generated for the executable file;
forming, for each executable file included in the training corpus, a respective feature vector based on the respective string of character sets, wherein the respective feature vector indicates which contiguous character substrings included in a plurality of contiguous character substrings appear in the respective string of character sets;
training a machine learning system based on the respective feature vectors;
generating a feature vector for an unknown executable file;
classifying, by the machine learning system, the unknown executable file as one of likely benign and likely malware based on the feature vector for the unknown executable file; and
outputting the classification of the unknown executable file.
Dependent claims: 13, 14, 15, 16, 17, 18
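Claim 12 differs from claim 1 in that its feature vector records which substrings appear rather than how many times. A minimal sketch of such a presence vector, assuming a made-up vocabulary and a substring length of 2 (neither is given in the claim):

```python
def ngrams_present(encoded, n=2):
    """Return the set of distinct contiguous substrings of length n."""
    return {encoded[i:i + n] for i in range(len(encoded) - n + 1)}


def presence_vector(encoded, vocabulary, n=2):
    """1 where the vocabulary substring appears in `encoded`, else 0."""
    present = ngrams_present(encoded, n)
    return [1 if gram in present else 0 for gram in vocabulary]


# Hypothetical vocabulary and encoded trace, for illustration only.
vocabulary = ["DD", "DT", "TG", "TP"]
vector = presence_vector("DDDDTP", vocabulary)
```

Here "DD" occurs three times in the trace but still contributes a single 1, which is the only change relative to a count-based vector.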
19. A system for determining whether an executable file is malware by using network behavioral artifacts, the system comprising:
one or more memories that include instructions; and
one or more processors that are coupled to the one or more memories and, when executing the instructions, are configured to:
generate network behavioral artifacts for each executable file included in a training corpus comprising one or more executable files classified as benign and one or more executable files classified as malware;
assign, for each executable file included in the training corpus, a respective string of character sets to represent the network behavioral artifacts generated for the executable file;
form, for each executable file included in the training corpus, a respective feature vector based on the respective string of character sets by ordering a plurality of contiguous character substrings appearing in the respective string of character sets based on at least one characteristic of one or more characters included in the contiguous character substrings;
train a machine learning system based on the respective feature vectors;
generate a feature vector for an unknown executable file;
classify, by the machine learning system, the unknown executable file as one of likely benign and likely malware based on the feature vector for the unknown executable file; and
output the classification of the unknown executable file.
Dependent claims: 20
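Claim 19 forms the feature vector by ordering the substrings according to some characteristic of their characters, without naming that characteristic. The sketch below assumes one possible choice, ordering by character code point, purely for illustration; the helper name and the sample trace are hypothetical.

```python
def ordered_substrings(encoded, n=2):
    """Distinct length-n substrings, ordered by their characters' code points.

    Ordering by code point is an assumed characteristic; the claim leaves
    the choice open.
    """
    distinct = {encoded[i:i + n] for i in range(len(encoded) - n + 1)}
    return sorted(distinct, key=lambda gram: [ord(ch) for ch in gram])


# Made-up encoded trace.
ordering = ordered_substrings("GDTPD")
```

A fixed ordering like this gives every file's feature vector the same layout, so position i always refers to the same substring.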
Specification