Classifying malware by order of network behavior artifacts

US 9,779,238 B2
Filed: 11/08/2016
Issued: 10/03/2017
Est. Priority Date: 10/11/2013
Status: Active Grant

Abstract

The present invention generally relates to systems and methods for classifying executable files as likely malware or likely benign. The techniques utilize temporally-ordered network behavioral artifacts together with machine learning techniques to perform the classification. Because they rely on network behavioral artifacts, the disclosed techniques may be applied to executable files with obfuscated code.
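To illustrate why the temporal order of artifacts can carry signal, the following minimal sketch compares two traces that contain exactly the same artifacts in a different order. The one-character codes (`D`, `G`, `P`) and both traces are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
from collections import Counter

def ngrams(s, n):
    """Return every contiguous substring of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# Hypothetical codes (assumption): D = DNS query, G = HTTP GET, P = HTTP POST.
trace_a = "DGPDGP"  # resolve, fetch, upload, repeated
trace_b = "DDGGPP"  # the same artifacts in a different temporal order

# An order-blind view (plain symbol counts) cannot tell the traces apart.
same_symbols = Counter(trace_a) == Counter(trace_b)

# Counting contiguous substrings of length 2 preserves local ordering.
bigrams_a = Counter(ngrams(trace_a, 2))
bigrams_b = Counter(ngrams(trace_b, 2))

print(same_symbols)            # True
print(bigrams_a == bigrams_b)  # False
```

The symbol counts match, while the length-2 substring counts differ, which is the kind of information an order-aware feature vector can expose to a classifier.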

20 Claims

1. A computer-implemented method of determining whether an executable file is malware by using network behavioral artifacts, the method comprising:
- generating network behavioral artifacts for each executable file included in a training corpus comprising one or more executable files classified as benign and one or more executable files classified as malware;
  
  assigning, by an electronic hardware processor, for each executable file included in the training corpus, a respective string of character sets to represent the network behavioral artifacts generated for the executable file;
  
  forming, for each executable file included in the training corpus, a respective feature vector based on the respective string of character sets, wherein the respective feature vector indicates, for each contiguous character substring included in a plurality of contiguous character substrings, how many instances of the contiguous character substring appear in the respective string of character sets;
  
  training a machine learning system based on the respective feature vectors;
  
  generating a feature vector for an unknown executable file;
  
  classifying, by the machine learning system, the unknown executable file as one of likely benign and likely malware based on the feature vector for the unknown executable file; and
  
  outputting the classification of the unknown executable file.
  - 2. The computer-implemented method of claim 1, further comprising:
    - obtaining for each of the plurality of strings of character sets and for a fixed n>1, a respective set of contiguous substrings of length n; and
      
      ordering a union of the respective sets of contiguous substrings of length n, whereby an ordered universe of contiguous substrings of length n is obtained, wherein the respective feature vector indicates how many instances of the contiguous substrings of length n appear in the respective string of character sets.
  - 3. The computer-implemented method of claim 1, wherein the network behavioral artifacts comprise artifacts of at least one of the following types:
    - traffic direction type, protocol type, port number type, size type, domain name system (DNS) record type, and hypertext transfer protocol (HTTP) method type.
  - 4. The computer-implemented method of claim 2, wherein the traffic direction type comprises an inbound traffic artifact and an outbound traffic artifact;
    - the protocol type comprises a user datagram protocol (UDP) artifact and a transmission control protocol (TCP) artifact;
      
      the port number type comprises a port 53 artifact, a port 80 artifact, a port 442 artifact, a port 8080 artifact, and a port 8000 artifact;
      
      the size type comprises a first quartile artifact, a second quartile artifact, a third quartile artifact and a fourth quartile artifact;
      
      the DNS record type comprises an A record artifact, an MX record artifact, and an SDA record artifact; and
      
      the HTTP method type comprises a GET method artifact, a POST record artifact, and a HEAD record artifact.
  - 5. The computer-implemented method of claim 1, wherein the assigning comprises executing a respective executable file from the training corpus in a virtualized environment.
  - 6. The computer-implemented method of claim 1, wherein the assigning comprises observing network traffic.
  - 7. The computer-implemented method of claim 1, wherein the machine learning system comprises at least one of:
    - a support vector machine, a decision tree, and a k-nearest-neighbor classifier.
  - 8. The computer-implemented method of claim 1, wherein the outputting comprises at least one of:
    - displaying, storing in persistent storage, and providing to a software module.
  - 9. The computer-implemented method of claim 1, wherein the generating the feature vector for the unknown executable file comprises executing the unknown executable file in a virtualized environment.
  - 10. The computer-implemented method of claim 1, wherein the generating the feature vector for the unknown executable file comprises observing network traffic of the unknown executable file.
  - 11. The computer-implemented method of claim 1, wherein the classifying the unknown executable file as one of likely benign and likely malware comprises obtaining a classification of the unknown executable file as likely malware, the method further comprising blocking network traffic from an entity associated with the unknown executable file.
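The steps recited in claims 1, 2, and 7 can be sketched end to end as follows. This is a toy illustration under stated assumptions, not the patented implementation: the symbol table `CODES`, the four-artifact traffic fragments, the training traces and their labels, the choice of n = 2, and the hand-rolled k-nearest-neighbor vote are all hypothetical.

```python
from collections import Counter
import math

# Hypothetical symbol table (assumption): one character per artifact value.
CODES = {
    ("direction", "outbound"): "o", ("direction", "inbound"): "i",
    ("protocol", "UDP"): "u", ("protocol", "TCP"): "t",
    ("port", 53): "5", ("port", 80): "8",
    ("dns", "A"): "a", ("dns", "MX"): "m",
    ("http", "GET"): "g", ("http", "POST"): "p",
}

def encode(trace):
    """Map a time-ordered list of (type, value) artifacts to a string."""
    return "".join(CODES[artifact] for artifact in trace)

def ngrams(s, n):
    """Return every contiguous substring of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_universe(strings, n):
    """Ordered union of the length-n substrings seen in the training strings."""
    return sorted({g for s in strings for g in ngrams(s, n)})

def vectorize(s, universe, n):
    """Count how many times each substring of the universe occurs in s."""
    counts = Counter(ngrams(s, n))
    return [counts[g] for g in universe]

def knn_classify(x, train_vecs, train_labels, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: math.dist(x, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy traffic fragments (made up for this sketch).
DNS_A = [("direction", "outbound"), ("protocol", "UDP"), ("port", 53), ("dns", "A")]
DNS_MX = [("direction", "outbound"), ("protocol", "UDP"), ("port", 53), ("dns", "MX")]
HTTP_GET = [("direction", "outbound"), ("protocol", "TCP"), ("port", 80), ("http", "GET")]
HTTP_POST = [("direction", "outbound"), ("protocol", "TCP"), ("port", 80), ("http", "POST")]

# Toy training corpus with made-up labels.
corpus = [
    (DNS_A + HTTP_GET, "benign"),
    (DNS_A + HTTP_GET + HTTP_GET, "benign"),
    (DNS_MX + HTTP_POST, "malware"),
    (HTTP_POST + HTTP_POST, "malware"),
]

N = 2
train_strings = [encode(trace) for trace, _ in corpus]
train_labels = [label for _, label in corpus]
universe = build_universe(train_strings, N)
train_vecs = [vectorize(s, universe, N) for s in train_strings]

# Classify an unknown trace and output the result.
unknown = DNS_MX + HTTP_POST + HTTP_POST
verdict = "likely " + knn_classify(vectorize(encode(unknown), universe, N),
                                   train_vecs, train_labels)
print(verdict)
```

On this toy data the two nearest training vectors both carry the `malware` label, so the sketch prints `likely malware`. A production system would derive the traces from observed traffic (for example, from a sandboxed run as in claims 5 and 9) and would typically use an established machine learning library rather than this hand-written vote.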

12. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of:
- generating network behavioral artifacts for each executable file included in a training corpus comprising one or more executable files classified as benign and one or more executable files classified as malware;
  
  assigning, for each executable file included in the training corpus, a respective string of character sets to represent the network behavioral artifacts generated for the executable file;
  
  forming, for each executable file included in the training corpus, a respective feature vector based on the respective string of character sets, wherein the respective feature vector indicates which contiguous character substrings included in a plurality of contiguous character substrings appear in the respective string of character sets;
  
  training a machine learning system based on the respective feature vectors;
  
  generating a feature vector for an unknown executable file;
  
  classifying, by the machine learning system, the unknown executable file as one of likely benign and likely malware based on the feature vector for the unknown executable file; and
  
  outputting the classification of the unknown executable file.
  - 13. The non-transitory computer-readable storage medium of claim 12, further comprising:
    - obtaining for each of the plurality of strings of character sets and for a fixed n>1, a respective set of contiguous substrings of length n; and
      
      ordering a union of the respective sets of contiguous substrings of length n, whereby an ordered universe of contiguous substrings of length n is obtained, wherein the respective feature vector indicates how many instances of the contiguous substrings of length n appear in the respective string of character sets.
  - 14. The non-transitory computer-readable storage medium of claim 12, wherein the assigning comprises executing a respective executable file from the training corpus in a virtualized environment.
  - 15. The non-transitory computer-readable storage medium of claim 12, wherein the assigning comprises observing network traffic.
  - 16. The non-transitory computer-readable storage medium of claim 12, wherein the generating the feature vector for the unknown executable file comprises executing the unknown executable file in a virtualized environment.
  - 17. The non-transitory computer-readable storage medium of claim 12, wherein the generating the feature vector for the unknown executable file comprises observing network traffic of the unknown executable file.
  - 18. The non-transitory computer-readable storage medium of claim 12, wherein the classifying the unknown executable file as one of likely benign and likely malware comprises obtaining a classification of the unknown executable file as likely malware, the method further comprising blocking network traffic from an entity associated with the unknown executable file.
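Claim 12 differs from claim 1 in that its feature vector indicates which contiguous substrings appear, rather than how many instances appear. A minimal sketch of the two encodings side by side, assuming a hypothetical four-entry universe of length-2 substrings and a made-up input string:

```python
from collections import Counter

def ngrams(s, n):
    """Return every contiguous substring of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def presence_vector(s, universe, n):
    """1 if the substring occurs anywhere in s, else 0 (which substrings appear)."""
    present = set(ngrams(s, n))
    return [1 if g in present else 0 for g in universe]

def count_vector(s, universe, n):
    """Number of occurrences of each substring in s (how many instances appear)."""
    counts = Counter(ngrams(s, n))
    return [counts[g] for g in universe]

# Hypothetical ordered universe and encoded trace (assumptions).
universe = ["5a", "8g", "ot", "t8"]
encoded = "ot8got8g"

print(presence_vector(encoded, universe, 2))  # [0, 1, 1, 1]
print(count_vector(encoded, universe, 2))     # [0, 2, 2, 2]
```

The presence form discards repetition, so a trace that repeats a pattern many times looks the same as one that shows it once; the count form keeps that difference.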

19. A system for determining whether an executable file is malware by using network behavioral artifacts, the system comprising:
- one or more memories that include instructions; and
  
  one or more processors that are coupled to the one or more memories and, when executing the instructions, are configured to;
  
  generate network behavioral artifacts for each executable file included in a training corpus comprising one or more executable files classified as benign and one or more executable files classified as malware;
  
  assign for each executable file included in the training corpus, a respective string of character sets to represent the network behavioral artifacts generated for the executable file;
  
  for each executable file included in the training corpus, forming a respective feature vector based on the respective string of character sets by ordering a plurality of contiguous substrings appearing in the respective string of character sets based on at least one characteristic of one or more characters included in the contiguous character substrings;
  
  train a machine learning system based on the respective feature vectors;
  
  generate a feature vector for an unknown executable file;
  
  classify, by the machine learning system, the unknown executable file as one of likely benign and likely malware based on the feature vector for the unknown executable file; and
  
  output the classification of the unknown executable file.
  - 20. The system of claim 19, wherein the one or more processors are further configured to:
    - obtain for each of the plurality of strings of character sets and for a fixed n>1, a respective set of contiguous substrings of length n; and
      
      order a union of the respective sets of contiguous substrings of length n, whereby an ordered universe of contiguous substrings of length n is obtained, wherein the respective feature vector indicates how many instances of the contiguous substrings of length n appear in the respective string of character sets.
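The "ordered universe" of claims 19 and 20 can be read as a fixed mapping from each substring to one position in the feature vector. The sketch below assumes lexicographic order by character code point as the ordering characteristic; the claims do not name a specific ordering, so that choice, like the two sample strings, is only an assumption.

```python
def ngrams(s, n):
    """Return every contiguous substring of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def ordered_universe(strings, n):
    """Union of all length-n substrings, sorted so positions are reproducible."""
    union = set()
    for s in strings:
        union.update(ngrams(s, n))
    # Assumed ordering characteristic: code-point (lexicographic) order.
    return sorted(union)

# Hypothetical encoded traces.
strings = ["abab", "bca"]

u_forward = ordered_universe(strings, 2)
u_reverse = ordered_universe(list(reversed(strings)), 2)

# Each substring gets one fixed position in every feature vector.
index = {g: i for i, g in enumerate(u_forward)}

print(u_forward)               # ['ab', 'ba', 'bc', 'ca']
print(u_forward == u_reverse)  # True
```

Because the union is sorted before positions are assigned, the layout of the feature vector does not depend on the order in which training files were processed, so vectors for training files and unknown files line up position by position.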

Current Assignee: VeriSign, Inc.
Original Assignee: VeriSign, Inc.
Inventors: Mankin, Allison; Mohaisen, Abedelaziz; Tonn, Trevor
Primary Examiner(s): Zand, Kambiz
Assistant Examiner(s): Getachew, Abiy
Application Number: US15/346,694
Publication Number: US 20170053119A1
Time in Patent Office: 329 Days
Field of Search: 726/23, 726/24

CPC Class Codes

G06F 21/562   Static detection

G06F 2221/034   Test or assess a computer o...

G06N 20/00   Machine learning

H04L 63/1408   by monitoring network traff...

H04L 63/145   the attack involving the pr...
