Detection of code-based malware

US 8,713,679 B2
Filed: 02/18/2011
Issued: 04/29/2014
Est. Priority Date: 02/18/2011
Status: Active Grant

Abstract

This document describes techniques for detection of code-based malware. According to some embodiments, the techniques utilize a collection of known malicious code and known benign code and determine which features of each type of code can be used to determine whether unclassified code is malicious or benign. The features can then be used to train a classifier (e.g., a Bayesian classifier) to characterize unclassified code as malicious or benign. In at least some embodiments, the techniques can be used as part of and/or in cooperation with a web browser to inspect web content (e.g., a web page) to determine if the content includes code-based malware.
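
The abstract's training step can be illustrated with a small naive Bayes classifier over binary features. This is only a sketch under stated assumptions, not the patented implementation: the function names, the `"malicious"`/`"benign"` labels, the add-one smoothing, and the feature strings in the usage note are all hypothetical, and both classes are assumed to appear in the training data.

```python
import math
from collections import Counter

LABELS = ("malicious", "benign")

def train(samples):
    """Count, per label, how many samples contain each feature.

    samples: list of (features, label) pairs; label is one of LABELS.
    Assumes at least one sample of each label is present.
    """
    doc_counts = Counter()
    feature_counts = {label: Counter() for label in LABELS}
    vocabulary = set()
    for features, label in samples:
        present = set(features)
        doc_counts[label] += 1
        feature_counts[label].update(present)
        vocabulary.update(present)
    return doc_counts, feature_counts, vocabulary

def malicious_probability(model, features):
    """Return P(malicious | features) under a Bernoulli naive Bayes model."""
    doc_counts, feature_counts, vocabulary = model
    total = sum(doc_counts.values())
    present = set(features)
    log_scores = {}
    for label in LABELS:
        score = math.log(doc_counts[label] / total)  # class prior
        for f in vocabulary:
            # Add-one smoothing keeps unseen features from zeroing the product.
            p = (feature_counts[label][f] + 1) / (doc_counts[label] + 2)
            score += math.log(p if f in present else 1.0 - p)
        log_scores[label] = score
    # Normalize in log space to avoid underflow.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    return weights["malicious"] / sum(weights.values())
```

For example, after training on two samples of each label, one label always carrying a hypothetical feature `"loop:unescape"` and the other `"function:getElementById"`, the returned probability rises above 0.5 for an input containing only the first feature and falls below 0.5 for an input containing only the second.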

18 Claims

1. A system comprising:
- one or more hardware processors; and
  
  one or more computer-readable storage media storing computer-executable instructions that are executable by the one or more hardware processors to cause the system to perform operations including:
  
  determining code contexts from known malicious script and known benign script;
  
  building abstract syntax trees (ASTs) using code found in the code contexts;
  
  extracting structural features from the known malicious script and known benign script based on structures and contents of the ASTs, the structural features being different from text of the known malicious script and the known benign script;
  
  comparing structural features from unclassified script with the structural features from the known malicious script and the known benign script; and
  
  classifying the unclassified script as malicious or benign based on the comparison of the structural features from the unclassified script with the structural features from the known malicious script and the known benign script.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The system as recited in claim 1, wherein the structural features from one or more of the known malicious script or the known benign script include one or more of a loop, a function, a conditional, a string, a variable declaration, or a try/catch block.
  - 3. The system as recited in claim 1, wherein classifying the unclassified script as malicious or benign comprises using a state machine to match the structural features from the unclassified script with the structural features from one of the known malicious script or the known benign script.
  - 4. The system as recited in claim 1, wherein classifying the unclassified script as malicious or benign comprises calculating a probability or another numeric score indicating that the unclassified script is malicious or benign.
  - 5. The system as recited in claim 4, wherein the probability or another numeric score that the unclassified script is malicious or benign is calculated using a Bayesian classifier.
  - 6. The system as recited in claim 1, wherein the structural features of the unclassified script are determined by:
    - de-obfuscating the unclassified script;
      
      building one or more abstract syntax trees (ASTs) using the de-obfuscated unclassified script; and
      
      determining the structural features of the unclassified script based on one or more of the contents or the structure of the one or more ASTs.
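
The AST-based feature extraction recited in claims 1, 2 and 6 might be sketched as follows. This is an illustration under loud assumptions: Python's standard `ast` module parses Python source, so it stands in here for a parser of whatever script language is being inspected, de-obfuscation is skipped, and the `(context, kind, text)` feature shape, the `CONTEXT_KINDS` mapping and the function name are hypothetical choices rather than anything the claims specify.

```python
import ast

# Hypothetical mapping from Python AST node types to the kinds of
# structure listed in claim 2 (loop, function, conditional, try/catch).
CONTEXT_KINDS = {
    ast.For: "loop",
    ast.While: "loop",
    ast.FunctionDef: "function",
    ast.If: "conditional",
    ast.Try: "try",
}

def extract_features(source):
    """Return a set of (context, kind, text) features for the given source.

    context is the nearest enclosing structure ("top" if none), so the
    same string inside a loop and at top level yields different features.
    """
    tree = ast.parse(source)
    features = set()

    def visit(node, context):
        new_context = CONTEXT_KINDS.get(type(node), context)
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            features.add((new_context, "string", node.value))
        elif isinstance(node, ast.Name):
            features.add((new_context, "name", node.id))
        for child in ast.iter_child_nodes(node):
            visit(child, new_context)

    visit(tree, "top")
    return features
```

Because each feature pairs a piece of text with its position in the tree, the feature set differs from the raw script text, which is the distinction claim 1 draws.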

7. A computer-implemented method comprising:
- extracting a first set of structural features from known code by:
  
  unfolding the known code to determine code contexts associated with the known code;
  
  building one or more abstract syntax trees (ASTs) using the code contexts; and
  
  determining the first set of features based on the structure of the one or more ASTs;
  
  extracting a second set of structural features from the first set of structural features based on a determination of which features of the first set of structural features are predictive of a particular code classification, the second set of structural features being a subset of the first set of structural features and excluding one or more features of the first set of structural features that are determined not to be predictive of a particular code classification;
  
  training a classifier using the second set of structural features; and
  
  classifying with the classifier unclassified code based at least in part on the second set of structural features.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The method as recited in claim 7, wherein the known code comprises known malicious code and known benign code, the particular code classification comprises classifying code as malicious or benign, and wherein classifying the unclassified code comprises classifying the unclassified code as malicious or benign.
  - 9. The method as recited in claim 7, wherein the determination of which features of the first set of features are predictive of the particular code classification is based on an analysis of the features of the first set of features using a χ² algorithm.
  - 10. The method as recited in claim 7, wherein the classifier is configured to be implemented in a web browsing environment.
  - 11. The method as recited in claim 7, wherein classifying the unclassified code comprises using a state machine to match one or more features from the unclassified code with one or more features from the second set of features.
  - 12. The method as recited in claim 7, further comprising updating the classifier with a third set of features from different known code.
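
The feature-selection step of claims 7 and 9 could be sketched with the standard χ² statistic for a 2×2 contingency table (feature present or absent versus malicious or benign). The claims do not give a cutoff, so the default `threshold` below is an assumption: 10.83 is the usual critical value for one degree of freedom at the 0.1% significance level. The function names and the sample layout are likewise hypothetical.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 table.

    a: malicious samples with the feature
    b: benign samples with the feature
    c: malicious samples without the feature
    d: benign samples without the feature
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A feature seen in every sample (or none) carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator

def select_features(samples, threshold=10.83):
    """Keep only features whose chi-squared score reaches the threshold.

    samples: list of (features, label) pairs with label "malicious" or "benign".
    Returns the predictive subset (the "second set" in claim 7's terms).
    """
    malicious = [set(f) for f, label in samples if label == "malicious"]
    benign = [set(f) for f, label in samples if label == "benign"]
    candidates = set().union(*malicious, *benign)
    selected = set()
    for feature in candidates:
        a = sum(feature in s for s in malicious)
        b = sum(feature in s for s in benign)
        c = len(malicious) - a
        d = len(benign) - b
        if chi_squared(a, b, c, d) >= threshold:
            selected.add(feature)
    return selected
```

A feature that appears in every sample of both classes scores zero and is dropped, while one that appears in all malicious samples and no benign ones scores highest and is kept.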

13. A computer-implemented method comprising:
- building an abstract syntax tree (AST) using code contexts retrieved from one of a known malicious script or a known benign script;
  
  determining features of the known malicious script or the known benign script based on the structure and textual contents of the AST;
  
  matching features of an unclassified script to the features of the known malicious script or the known benign script; and
  
  classifying the unclassified script as malicious or benign based on the matching.
- Dependent claims (14, 15, 16, 17, 18):
  - 14. The method as recited in claim 13, wherein the code contexts comprise fragments of code from one of the known malicious script or the known benign script.
  - 15. The method as recited in claim 13, wherein building the AST comprises:
    - de-obfuscating the one of the known malicious script or the known benign script; and
      
      running the de-obfuscated known malicious script or known benign script to determine the code contexts.
  - 16. The method as recited in claim 13, wherein the features of the known malicious script or the known benign script comprise one or more of structural features or content features.
  - 17. The method as recited in claim 13, wherein the features of the known malicious script or the known benign script are used to train a classifier, and wherein classifying the unclassified script as malicious or benign is implemented by the classifier.
  - 18. The method as recited in claim 17, wherein the classifier is configured to classify the unclassified script as malicious or benign by calculating a probability or a numeric score that the unclassified script is malicious or benign.
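
The matching step of claim 13, and the state machine mentioned in claims 3 and 11, can be illustrated with a trie-shaped automaton over feature strings. This is a minimal sketch, not the method the patent describes in its specification: representing features as plain strings, the majority-vote rule in `classify`, breaking ties toward benign, and all names are assumptions made for the example.

```python
def build_matcher(known_features):
    """Build a trie-shaped state machine from {feature_string: label}.

    transitions[state] maps a character to the next state; state 0 is
    the start state. accepting maps a final state to its label.
    """
    transitions = [{}]
    accepting = {}
    for feature, label in known_features.items():
        state = 0
        for ch in feature:
            next_state = transitions[state].get(ch)
            if next_state is None:
                transitions.append({})
                next_state = len(transitions) - 1
                transitions[state][ch] = next_state
            state = next_state
        accepting[state] = label
    return transitions, accepting

def match(matcher, feature):
    """Run one feature through the machine; return its label or None."""
    transitions, accepting = matcher
    state = 0
    for ch in feature:
        state = transitions[state].get(ch)
        if state is None:
            return None  # no known feature starts with this prefix
    return accepting.get(state)

def classify(matcher, features):
    """Hypothetical decision rule: majority vote of matched features."""
    votes = {"malicious": 0, "benign": 0}
    for feature in features:
        label = match(matcher, feature)
        if label is not None:
            votes[label] += 1
    # Ties, including no matches at all, fall back to "benign".
    return "malicious" if votes["malicious"] > votes["benign"] else "benign"
```

A probabilistic score, as in claim 18, could replace the vote; the automaton only answers which known features an unclassified script contains.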

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Zorn, Benjamin Goth; Livshits, Benjamin; Curtsinger, Charles M.; Seifert, Christian
Primary Examiner(s): Truong, Thanhnga B
Assistant Examiner(s): Jeudy, Josnel
Application Number: US13/031,061
Publication Number: US 20120216280A1
Time in Patent Office: 1,166 Days
Field of Search: 726/22-26, 713/187-188
US Class Current: 726/23
CPC Class Codes: G06N 7/01 Probabilistic graphical mod...