Decision tree induction that is sensitive to attribute computational complexity

US 8,190,647 B1
Filed: 09/15/2009
Issued: 05/29/2012
Est. Priority Date: 09/15/2009
Status: Active Grant

First Claim

1. A computer-implemented method for constructing a decision tree for classifying computer files based on the computational complexities of attributes of the files, comprising:

creating a plurality of attribute vectors for a plurality of training files of known classification, each attribute vector comprising values of a predetermined set of attributes for an associated training file;

determining a complexity score for each attribute in the predetermined set of attributes, the complexity score measuring a cost associated with determining a value of an associated attribute for a file; and

growing a decision tree based on the plurality of attribute vectors, comprising:

(1) setting the plurality of attribute vectors as a current set,

(2) determining a weighted impurity reduction score for at least one attribute of the predetermined set of attributes based on the complexity score of the attribute, the weighted impurity reduction score quantifying a cost-benefit tradeoff for an associated attribute in classifying the current set,

(3) selecting a splitting attribute from the at least one attribute of the predetermined set of attributes,

(4) splitting the current set into subsets using the splitting attribute, and

(5) repeating steps (2) through (4) for each of the subsets as the current set.
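The first two steps of the claim (building attribute vectors and scoring each attribute's cost) can be sketched as follows. This is a minimal illustration, not the patented implementation: the three extractors in `EXTRACTORS` are hypothetical, and measuring cost as average wall-clock time per file is only one assumed way to obtain a complexity score.

```python
import time
from typing import Callable, Dict, List

# Hypothetical attribute extractors: attribute name -> function(file bytes) -> value.
Extractor = Callable[[bytes], int]

EXTRACTORS: Dict[str, Extractor] = {
    "size_over_1k": lambda data: int(len(data) > 1024),
    "has_mz_header": lambda data: int(data[:2] == b"MZ"),
    "high_byte_ratio": lambda data: int(sum(b > 127 for b in data) * 2 > len(data)),
}


def attribute_vector(data: bytes) -> Dict[str, int]:
    """Values of the predetermined attribute set for one file."""
    return {name: fn(data) for name, fn in EXTRACTORS.items()}


def complexity_scores(samples: List[bytes], repeats: int = 3) -> Dict[str, float]:
    """Assumed cost measure: average seconds needed to compute each attribute for one file."""
    if not samples:
        raise ValueError("need at least one sample file")
    scores: Dict[str, float] = {}
    for name, fn in EXTRACTORS.items():
        start = time.perf_counter()
        for _ in range(repeats):
            for data in samples:
                fn(data)
        scores[name] = (time.perf_counter() - start) / (repeats * len(samples))
    return scores
```

Timing is noisy, so a real system might instead assign costs from operation counts or from an expert-defined table; the claim only requires that the score measure a cost of determining the attribute value.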

Abstract

A decision tree for classifying computer files is constructed. Computational complexities of a set of candidate attributes are determined. A set of attribute vectors is created for a set of training files with known classification. A node is created to represent the set. A weighted impurity reduction score is calculated for each candidate attribute based on the computational complexity of the attribute. If a stopping criterion is satisfied, the node is set as a leaf node. Otherwise, the node is set as a branch node, and the attribute with the highest weighted impurity reduction score is selected as the splitting attribute for the branch node. The set of attribute vectors is split into subsets based on the values of the splitting attribute. The above process is repeated for each subset. The tree is then pruned based on the computational complexities of the splitting attributes.
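The induction loop described in the abstract can be sketched in a few lines. The source does not name an impurity measure or a weighting formula, so this sketch assumes Gini impurity and divides the impurity reduction by `1 + cost`; the names `grow`, `weighted_score`, and `min_size` are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Row = Tuple[Dict[str, int], str]  # (attribute vector, known classification)


def gini(rows: List[Row]) -> float:
    """Gini impurity of the class labels in rows (assumed impurity measure)."""
    n = len(rows)
    if n == 0:
        return 0.0
    counts = Counter(label for _, label in rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split(rows: List[Row], attr: str) -> Dict[int, List[Row]]:
    """Partition rows by their value of attr."""
    groups: Dict[int, List[Row]] = {}
    for vec, label in rows:
        groups.setdefault(vec[attr], []).append((vec, label))
    return groups


def impurity_reduction(rows: List[Row], attr: str) -> float:
    n = len(rows)
    after = sum(len(g) / n * gini(g) for g in split(rows, attr).values())
    return gini(rows) - after


def weighted_score(rows: List[Row], attr: str, cost: Dict[str, float]) -> float:
    # Assumed cost-benefit tradeoff: benefit divided by (1 + complexity score).
    return impurity_reduction(rows, attr) / (1.0 + cost[attr])


@dataclass
class Node:
    label: str
    attr: Optional[str] = None  # None marks a leaf node
    children: Dict[int, "Node"] = field(default_factory=dict)


def grow(rows: List[Row], attrs: List[str], cost: Dict[str, float], min_size: int = 2) -> Node:
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    # Stopping criteria (assumed): pure set, too few rows, or no attributes left.
    if gini(rows) == 0.0 or len(rows) < min_size or not attrs:
        return Node(label=majority)
    best = max(attrs, key=lambda a: weighted_score(rows, a, cost))
    if impurity_reduction(rows, best) <= 0.0:
        return Node(label=majority)
    node = Node(label=majority, attr=best)
    rest = [a for a in attrs if a != best]
    for value, subset in split(rows, best).items():
        node.children[value] = grow(subset, rest, cost, min_size)
    return node
```

When two attributes separate the training set equally well, the cheaper one gets the higher weighted score and becomes the splitting attribute, which is the cost-sensitive behavior the abstract describes.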

20 Claims

1. A computer-implemented method for constructing a decision tree for classifying computer files based on the computational complexities of attributes of the files, comprising:
- creating a plurality of attribute vectors for a plurality of training files of known classification, each attribute vector comprising values of a predetermined set of attributes for an associated training file;
  
  determining a complexity score for each attribute in the predetermined set of attributes, the complexity score measuring a cost associated with determining a value of an associated attribute for a file; and
  
  growing a decision tree based on the plurality of attribute vectors, comprising:
  
  (1) setting the plurality of attribute vectors as a current set,
  
  (2) determining a weighted impurity reduction score for at least one attribute of the predetermined set of attributes based on the complexity score of the attribute, the weighted impurity reduction score quantifying a cost-benefit tradeoff for an associated attribute in classifying the current set,
  
  (3) selecting a splitting attribute from the at least one attribute of the predetermined set of attributes,
  
  (4) splitting the current set into subsets using the splitting attribute, and
  
  (5) repeating steps (2) through (4) for each of the subsets as the current set.
- - 2. The computer-implemented method of claim 1, wherein growing the decision tree further comprises:
    - (6) responsive to a stopping criterion being satisfied, creating a leaf node for the current set, the leaf node representing a classification of training files associated with the current set; and
      
      (7) responsive to no stopping criterion being satisfied, creating a branch node for the current set, the splitting attribute being associated with the branch node.
  - 3. The computer-implemented method of claim 1, wherein determining the weighted impurity reduction score comprises:
    - determining an impurity reduction score for the at least one of the predetermined set of attributes, wherein the impurity reduction score for an attribute measures how well the attribute separates the current set and the weighted impurity reduction score for the attribute is determined based on the impurity reduction score and the complexity score of the attribute.
  - 4. The computer-implemented method of claim 1, wherein determining the weighted impurity reduction score comprises:
    - determining a weight value for each of the at least one of the predetermined set of attributes by applying a weight function to the complexity score of each of the at least one of the predetermined set of attributes, the weight value for an attribute measuring a significance of the complexity score of the attribute for the current set, wherein the weighted impurity reduction score for the attribute is determined based on the weight value of the attribute.
  - 5. The computer-implemented method of claim 4, wherein the weight function takes into consideration a size of the current set and a depth of a node for the current set in the decision tree to generate the weight value.
  - 6. The computer-implemented method of claim 1, further comprising:
    - pruning the decision tree based on a plurality of examining files and the complexity scores for splitting attributes of the decision tree.
  - 7. The computer-implemented method of claim 1, wherein the known classification of the training files comprises legitimate file and malware, and wherein the decision tree is used by a client system to detect malware in the client system.
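Claims 3 through 5 describe a weight function applied to the complexity score, one that also considers the size of the current set and the depth of its node. The claims do not give a formula, so the one below is purely an assumption: the penalty grows with the fraction of training files that reach the node (those files would all pay the attribute's cost) and shrinks with depth. The parameter `alpha` is a hypothetical tuning knob.

```python
def weight(complexity: float, set_size: int, total_size: int, depth: int, alpha: float = 1.0) -> float:
    """Assumed weight in (0, 1]: 1.0 means the attribute's cost is ignored."""
    if total_size <= 0 or set_size < 0 or depth < 0:
        raise ValueError("sizes and depth must be non-negative, total_size positive")
    reach = set_size / total_size  # share of training files reaching this node
    penalty = alpha * complexity * reach / (1 + depth)
    return 1.0 / (1.0 + penalty)


def weighted_impurity_reduction(
    impurity_reduction: float,
    complexity: float,
    set_size: int,
    total_size: int,
    depth: int,
    alpha: float = 1.0,
) -> float:
    """Impurity reduction scaled by the weight value of the attribute."""
    return impurity_reduction * weight(complexity, set_size, total_size, depth, alpha)
```

Under this assumed formula, an attribute with complexity 4.0 is weighted 0.2 at the root of a 100-file training set, but about 0.91 at depth 3 for a 10-file subset, so expensive attributes are tolerated more readily deep in the tree where few files reach them.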

8. A computer system for constructing a decision tree for classifying computer files based on the computational complexities of attributes of the files, comprising:
- a computer-readable storage medium storing executable computer program code; and
  
  a processor for executing the executable computer program code, wherein said executable computer program code comprises:
  
  an attribute complexity determination module for creating a plurality of attribute vectors for a plurality of training files of known classification, each attribute vector comprising values of a predetermined set of attributes for an associated training file, and determining a complexity score for each attribute in the predetermined set of attributes, the complexity score measuring a cost associated with determining a value of an associated attribute for a file; and
  
  a decision tree construction module for growing a decision tree based on the plurality of attribute vectors, comprising:
  
  (1) setting the plurality of attribute vectors as a current set,
  
  (2) determining a weighted impurity reduction score for at least one attribute of the predetermined set of attributes based on the complexity score of the attribute, the weighted impurity reduction score quantifying a cost-benefit tradeoff for an associated attribute in classifying the current set,
  
  (3) selecting a splitting attribute from the at least one attribute of the predetermined set of attributes,
  
  (4) splitting the current set into subsets using the splitting attribute, and
  
  (5) repeating steps (2) through (4) for each of the subsets as the current set.
- - 9. The computer system of claim 8, wherein growing the decision tree further comprises:
    - (6) responsive to a stopping criterion being satisfied, creating a leaf node for the current set, the leaf node representing a classification of training files associated with the current set; and
      
      (7) responsive to no stopping criterion being satisfied, creating a branch node for the current set, the splitting attribute being associated with the branch node.
  - 10. The computer system of claim 8, wherein determining the weighted impurity reduction score comprises:
    - determining an impurity reduction score for the at least one of the predetermined set of attributes, wherein the impurity reduction score for an attribute measures how well the attribute separates the current set and the weighted impurity reduction score for the attribute is determined based on the impurity reduction score and the complexity score of the attribute.
  - 11. The computer system of claim 8, wherein determining the weighted impurity reduction score comprises:
    - determining a weight value for each of the at least one of the predetermined set of attributes by applying a weight function to the complexity score of each of the at least one of the predetermined set of attributes, the weight value for an attribute measuring a significance of the complexity score of the attribute for the current set, wherein the weighted impurity reduction score for the attribute is determined based on the weight value of the attribute.
  - 12. The computer system of claim 11, wherein the weight function takes into consideration a size of the current set and a depth of a node for the current set in the decision tree to generate the weight value.
  - 13. The computer system of claim 8, wherein the decision tree construction module is further configured for pruning the decision tree based on a plurality of examining files and the complexity scores for splitting attributes of the decision tree.
  - 14. The computer system of claim 8, wherein the known classification of the training files comprises legitimate file and malware, and wherein the decision tree is used by a client system to detect malware in the client system.
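Claim 14 places the finished tree on a client system for malware detection, which is where attribute cost matters: the client computes only the attributes on the path a file takes through the tree. The sketch below shows that lazy evaluation and tallies the cost paid. The tuple-based tree encoding, the extractor names, and the cost values are all hypothetical.

```python
from typing import Callable, Dict, Tuple, Union

# A leaf is a class label; a branch is (attribute name, {attribute value: subtree}).
Tree = Union[str, Tuple[str, Dict[int, "Tree"]]]


def classify(
    tree: Tree,
    extractors: Dict[str, Callable[[bytes], int]],
    costs: Dict[str, float],
    data: bytes,
) -> Tuple[str, float]:
    """Walk the tree, computing only attributes on the path; return (label, cost paid)."""
    spent = 0.0
    node = tree
    while not isinstance(node, str):
        attr, children = node
        value = extractors[attr](data)  # computed on demand, not up front
        spent += costs[attr]
        node = children[value]  # raises KeyError for a value unseen in training
    return node, spent


# Hypothetical extractors, costs, and a hand-built tree for illustration.
EXTRACTORS = {
    "has_mz_header": lambda data: int(data[:2] == b"MZ"),
    "high_byte_ratio": lambda data: int(sum(b > 127 for b in data) * 2 > len(data)),
}
COSTS = {"has_mz_header": 1.0, "high_byte_ratio": 50.0}
TREE: Tree = (
    "has_mz_header",
    {0: "legitimate", 1: ("high_byte_ratio", {0: "legitimate", 1: "malware"})},
)
```

Because the cheap header check sits at the root, a file without the header is classified after paying a cost of 1.0, and the expensive attribute is computed only for files that reach the second level.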

15. A non-transitory computer-readable storage medium encoded with executable computer program code for constructing a decision tree for classifying computer files based on the computational complexities of attributes of the files, the computer program code comprising program code that, when executed by a processor, causes a computer to perform the steps of:
- creating a plurality of attribute vectors for a plurality of training files of known classification, each attribute vector comprising values of a predetermined set of attributes for an associated training file;
  
  determining a complexity score for each attribute in the predetermined set of attributes, the complexity score measuring a cost associated with determining a value of an associated attribute for a file; and
  
  growing a decision tree based on the plurality of attribute vectors, comprising:
  
  (1) setting the plurality of attribute vectors as a current set,
  
  (2) determining a weighted impurity reduction score for at least one attribute of the predetermined set of attributes based on the complexity score of the attribute, the weighted impurity reduction score quantifying a cost-benefit tradeoff for an associated attribute in classifying the current set,
  
  (3) selecting a splitting attribute from the at least one attribute of the predetermined set of attributes,
  
  (4) splitting the current set into subsets using the splitting attribute, and
  
  (5) repeating steps (2) through (4) for each of the subsets as the current set.
- - 16. The computer-readable storage medium of claim 15, wherein growing the decision tree further comprises:
    - (6) responsive to a stopping criterion being satisfied, creating a leaf node for the current set, the leaf node representing a classification of training files associated with the current set; and
      
      (7) responsive to no stopping criterion being satisfied, creating a branch node for the current set, the splitting attribute being associated with the branch node.
  - 17. The computer-readable storage medium of claim 15, wherein determining the weighted impurity reduction score comprises:
    - determining an impurity reduction score for the at least one of the predetermined set of attributes, wherein the impurity reduction score for an attribute measures how well the attribute separates the current set and the weighted impurity reduction score for the attribute is determined based on the impurity reduction score and the complexity score of the attribute.
  - 18. The computer-readable storage medium of claim 15, wherein determining the weighted impurity reduction score comprises:
    - determining a weight value for each of the at least one of the predetermined set of attributes by applying a weight function to the complexity score of each of the at least one of the predetermined set of attributes, the weight value for an attribute measuring a significance of the complexity score of the attribute for the current set, wherein the weighted impurity reduction score for the attribute is determined based on the weight value of the attribute.
  - 19. The computer-readable storage medium of claim 18, wherein the weight function takes into consideration a size of the current set and a depth of a node for the current set in the decision tree to generate the weight value.
  - 20. The computer-readable storage medium of claim 15, wherein the computer program code further comprises program code for:
    - pruning the decision tree based on a plurality of examining files and the complexity scores for splitting attributes of the decision tree.
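The pruning step in claims 6, 13, and 20 uses a set of examining files together with the complexity scores of the splitting attributes. The claims do not specify the criterion, so the sketch below assumes a reduced-error style rule with a cost penalty: a branch is collapsed into a leaf when doing so does not increase the number of misclassified examining files plus `lam` times the attribute cost those files incur. The dictionary tree encoding and the parameter `lam` are illustrative.

```python
from typing import Dict, List, Tuple

Row = Tuple[Dict[str, int], str]  # (attribute vector, known classification)


def leaf(label: str) -> dict:
    return {"label": label, "attr": None, "children": {}}


def branch(attr: str, label: str, children: Dict[int, dict]) -> dict:
    return {"label": label, "attr": attr, "children": children}


def objective(node: dict, rows: List[Row], cost: Dict[str, float], lam: float) -> float:
    """Assumed objective: misclassifications plus lam * attribute cost paid by rows."""
    attr = node["attr"]
    if attr is None:
        return float(sum(1 for _, y in rows if y != node["label"]))
    total = lam * cost[attr] * len(rows)  # every row reaching this node pays for attr
    for value, child in node["children"].items():
        subset = [(v, y) for v, y in rows if v[attr] == value]
        total += objective(child, subset, cost, lam)
    # Rows with a value unseen during growth fall back to the node's majority label.
    unseen = [(v, y) for v, y in rows if v[attr] not in node["children"]]
    total += sum(1 for _, y in unseen if y != node["label"])
    return total


def prune(node: dict, rows: List[Row], cost: Dict[str, float], lam: float) -> dict:
    """Bottom-up pruning; returns a new tree and leaves the input untouched."""
    attr = node["attr"]
    if attr is None:
        return node
    children = {
        value: prune(child, [(v, y) for v, y in rows if v[attr] == value], cost, lam)
        for value, child in node["children"].items()
    }
    kept = branch(attr, node["label"], children)
    collapsed = leaf(node["label"])
    if objective(collapsed, rows, cost, lam) <= objective(kept, rows, cost, lam):
        return collapsed
    return kept
```

With `lam = 0` this reduces to plain reduced-error pruning; raising `lam` makes the procedure more willing to drop splits on expensive attributes that buy only a small accuracy gain.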

Current Assignee: CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee: Symantec Corporation (NortonLifeLock Inc.)
Inventors: Pereira, Shane; Ramzan, Zulfikar; Satish, Sourabh
Primary Examiner(s): Corrielus, Jean M
Application Number: US12/560,298
Time in Patent Office: 987 Days
Field of Search: 707/792, 707/780, 707/749, 706/20, 706/48, 706/52
US Class Current: 707/792
CPC Class Codes: G06F 21/562 Static detection; G06F 21/566 Dynamic detection, i.e. det...
