Decision tree induction that is sensitive to attribute computational complexity
First Claim
1. A computer-implemented method for constructing a decision tree for classifying computer files based on the computational complexities of attributes of the files, comprising:
- creating a plurality of attribute vectors for a plurality of training files of known classification, each attribute vector comprising values of a predetermined set of attributes for an associated training file;
determining a complexity score for each attribute in the predetermined set of attributes, the complexity score measuring a cost associated with determining a value of an associated attribute for a file; and
growing a decision tree based on the plurality of attribute vectors, comprising;
(1) setting the plurality of attribute vectors as a current set,(2) determining a weighted impurity reduction score for at least one attribute of the predetermined set of attributes based on the complexity score of the attribute, the weighted impurity reduction score quantifying a cost-benefit tradeoff for an associated attribute in classifying the current set,(3) selecting a splitting attribute from the at least one attribute of the predetermined set of attributes,(4) splitting the current set into subsets using the splitting attribute, and(5) repeating steps (2) through (4) for each of the subsets as the current set.
2 Assignments
0 Petitions
Accused Products
Abstract
A decision tree for classifying computer files is constructed. Computational complexities of a set of candidate attributes are determined. A set of attribute vectors are created for a set of training files with known classification. A node is created to represent the set. A weighted impurity reduction score is calculated for each candidate attribute based on the computational complexity of the attribute. If a stopping criterion is satisfied then the node is set as a leaf node. Otherwise the node is set as a branch node and the attribute with the highest weighted impurity reduction score is selected as the splitting attribute for the branch node. The set of attribute vectors are split into subsets based on their attribute values of the splitting attribute. The above process is repeated for each subset. The tree is then pruned based on the computational complexities of the splitting attributes.
35 Citations
20 Claims
-
1. A computer-implemented method for constructing a decision tree for classifying computer files based on the computational complexities of attributes of the files, comprising:
-
creating a plurality of attribute vectors for a plurality of training files of known classification, each attribute vector comprising values of a predetermined set of attributes for an associated training file; determining a complexity score for each attribute in the predetermined set of attributes, the complexity score measuring a cost associated with determining a value of an associated attribute for a file; and growing a decision tree based on the plurality of attribute vectors, comprising; (1) setting the plurality of attribute vectors as a current set, (2) determining a weighted impurity reduction score for at least one attribute of the predetermined set of attributes based on the complexity score of the attribute, the weighted impurity reduction score quantifying a cost-benefit tradeoff for an associated attribute in classifying the current set, (3) selecting a splitting attribute from the at least one attribute of the predetermined set of attributes, (4) splitting the current set into subsets using the splitting attribute, and (5) repeating steps (2) through (4) for each of the subsets as the current set. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A computer system for constructing a decision tree for classifying computer files based on the computational complexities of attributes of the files, comprising:
-
a computer-readable storage medium storing executable computer program code; and a processor for executing the executable computer program code, wherein said executable computer program code comprising; an attribute complexity determination module for creating a plurality of attribute vectors for a plurality of training files of known classification, each attribute vector comprising values of a predetermined set of attributes for an associated training file, and determining a complexity score for each attribute in the predetermined set of attributes, the complexity score measuring a cost associated with determining a value of an associated attribute for a file; and a decision tree construction module for growing a decision tree based on the plurality of attribute vectors, comprising;
(1) setting the plurality of attribute vectors as a current set, (2) determining a weighted impurity reduction score for at least one attribute of the predetermined set of attributes based on the complexity score of the attribute, the weighted impurity reduction score quantifying a cost-benefit tradeoff for an associated attribute in classifying the current set, (3) selecting a splitting attribute from the at least one attribute of the predetermined set of attributes, (4) splitting the current set into subsets using the splitting attribute, and (5) repeating steps (2) through (4) for each of the subsets as the current set. - View Dependent Claims (9, 10, 11, 12, 13, 14)
-
-
15. A non-transitory computer-readable storage medium encoded with executable computer program code for constructing a decision tree for classifying computer files based on the computational complexities of attributes of the files, the computer program code comprising program code when executes by a processor causes a computer to perform the steps:
-
creating a plurality of attribute vectors for a plurality of training files of known classification, each attribute vector comprising values of a predetermined set of attributes for an associated training file; determining a complexity score for each attribute in the predetermined set of attributes, the complexity score measuring a cost associated with determining a value of an associated attribute for a file; and growing a decision tree based on the plurality of attribute vectors, comprising;
(1) setting the plurality of attribute vectors as a current set, (2) determining a weighted impurity reduction score for at least one attribute of the predetermined set of attributes based on the complexity score of the attribute, the weighted impurity reduction score quantifying a cost-benefit tradeoff for an associated attribute in classifying the current set, (3) selecting a splitting attribute from the at least one attribute of the predetermined set of attributes, (4) splitting the current set into subsets using the splitting attribute, and (5) repeating steps (2) through (4) for each of the subsets as the current set. - View Dependent Claims (16, 17, 18, 19, 20)
-
Specification