Data mining method and system for generating a decision tree classifier for data records based on a minimum description length (MDL) and presorting of records
First Claim
1. A method for generating a decision tree classifier from a training set of records, each record having at least one numeric attribute and a class label of a class to which the record belongs, the method comprising the steps of:
- pre-sorting the records based on the numeric attribute;
generating an attribute list for each attribute and a class list, each entry of the class list corresponding to a class label, the attribute list including values of the attribute and indices to the class labels;
creating a decision tree from the pre-sorted records, attribute lists, and class list, using a breadth-first process, the decision tree having a root node, a plurality of interior nodes, and a plurality of leaf nodes, the breadth-first process being such that the nodes of a same depth from the root node are formed in parallel; and
pruning the decision tree based on a Minimum Description Length (MDL) scheme to obtain the decision tree classifier, the MDL scheme encoding the decision tree as a model such that an encoding cost for describing the decision tree and the training set is minimized, and the method taking into account the encoding cost in the pruning.
1 Assignment
0 Petitions
Accused Products
Abstract
A method and apparatus are disclosed for generating a decision tree classifier from a training set of records. The method comprises the steps of: pre-sorting the records based on each numeric record attribute, creating a decision tree breadth-first, and pruning the tree based on the MDL principle. Preferably, the pre-sorting includes generating a class list and attribute lists, and independently sorting the numeric attribute lists. The growing of the tree includes evaluating possible splitting criteria and selecting a splitting test for each leaf node, based on a splitting index, and updating the class list to reflect new leaf nodes. In a preferred embodiment, the splitting index is a gini index. The pruning preferably includes encoding the decision tree and splitting tests in an MDL-based code, and determining whether to convert a node into a leaf node, prune its child nodes, or leave the node intact, based on the code length of the node.
130 Citations
42 Claims
-
1. A method for generating a decision tree classifier from a training set of records, each record having at least one numeric attribute and a class label of a class to which the record belongs, the method comprising the steps of:
-
pre-sorting the records based on the numeric attribute; generating an attribute list for each attribute and a class list, each entry of the class list corresponding to a class label, the attribute list including values of the attribute and indices to the class labels; creating a decision tree from the pre-sorted records, attribute lists, and class list, using a breadth-first process, the decision tree having a root node, a plurality of interior nodes, and a plurality of leaf nodes, the breadth-first process being such that the nodes of a same depth from the root node are formed in parallel; and pruning the decision tree based on a Minimum Description Length (MDL) scheme to obtain the decision tree classifier, the MDL scheme encoding the decision tree as a model such that an encoding cost for describing the decision tree and the training set is minimized, and the method taking into account the encoding cost in the pruning. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
-
-
15. A computer program product for use with a computer system for generating a decision tree classifier from a training set of records, each record having at least one numeric attribute and a class label of a class to which the record belongs, the computer program product comprising:
-
a recording medium; means, recorded on the recording medium, for directing the computer system to pre-sort the records based on the numeric attribute; means, recorded on the recording medium, for directing the computer system to generate an attribute list for each attribute and a class list, each entry of the class list corresponding to a class label, the attribute list including values of the attribute and indices to the class labels; means, recorded on the recording medium, for directing the computer system to create a decision tree from the pre-sorted records, attribute lists, and class list, using a breadth-first process, the decision tree having a root node, a plurality of interior nodes, and a plurality of leaf nodes, the breadth-first process being such that the nodes of a same depth from the root node are created in parallel; and means, recorded on the recording medium for directing the computer system to prune the decision tree based on a Minimum Description Length (MDL) scheme to obtain the decision tree classifier, the MDL scheme encoding the decision tree as a model such that an encoding cost for describing the decision tree and the training set is minimized, and the method taking into account the encoding cost in the pruning. - View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
-
-
29. A computer-based system for generating a decision tree classifier from a training set of records, each record having at least one numeric attribute and a class label of a class to which the record belongs, the system comprising:
-
means for pre-sorting the records based on the values of the numeric attribute; means for generating an attribute list for each attribute and a class list, each entry of the class list corresponding to a class label, the attribute list including values of the attribute and indices to the class labels; means for creating a decision tree from the pre-sorted records, attribute lists, and class list, using a breadth-first process, the decision tree having a root node, a plurality of interior nodes, and a plurality of leaf nodes, the breadth-first process being such that the nodes of a same depth from the root node are formed in parallel; and means for pruning the decision tree based on a Minimum Description Length (MDL) scheme to obtain the decision tree classifier, the MDL scheme encoding the decision tree as a model such that an encoding cost for describing the decision tree and the training set is minimized, and the method taking into account the encoding cost in the pruning. - View Dependent Claims (30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42)
-
Specification