Rule induction on large noisy data sets
First Claim
1. A method practiced in a computer system which includes a processor and a memory system of inducing sets of classification logic rules for classifying data items from an example dataset of the data items, the sets of classification logic rules and the example dataset being stored in the memory system and the method comprising the steps performed in the processor of:
- inducing a first rule set from the example dataset according to a predetermined method, the first rule set being substantially smaller than a largest rule set producible by the predetermined method, and storing the first rule set in the memory system; and
optimizing the first rule set with regard to the largest rule set to produce a second rule set.
4 Assignments
0 Petitions
Accused Products
Abstract
Efficient techniques for inducing rules used in classifying data items on a noisy data set. The prior-art IREP technique, which produces a set of classification rules by inducing each rule and then pruning it and continuing thus until a stopping condition is reached, is improved with a new rule-value metric for stopping pruning and with a stopping condition which depends on the description length of the rule set. The rule set which results from the improved IREP technique is then optimized by pruning rules from the set to minimize the description length and further optimized by making a replacement rule and a modified rule for each rule and using the description length to determine whether to use the replacement rule, the modified rule, or the original rule in the rule set. Further improvement is achieved by inducing rules for data items not covered by the original set and then pruning these rules. Still further improvement is gained by repeating the steps of inducing rules for data items not covered, pruning the rules, optimizing the rules, and again pruning for a fixed number of times. The fully-developed technique has the O(nlog2 n) running time characteristic of IREP, but produces rule sets which do a substantially better job of classification than those produced by IREP.
-
Citations
18 Claims
-
1. A method practiced in a computer system which includes a processor and a memory system of inducing sets of classification logic rules for classifying data items from an example dataset of the data items, the sets of classification logic rules and the example dataset being stored in the memory system and the method comprising the steps performed in the processor of:
-
inducing a first rule set from the example dataset according to a predetermined method, the first rule set being substantially smaller than a largest rule set producible by the predetermined method, and storing the first rule set in the memory system; and optimizing the first rule set with regard to the largest rule set to produce a second rule set. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
-
-
16. A method practiced in a computer system which includes a processor and a memory system of inducing a set of classification logic rules for classifying data items from an example dataset of the data items, the rules and the example dataset being stored in the memory system and the method comprising the steps performed in the processor for each rule of:
-
inducing the rule on the example dataset; adding the rule to the set of classification logic rules; computing a description length of the set of classification logic rules with the added rule; and terminating the method if the description length satisfies a predetermined condition. - View Dependent Claims (17, 18)
-
Specification