Decision-tree-based symbolic rule induction system for text categorization
Abstract
A method to automatically categorize messages or documents containing text. The method of solution fits in the general framework of supervised learning, in which a rule or rules for categorizing data is automatically constructed by a computer on the basis of training data that has beforehand been categorized, i.e., each training data item has been labeled with the categories to which it belongs. More specifically, the method for rule induction involves the novel combination of (1) inducing from the training data a decision tree for each category, (2) automated construction from each decision tree of a simplified symbolic rule set that is logically equivalent overall to the decision tree, and which is to be used for categorization instead of the decision tree, and (3) determination of a confidence level for each rule. The method covers both decision-tree-based symbolic rule induction and the use for the purpose of document categorization of rules in the logical format of those generated by the rule induction procedure described herein.
21 Claims
-
1. A computer implemented method for generating decision-tree-based symbolic rule induction for general text categorization, wherein the text categorization provides superior performance in the areas of precision, recall, multiple categorization, provision of confidence levels, training speed, and insight and control, said method comprising the steps of:
-
accepting training data, the data comprising a set of training documents, wherein a document is any data item containing text;
creating a set TR of representations of the set of training documents, the representation being suitable for rule induction and being in terms of counts of occurrences of features in documents;
for each category C, generating R(C), a decision-tree-based set of symbolic rules;
combining the generated rule sets R(C) for all categories C into a single rule set, “RuleSet”;
computing a confidence level for each rule in RuleSet;
adding the computed confidence level to the corresponding rule; and
generating a final RuleSet comprising rules and corresponding confidence levels.
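The steps of claim 1 can be sketched in Python. This is a minimal illustrative stand-in, not the patented implementation: the rule-induction step below uses a crude per-feature heuristic in place of the decision-tree induction detailed in the dependent claims, and all names and the toy training data are invented for the example.

```python
# Sketch of the claim 1 pipeline: accept training documents, build count-based
# representations, induce a rule set per category, merge the sets, and attach
# a confidence level (here: the rule's precision on the training data) to each rule.
from collections import Counter

def represent(doc):
    """Bag-of-words representation: feature -> count of occurrences (the set TR)."""
    return Counter(doc.lower().split())

def induce_rules(category, labeled_docs):
    """Crude stand-in for per-category decision-tree induction: emit one rule
    per feature that occurs in every in-class training document."""
    in_class = Counter()
    n = 0
    for doc, cats in labeled_docs:
        if category in cats:
            n += 1
            in_class.update(represent(doc).keys())
    return [({feat: 1}, category) for feat, c in in_class.items() if c == n]

def confidence(rule, labeled_docs):
    """Confidence level as the rule's precision on labeled data (cf. claim 5)."""
    antecedent, category = rule
    fired = [cats for doc, cats in labeled_docs
             if all(represent(doc)[f] >= v for f, v in antecedent.items())]
    return sum(category in cats for cats in fired) / len(fired) if fired else 0.0

training = [("cheap pills buy now", {"spam"}),
            ("meeting notes buy agenda", {"work"}),
            ("cheap meeting pills", {"spam"})]

rule_set = []
for cat in ("spam", "work"):
    for rule in induce_rules(cat, training):
        rule_set.append((*rule, confidence(rule, training)))
```

Each entry of `rule_set` is an (antecedent, category, confidence) triple, mirroring the final RuleSet of the claim.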
2. A method as recited in claim 1, wherein the step of generating R(C) further comprises the steps of:
a) growing the decision tree starting from a root node corresponding to the set of all training documents by recursively splitting the set of training documents corresponding to a node under consideration, the first such node being the root node, into two subsets of documents, each subset of documents corresponding to a child of the node under consideration being split until no further progress can be made, where a minimum cost partition is used for each split; and
b) pruning the decision tree using tree smoothing.
-
-
3. A method as recited in claim 2, wherein the minimum cost partition at a node is computed from two 2-dimensional arrays, “inclass-count[f][ν]” and “total-count[f][ν]”, wherein inclass-count[f][ν] is a quantity of documents at a current node under consideration that are in a class C and for which a truncated count of occurrences of feature f is ν, and total-count[f][ν] is a total quantity of documents at the current node under consideration for which the truncated count of occurrences of feature f is ν, where f is an integer in the range of 1 to d and ν is an integer in the range 1 to V, and where d is a quantity of all features under consideration and V is selected as a maximum count recorded for the number of occurrences of a particular feature in a document.
-
4. A method as recited in claim 3, wherein V is in the range of 3 to 10.
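A hedged sketch of the two count arrays recited in claim 3, with occurrence counts truncated at V as claim 4 allows. The list-of-lists layout, zero-based feature indexing, and the sample V = 3 are illustrative choices, not the patented data structures.

```python
# Build inclass_count[f][v] and total_count[f][v] at a tree node, where v is
# the occurrence count of feature f truncated at V.
V = 3  # maximum recorded occurrence count (claim 4 allows V in 3..10)

def count_arrays(docs, in_class, num_features):
    """docs: list of per-document feature-count vectors;
    in_class: parallel list of booleans (document is in class C)."""
    inclass = [[0] * (V + 1) for _ in range(num_features)]
    total = [[0] * (V + 1) for _ in range(num_features)]
    for vec, member in zip(docs, in_class):
        for f in range(num_features):
            v = min(vec[f], V)          # truncated occurrence count
            total[f][v] += 1
            if member:
                inclass[f][v] += 1
    return inclass, total

inclass, total = count_arrays([[0, 5], [1, 2], [2, 0]], [True, False, True], 2)
```

The minimum cost partition of claim 3 would then be computed from these two arrays alone, which is what makes the split search cheap at each node.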
-
5. A method as recited in claim 1, wherein the confidence level for each rule in RuleSet is computed using a precision of each individual rule as measured on a set of data which has been labeled with categories, wherein the data set used is one of the training data used for rule induction and a separate data set selected for measuring confidence levels of the rules.
-
6. A method as recited in claim 1, wherein determination of the confidence levels for rules is performed concurrently with the step of combining the generated rule sets R(C) for all categories C.
-
7. A method as recited in claim 1, wherein the step of creating a set TR of representations of the set of training documents, further comprises the steps of:
-
inputting a document d;
segmenting the document d into sections, the sections being significant for categorization;
tokenizing each section of the document d that contains text; and
generating a representation r of the tokenized document d from which a list of tokens in each section is determined.
-
-
8. A method as recited in claim 7, further comprising the step of performing at least one of:
-
(i) converting all tokens to canonical form; and
(ii) deleting stopwords not useful for categorization.
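The representation steps of claims 7 and 8 can be sketched as follows; the section names, the regex-based tokenizer, the lowercasing as canonical form, and the stopword list are assumptions for illustration.

```python
# Segment a document into sections, tokenize each section, convert tokens to
# a canonical form (here: lowercase), and delete stopwords (claims 7-8).
import re

STOPWORDS = {"the", "a", "an", "of", "to"}  # illustrative stopword list

def tokenize_document(doc):
    """doc: mapping of section name -> text. Returns section -> token list,
    i.e. the representation r of the tokenized document."""
    rep = {}
    for section, text in doc.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())  # canonical form
        rep[section] = [t for t in tokens if t not in STOPWORDS]
    return rep

r = tokenize_document({"subject": "The Offer", "body": "Reply to the offer now!"})
```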
-
-
9. A method as recited in claim 1, wherein the step of generating R(C) further comprises the steps of:
-
specifying a category C for which a set of decision-tree-based symbolic rules is to be induced;
inputting a set TR of representations of tokenized training documents, each representation having a set of features;
creating a list L(C) of features selected from the collection of all features occurring in the TR in order to construct a representation of each document that is specific to the category C;
creating a set D(C) of category specific representations of tokenized training documents based on the list L(C) of selected features;
inducing a decision tree T(C) for the category C; and
converting the decision tree T(C) into a logically equivalent set of rules R(C).
-
-
10. A method as recited in claim 9, wherein the representation of a document specific to the category C is a numerical vector representation of the document, where each component of the numerical vector is a numerical value corresponding to a single specific feature in L(C).
-
11. A method as recited in claim 10, wherein the numerical representation limits the numerical components of a vector to being one of the consecutive integers 0, 1, 2, . . . , V, such that:
-
(i) an integer ν less than V signifies that the corresponding feature in L(C) occurs in a document exactly ν times, and
(ii) V signifies that the corresponding feature in L(C) occurs in a document at least V times.
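A minimal sketch of the truncated-count vector of claims 10 and 11, assuming a small illustrative feature list L(C) and V = 3.

```python
# Category-specific numeric vector over L(C): component v < V means the feature
# occurs exactly v times; component V means it occurs at least V times.
V = 3

def vectorize(tokens, feature_list):
    return [min(tokens.count(f), V) for f in feature_list]

# "buy" occurs 4 times (>= V, recorded as V), "now" once, "meeting" not at all.
vec = vectorize(["buy", "buy", "buy", "buy", "now"], ["buy", "now", "meeting"])
```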
-
-
12. A method as recited in claim 9, wherein the decision tree T(C) is converted into a simplified set of rules.
-
13. A method as recited in claim 9, wherein the step of inducing a decision tree T(C), further comprises the steps of:
-
inputting a category C and the set D(C) of category specific representations of tokenized training documents based on the list L(C) of selected features relevant to category C;
employing a greedy algorithm to grow a decision tree T(C), the tree T(C) classifying the training data, the algorithm taking advantage of sparse data using a modified entropy function to measure impurity of a split at a tree node, thereby determining a test used at a node; and
pruning the decision tree T(C) by smoothing, thereby obtaining the final decision tree T(C).
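The greedy split selection of claim 13 might look like the following sketch. The patent's "modified entropy function" is not reproduced here, so standard binary entropy stands in as the impurity measure, and the exhaustive scan over features and thresholds is an illustrative simplification.

```python
# Greedy choice of the test at a tree node: pick the (feature, threshold) pair
# minimizing the weighted entropy of the two resulting child nodes.
import math

def entropy(pos, total):
    """Binary entropy of a node with `pos` in-class documents out of `total`."""
    if total == 0 or pos in (0, total):
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(vectors, labels):
    """vectors: truncated-count vectors; labels: 1 if in category C, else 0.
    Returns (cost, feature, threshold) of the minimum-impurity split."""
    best = None
    for f in range(len(vectors[0])):
        for t in sorted({v[f] for v in vectors}):
            left = [l for v, l in zip(vectors, labels) if v[f] <= t]
            right = [l for v, l in zip(vectors, labels) if v[f] > t]
            if not left or not right:
                continue  # degenerate split, no progress
            cost = (len(left) * entropy(sum(left), len(left))
                    + len(right) * entropy(sum(right), len(right))) / len(vectors)
            if best is None or cost < best[0]:
                best = (cost, f, t)
    return best

b = best_split([[0, 2], [1, 0], [0, 3], [2, 0]], [1, 0, 1, 0])
```

A sparse-data-aware implementation would compute these costs from the count arrays of claim 3 rather than rescanning the documents.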
-
-
14. A method as recited in claim 9, wherein the step of converting the decision tree T(C) into a logically equivalent set of rules R(C), further comprises the steps of:
-
inputting a decision tree T(C) whose leaf nodes correspond to one of membership in the category C and non-membership in the category C;
assembling into a list N those nodes of T(C) corresponding to membership in category C; and
for each node B in N, generating a rule of the form: ANTECEDENT → C.
-
-
15. A method as recited in claim 14, wherein construction of the ANTECEDENT comprises the steps of:
-
for each node A on the path from B to the root of the tree, marking the parent of node A as possibly contributing to the conjunction in the ANTECEDENT only if node A satisfies the Sibling Criterion;
forming a conjunction of tests, or negations of the tests, depending on which branch of the tree is taken on the path from the root to B, corresponding to each node marked in the marking step, resulting in ANTECEDENT1; and
identifying a simplified equivalent test by combining similar terms in ANTECEDENT1, resulting in ANTECEDENT.
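Claims 14 and 15 (tree-to-rule conversion with combination of similar terms) can be sketched as below. The tuple encoding of the tree is an assumption, and the Sibling Criterion marking step of claim 15 is omitted; the sketch shows only the extraction of root-to-leaf paths and the combination of similar terms into a simplified ANTECEDENT.

```python
# A node is (feature, threshold, left_subtree, right_subtree); a leaf is True
# (membership in C) or False. The left branch means "feature <= threshold".

def leaf_paths(tree, path=()):
    """Yield the conjunction of tests leading to each in-category leaf."""
    if tree is True:
        yield path
    elif tree is False:
        return
    else:
        f, t, left, right = tree
        yield from leaf_paths(left, path + ((f, "<=", t),))
        yield from leaf_paths(right, path + ((f, ">", t),))

def simplify(path):
    """Combine similar terms: keep the tightest bound per (feature, direction),
    turning ANTECEDENT1 into the simplified ANTECEDENT."""
    bounds = {}
    for f, op, t in path:
        key = (f, op)
        keep = min if op == "<=" else max
        bounds[key] = t if key not in bounds else keep(bounds[key], t)
    return sorted((f, op, t) for (f, op), t in bounds.items())

tree = (0, 1, False, (0, 2, (1, 0, False, True), True))
rules = [simplify(p) for p in leaf_paths(tree)]  # one ANTECEDENT per leaf in N
```

In the second rule, the two tests on feature 0 (`> 1` and `> 2`) collapse into the single tighter bound `> 2`, which is the term combination the claim describes.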
-
-
16. A method of categorizing text using decision-tree-based symbolic rule induction categorization rules, said method for categorization comprising the steps of:
-
inputting a rule set, “RuleSet”, the rule set being a set of rules formed by decision-tree-based symbolic rule induction;
inputting a document d;
creating a representation r of document d, in a manner corresponding to the step of creating a set of representations of a set of training documents as used in the rule induction resulting in RuleSet; and
returning all categories and confidence levels from selected rules in RuleSet, wherein the selected rules have a true antecedent when applied to r, thereby resulting in a categorization of r, wherein the rules of RuleSet are induced by:
accepting training data, the data comprising a set of training documents, wherein a document is any data item containing text, creating a set TR of representations of the set of training documents, the representation being suitable for rule induction, for each category C, generating R(C), a decision-tree-based set of symbolic rules, combining the generated rule sets R(C) for all categories C into a single rule set, “RuleSet”, computing a confidence level for each rule in RuleSet, adding the computed confidence level to the corresponding rule, and generating a final RuleSet comprising rules and corresponding confidence levels.
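The categorization step of claim 16 reduces to evaluating each rule's antecedent against the document representation and returning the categories and confidence levels of the rules that fire; a minimal sketch, with an illustrative rule encoding of (antecedent, category, confidence) triples:

```python
# Apply an induced RuleSet to a new document: a rule fires when its antecedent
# is true for the document's feature counts.
from collections import Counter

def categorize(doc_text, rule_set):
    """rule_set: list of (antecedent, category, confidence), where an
    antecedent maps a feature to its minimum required occurrence count."""
    counts = Counter(doc_text.lower().split())
    return [(cat, conf) for antecedent, cat, conf in rule_set
            if all(counts[f] >= v for f, v in antecedent.items())]

rules = [({"invoice": 1}, "billing", 0.9),
         ({"cheap": 1, "pills": 1}, "spam", 0.8)]
result = categorize("cheap pills now", rules)
```

Because every firing rule is returned, a document can receive multiple categories, each with its own confidence level, which is the multiple-categorization behavior the claims emphasize.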
-
-
17. A system for generating decision-tree-based symbolic rule induction for general text categorization, wherein the text categorization provides superior performance in the areas of precision, recall, multiple categorization, provision of confidence levels, training speed, and insight and control, comprising:
-
a computing device having memory, non-volatile storage, a processing unit, and means for input/output;
means for accepting training data by the computing device, the data comprising a set of training documents, wherein a document is any data item containing text;
means for creating a set TR of representations of the set of training documents to be further processed by the computing device processing unit, the representation being suitable for rule induction and being in terms of counts of occurrences of features in documents;
means for generating R(C), a decision-tree-based set of symbolic rules, for each category C, wherein the generating means utilizes the set of TR representations;
means for combining the generated rule sets R(C) for all categories C into a single rule set, “RuleSet”;
means for computing a confidence level for each rule in RuleSet;
means for adding the computed confidence level to the corresponding rule; and
means for generating a final RuleSet comprising rules and corresponding confidence levels, wherein the final RuleSet is stored in the computing device non-volatile storage unit for later use.
-
Specification