Text categorizers based on regularizing adaptations of the problem of computing linear separators
Abstract
A method for automatically categorizing messages or documents containing text. The method fits in the general framework of supervised learning, in which a rule or rules for categorizing data is automatically constructed by a computer on the basis of training data that has been labeled beforehand. More specifically, the method constructs a linear separator: training data are used to construct, for each category, a weight vector w and a threshold t, and the decision of whether a hitherto unseen document d is in the category depends on the outcome of the test w^T x ≧ t, where x is a vector derived from the document d. The method also uses a set L of features selected from the training data to construct the numerical vector representation x of a document. The preferred method uses an algorithm based on Gauss-Seidel iteration to determine the weight vector w defined by a regularized convex optimization problem derived from the principle of minimizing modified training error.
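The categorization test described in the abstract, w^T x ≧ t, amounts to a thresholded dot product. A minimal sketch (the function name and example values are hypothetical, not taken from the patent):

```python
def categorize(w, x, t):
    """Assign a document to the category iff the linear test w^T x >= t holds."""
    score = sum(wj * xj for wj, xj in zip(w, x))  # dot product w^T x
    return score >= t

# Example: a 3-feature document vector against a learned weight vector.
w = [0.5, -0.2, 1.0]
x = [2.0, 1.0, 0.0]   # feature values derived from the document
print(categorize(w, x, t=0.5))  # w^T x = 0.8 >= 0.5, so True
```

Each category gets its own (w, t) pair, so a document may be assigned to several categories independently.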
22 Claims
1. A method of supervised learning for the purpose of text categorization, said method comprising the steps of:
modifying training error for a set of categorized training documents by expressing the training error as a function which describes an ill-posed problem and replacing the ill-posed problem function with a convex function which approximates the ill-posed problem function;
deriving a convex optimization problem from the modified training error function by finding a weight vector w whose dot product with vectors derived from the training documents minimizes the modified training error function;
regularizing the convex optimization problem by adding a convex term, said convex term being dependent only on weight vector w;
solving the problem by relaxation, wherein components of the weight vector w are optimized one component at a time.
wherein the training error for a set of n training documents is defined by

Training error = (1/n) Σ_{i=1}^{n} f(w^T x_i y_i),

where f is a step function (f(x) = 1 if x ≦ 0, and f(x) = 0 otherwise), w is a weight vector, x_i is a vector derived from an ith training document, y_i ∈ {+1, −1} is a label designating whether or not the ith document is in a particular category C, w^T x_i is a dot product of w and x_i, and wherein the modified training error for a set of n training documents is defined by

Modified training error = (1/n) Σ_{i=1}^{n} f_mod(w^T x_i y_i),

where f_mod is a convex approximation of the step function f, thereby creating a convex optimization problem, said problem being to find a value w that minimizes the modified training error.
4. A method as recited in claim 3, further comprising the step of selecting values for parameters to define an empirical loss function F_{f,g,s}(D, w) of the form

F(D, w) = F_{f,g,s}(D, w) = (1/n) Σ_{i=1}^{n} f_mod(w^T x_i y_i) + s g(w),

thereby modifying the training error to create a regularized convex optimization problem, said problem being to find a value for w that minimizes F(D, w), where D is a set {x_i, y_i} of vectors derived from categorized training documents, g is a smooth function expressible as a sum of real-valued functions, each real-valued function being a function of exactly one component of the weight vector w, and s is a small positive real number establishing a balance in the resulting optimization problem between minimizing an error rate on the training data and ensuring weights are not too large.
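For illustration, the regularized loss of claim 4 can be evaluated directly once f_mod and g are chosen. The sketch below assumes one common convex surrogate, f_mod(v) = max(0, 1 − v)^2, and the separable regularizer g(w) = Σ_j w_j^2; neither choice is mandated by the claim, and all names are hypothetical:

```python
def empirical_loss(D, w, s):
    """F(D, w) = (1/n) * sum_i f_mod(w^T x_i * y_i) + s * g(w)."""
    def f_mod(v):                      # convex approximation of the 0/1 step function
        return max(0.0, 1.0 - v) ** 2
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    n = len(D)
    data_term = sum(f_mod(dot(w, x) * y) for x, y in D) / n
    g = sum(wj * wj for wj in w)       # smooth, a sum over single components of w
    return data_term + s * g

D = [([1.0, 0.0], +1), ([0.0, 1.0], -1)]
w = [1.0, -1.0]
print(empirical_loss(D, w, s=0.1))  # both margins equal 1, so the loss is s*g(w) = 0.2
```

The parameter s plays exactly the balancing role the claim describes: s = 0 minimizes training error alone, while larger s shrinks the weights.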
5. A method as recited in claim 4, wherein the step of solving the problem by relaxation further comprises the steps of:
assembling a set of tokenized representations of a set of training documents with a list L of selected features for the set of training documents for a selected category;
creating a set D of numerical vector representations of the set of training documents using the list L of selected features;
determining a weight vector w that minimizes F(D,w) by using a relaxation algorithm that takes the set D as input; and
providing the determined weight vector w and determined threshold t.
6. A method as recited in claim 5, wherein the step of creating the set D of numerical vector representations further comprises the steps of:
for each document d in the set D, inputting a tokenized representation of document d, and the list of selected features L for the training data for the selected category;
creating a vector of counts of occurrences of each feature in the set of features L for the inputted document d;
augmenting the vector created in the creating step by adding a final component set to a fixed non-zero value A, and producing a new augmented vector x; and
providing the augmented vector x as a numerical representation of the document d.
7. A method as recited in claim 6, further comprising the step of transforming each numerical vector representation of a training document d by normalizing each vector or by rounding down each component of a vector to a fixed minimum value.
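The vector construction of claims 6 and 7 (count occurrences of each feature in L, append a fixed non-zero component A, optionally normalize) can be sketched as follows; the names and the choice of Euclidean normalization are assumptions of this sketch:

```python
import math

def document_vector(tokens, L, A=1.0, normalize=False):
    """Count occurrences of each feature in L, then append the constant component A."""
    counts = [sum(1 for tok in tokens if tok == feat) for feat in L]
    x = counts + [A]                     # augmented vector, per claim 6
    if normalize:                        # optional transform, per claim 7
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    return x

L = ["price", "market", "stock"]
print(document_vector(["the", "market", "price", "market"], L))  # [1, 2, 0, 1.0]
```

The fixed component A lets the learned weight on that position act as a built-in bias term for the linear separator.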
8. A method of supervised learning for the purpose of text categorization, said method comprising the steps of:
selecting values for parameters which determine an empirical loss function F_{f,g,s}(D, w) of the form

F(D, w) = F_{f,g,s}(D, w) = (1/n) Σ_{i=1}^{n} f_mod(w^T x_i y_i) + s g(w),

thereby modifying by approximation a training error function for a set of n categorized training documents and then regularizing the function thereby approximated to create a regularized convex optimization problem, wherein the training error function to be thereby approximated is defined by

Training error = (1/n) Σ_{i=1}^{n} f(w^T x_i y_i),

where f is a step function (f(x) = 1 if x ≦ 0, and f(x) = 0 otherwise), w is a weight vector, D is a set {x_i, y_i} of vectors derived from categorized training documents such that x_i is a vector derived from an ith training document, y_i ∈ {+1, −1} is a label designating whether or not the ith document is in a particular category C, w^T x_i is a dot product of w and x_i, and wherein the training error function thereby modified by approximation for a set of n training documents is defined by

Modified training error = (1/n) Σ_{i=1}^{n} f_mod(w^T x_i y_i),

where f_mod is a convex approximation of the step function f, and wherein f_mod is regularized by the addition of a convex term g, g being a smooth function expressible as a sum of real-valued functions, each real-valued function being a function of exactly one component of the weight vector w, and s is a small positive real number establishing a balance in the resulting optimization problem between minimizing an error rate on the training data and ensuring weights are not too large;
assembling a set of tokenized representations of a set of training documents with a list L of selected features for the set of training documents for a selected category;
creating a set D of numerical vector representations of the set of training documents using the list L of selected features; and
determining a weight vector w that minimizes F(D, w) by using a relaxation algorithm that takes the set D as input, and providing the determined weight vector w by optimizing components of weight vector w one component at a time.
inputting a document d, where d is a member of the training set of documents D;
segmenting the document d into sections, each section being significant for categorization;
generating tokens for each section that contains text; and
providing a tokenized representation r of the document d from which a list of tokens in each section can be determined.
11. A method as recited in claim 10, further comprising the steps of:
assembling a set of tokenized representations including each document d of the training set of documents D; and
selecting a set of features L relevant to document membership in the selected category for the training set.
12. A method as recited in claim 10, further comprising the step of converting each generated token to a canonical form.
13. A method as recited in claim 10, further comprising the step of deleting tokens not useful for categorization.
14. A method as recited in claim 8, wherein said relaxation algorithm to determine the weight vector w comprises the steps of:
initializing w = 0 and r_i = w^T x_i;
picking a decreasing sequence 1 = c_1 ≧ c_2 ≧ . . . ≧ c_K = 0;
for k = 1, 2, . . . , K, defining a function C_k(x) = 1 if x ≦ 1, and C_k(x) = c_k otherwise; and
for j = 1, . . . , d, calculating an update to the jth component of w.
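The relaxation of claim 14 (initialize w = 0, maintain quantities r_i, and optimize the components of w one at a time) can be sketched as a Gauss-Seidel style coordinate descent. This sketch makes assumptions not fixed by the claim: the convex surrogate is taken to be f_mod(v) = (v − 1)^2, which gives a closed-form one-dimensional update; r_i here tracks w^T x_i y_i rather than w^T x_i; the C_k weighting sequence is omitted; and all names are hypothetical:

```python
def relax(X, y, s, sweeps=20):
    """Coordinate-wise minimization of (1/n) * sum_i (w^T x_i y_i - 1)^2 + s * ||w||^2."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    r = [0.0] * n                          # r_i = w^T x_i * y_i, kept up to date
    for _ in range(sweeps):
        for j in range(d):                 # optimize one component at a time
            num = sum((r[i] - 1.0) * X[i][j] * y[i] for i in range(n)) / n + s * w[j]
            den = sum(X[i][j] ** 2 for i in range(n)) / n + s
            delta = -num / den             # exact minimizer along coordinate j
            w[j] += delta
            for i in range(n):             # update the residuals incrementally
                r[i] += delta * X[i][j] * y[i]
    return w

X = [[1.0, 0.0], [0.0, 1.0]]
y = [+1, -1]
print(relax(X, y, s=0.01))  # first weight positive, second negative
```

Keeping the residuals r_i up to date is what makes each coordinate step cheap: no full recomputation of w^T x_i is needed per update.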
15. A method as recited in claim 14, further comprising the step of determining a threshold t to be used in conjunction with the weight vector w in the formulation of a rule for categorizing a document d with respect to membership in a category C, such that d is assigned to C if w^T x ≧ t, or d is not assigned to C if w^T x < t, wherein x is a vector derived from a document d, and wherein said determining is based on obtaining a desired trade-off between precision and recall.
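The threshold determination of claim 15 can be sketched as a sweep over candidate values of t, scoring each by precision and recall on the training documents. Maximizing the F1 measure is one possible trade-off criterion; the claim itself only requires some desired balance between precision and recall, and the names here are hypothetical:

```python
def choose_threshold(scores, labels):
    """Pick t maximizing F1 = 2PR/(P+R) over candidate thresholds drawn from w^T x_i."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for sc, yv in zip(scores, labels) if sc >= t and yv == +1)
        fp = sum(1 for sc, yv in zip(scores, labels) if sc >= t and yv == -1)
        fn = sum(1 for sc, yv in zip(scores, labels) if sc < t and yv == +1)
        p = tp / (tp + fp) if tp + fp else 0.0   # precision
        r = tp / (tp + fn) if tp + fn else 0.0   # recall
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

scores = [0.9, 0.8, 0.3, -0.2]   # w^T x for four training documents
labels = [+1, +1, -1, -1]
print(choose_threshold(scores, labels))  # 0.8 separates the two classes perfectly
```

A higher t trades recall for precision; sweeping only the observed scores suffices because precision and recall change only at those points.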
16. A method for predicting whether a document is a member of a category using a weight vector w, a threshold t, and a set L of selected features that were used in the determination of w, said method comprising the steps of:
inputting a document d;
creating a tokenized representation r for document d;
creating a vector x which represents d by using r and the selected features L;
determining whether the inner product of w and x is greater than or equal to t, and if so, assigning d to the category corresponding to this rule, but if not, refraining from assigning d to the category corresponding to this rule, wherein w is determined by:
modifying a training error function for a set of n categorized training documents, wherein the training error function to be modified is defined by

Training error = (1/n) Σ_{i=1}^{n} f(w^T x_i y_i),

where f is a step function (f(x) = 1 if x ≦ 0, and f(x) = 0 otherwise), w is a weight vector, x_i is a vector derived from an ith training document, y_i ∈ {+1, −1} is a label designating whether or not the ith document is in a particular category C, w^T x_i is a dot product of w and x_i, and wherein the modified training error function for a set of n training documents is defined by

Modified training error = (1/n) Σ_{i=1}^{n} f_mod(w^T x_i y_i),

where f_mod is a convex approximation of the step function f;
deriving a convex optimization problem from the modified training error function by finding a value of weight vector w that minimizes the modified training error function;
regularizing the convex optimization problem by adding a convex term, said convex term being dependent only on weight vector w; and
solving the problem by relaxation, thereby obtaining a weight vector w that minimizes the modified training error by optimizing components of weight vector w one component at a time.
inputting a tokenized representation of document d, and the list of selected features L for the training data D for the selected category;
creating a vector of counts of occurrences of each feature in the set of features L for the inputted document d;
augmenting the vector created in the creating step by adding a final component set to a fixed non-zero value A, and producing a new augmented vector x; and
providing the augmented vector x as a numerical representation of the document d.
19. A method as recited in claim 18, further comprising the step of transforming the vector x created in the creating step by normalizing or rounding down each component to a fixed minimum value.
20. A method as recited in claim 16, wherein the step of creating a tokenized representation r for each document d further comprises the steps of:
segmenting the document d into sections, each section being significant for categorization;
generating tokens for each section that contains text; and
providing a tokenized representation r of the document d from which a list of tokens in each section can be determined.
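The tokenization of claim 20 (segment the document into sections, then generate tokens for each section containing text) might look like the following sketch; the blank-line section delimiter and the lower-casing canonicalization are assumptions, not requirements of the claim:

```python
def tokenize(document):
    """Return a tokenized representation r: one list of tokens per section."""
    sections = document.split("\n\n")            # blank line taken as section delimiter
    r = []
    for section in sections:
        tokens = [tok.lower() for tok in section.split() if tok.strip()]
        if tokens:                               # keep only sections that contain text
            r.append(tokens)
    return r

doc = "Market Report\n\nStock prices rose."
print(tokenize(doc))  # [['market', 'report'], ['stock', 'prices', 'rose.']]
```

Keeping sections separate matters because a feature occurring in, say, a subject line can later be treated differently from one in the body.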
21. A system of supervised learning for the purpose of text categorization, comprising:
a computing device having memory and means for processing;
means for inputting a set of training documents;
means for modifying a training error for a set of categorized training documents by expressing the training error as a function which describes an ill-posed problem and replacing the ill-posed problem function with a convex function which approximates the ill-posed problem function;
means for deriving a convex optimization problem from the modified training error function by finding a weight vector w whose dot product with vectors derived from the training documents minimizes the modified training error function;
means for regularizing the convex optimization problem by adding a convex term, said convex term being dependent only on weight vector w; and
computer code in said memory implementing the step of solving the problem by relaxation, wherein components of the weight vector w are optimized one at a time, the solution to the problem being available for further processing using processing means in order to categorize text.
22. A system for predicting whether a document is a member of a category using a weight vector w, a threshold t, and a set L of selected features that were used in the determination of w, comprising:
a computing device having memory and means for processing;
means for inputting documents;
means for modifying a training error for a set of categorized training documents by expressing the training error as a function which describes an ill-posed problem and replacing the ill-posed problem function with a convex function which approximates the ill-posed problem function;
means for deriving a convex optimization problem from the modified training error function by finding a weight vector w whose dot product with vectors derived from the training documents minimizes the modified training error function;
means for regularizing the convex optimization problem by adding a convex term, said convex term being dependent only on weight vector w;
a first computer code section in said memory implementing the step of solving the convex optimization problem by relaxation, wherein components of the weight vector w are optimized one at a time, the solution to the problem being available for further processing using processing means in order to categorize text;
a second computer code section in said memory implementing the step of inputting a document d;
a third computer code section in said memory implementing the step of creating a tokenized representation r for document d;
a fourth computer code section in said memory implementing the step of creating a vector x which represents d by using r and the selected features L; and
a fifth computer code section in said memory implementing the step of determining whether the inner product of w and x is greater than or equal to t, and if so, assigning d to the category corresponding to this rule, but if not, refraining from assigning d to the category corresponding to this rule, wherein w is determined by processing the first code section in said memory to solve the problem by relaxation.
Specification