Text categorizers based on regularizing adaptations of the problem of computing linear separators
Abstract
A method for automatically categorizing messages or documents containing text. The method fits in the general framework of supervised learning, in which a rule or rules for categorizing data is automatically constructed by a computer on the basis of training data that has been labeled beforehand. More specifically, the method constructs a linear separator: training data are used to construct, for each category, a weight vector w and a threshold t, and the decision of whether a hitherto unseen document d is in the category depends on the outcome of the test w^T x ≧ t, where x is a vector derived from the document d. The method also uses a set L of features selected from the training data to construct the numerical vector representation x of a document. The preferred method uses an algorithm based on Gauss-Seidel iteration to determine the weight vector w defined by a regularized convex optimization problem derived from the principle of minimizing modified training error.
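The categorization test described in the abstract, w^T x ≧ t, amounts to a thresholded dot product. A minimal sketch (the function name and example values are hypothetical, not taken from the patent):

```python
def categorize(w, x, t):
    """Assign a document to the category iff the linear test w^T x >= t holds."""
    score = sum(wj * xj for wj, xj in zip(w, x))  # dot product w^T x
    return score >= t

# Example: a 3-feature document vector against a learned weight vector.
w = [0.5, -0.2, 1.0]
x = [2.0, 1.0, 0.0]   # feature values derived from the document
print(categorize(w, x, t=0.5))  # w^T x = 0.8 >= 0.5, so True
```

Each category gets its own (w, t) pair, so a document may be assigned to several categories independently.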
22 Claims
1. A method of supervised learning for the purpose of text categorization, said method comprising the steps of:
modifying training error for a set of categorized training documents by expressing the training error as a function which describes an ill-posed problem and replacing the ill-posed problem function with a convex function which approximates the ill-posed problem function;
deriving a convex optimization problem from the modified training error function by finding a weight vector w whose dot product with vectors derived from the training documents minimizes the modified training error function;
regularizing the convex optimization problem by adding a convex term, said convex term being dependent only on weight vector w;
solving the problem by relaxation, wherein components of the weight vector w are optimized one component at a time.
wherein the training error for a set of n training documents is defined by

Training error = (1/n) Σ_{i=1}^{n} f(w^T x_i y_i),

where f is a step function (f(x) = 1 if x ≦ 0, and f(x) = 0 otherwise), w is a weight vector, x_i is a vector derived from an ith training document, y_i ∈ {+1, −1} is a label designating whether or not the ith document is in a particular category C, w^T x_i is a dot product of w and x_i, and wherein the modified training error for a set of n training documents is defined by

Modified training error = (1/n) Σ_{i=1}^{n} f_mod(w^T x_i y_i),

where f_mod is a convex approximation of the step function f, thereby creating a convex optimization problem, said problem being to find a value w that minimizes the modified training error.
4. A method as recited in claim 3, further comprising the step of selecting values for parameters to define an empirical loss function F_{f,g,s}(D, w) of the form

F(D, w) = F_{f,g,s}(D, w) = (1/n) Σ_{i=1}^{n} f_mod(w^T x_i y_i) + s g(w),

thereby modifying the training error to create a regularized convex optimization problem, said problem being to find a value for w that minimizes F(D, w), where D is a set {x_i, y_i} of vectors derived from categorized training documents, g is a smooth function expressible as a sum of real-valued functions, each real-valued function being a function of exactly one component of the weight vector w, and s is a small positive real number establishing a balance in the resulting optimization problem between minimizing an error rate on the training data and ensuring weights are not too large.
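For illustration, the regularized loss of claim 4 can be evaluated directly once f_mod and g are chosen. The sketch below assumes one common convex surrogate, f_mod(v) = max(0, 1 − v)^2, and the separable regularizer g(w) = Σ_j w_j^2; neither choice is mandated by the claim, and all names are hypothetical:

```python
def empirical_loss(D, w, s):
    """F(D, w) = (1/n) * sum_i f_mod(w^T x_i * y_i) + s * g(w)."""
    def f_mod(v):                      # convex approximation of the 0/1 step function
        return max(0.0, 1.0 - v) ** 2
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    n = len(D)
    data_term = sum(f_mod(dot(w, x) * y) for x, y in D) / n
    g = sum(wj * wj for wj in w)       # smooth, a sum over single components of w
    return data_term + s * g

D = [([1.0, 0.0], +1), ([0.0, 1.0], -1)]
w = [1.0, -1.0]
print(empirical_loss(D, w, s=0.1))  # both margins equal 1, so the loss is s*g(w) = 0.2
```

The parameter s plays exactly the balancing role the claim describes: s = 0 minimizes training error alone, while larger s shrinks the weights.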
5. A method as recited in claim 4, wherein the step of solving the problem by relaxation further comprises the steps of:
assembling a set of tokenized representations of a set of training documents with a list L of selected features for the set of training documents for a selected category;
creating a set D of numerical vector representations of the set of training documents using the list L of selected features;
determining a weight vector w that minimizes F(D,w) by using a relaxation algorithm that takes the set D as input; and
providing the determined weight vector w and determined threshold t.
6. A method as recited in claim 5, wherein the step of creating the set D of numerical vector representations further comprises the steps of:
for each document d in the set D, inputting a tokenized representation of document d, and the list of selected features L for the training data for the selected category;
creating a vector of counts of occurrences of each feature in the set of features L for the inputted document d;
augmenting the vector created in the creating step by adding a final component set to a fixed non-zero value A, and producing a new augmented vector x; and
providing the augmented vector x as a numerical representation of the document d.
7. A method as recited in claim 6, further comprising the step of transforming each numerical vector representation of a training document d by normalizing each vector or by rounding down each component of a vector to a fixed minimum value.
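The vector construction of claims 6 and 7 (count occurrences of each feature in L, append a fixed non-zero component A, optionally normalize) can be sketched as follows; the names and the choice of Euclidean normalization are assumptions of this sketch:

```python
import math

def document_vector(tokens, L, A=1.0, normalize=False):
    """Count occurrences of each feature in L, then append the constant component A."""
    counts = [sum(1 for tok in tokens if tok == feat) for feat in L]
    x = counts + [A]                     # augmented vector, per claim 6
    if normalize:                        # optional transform, per claim 7
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    return x

L = ["price", "market", "stock"]
print(document_vector(["the", "market", "price", "market"], L))  # [1, 2, 0, 1.0]
```

The fixed component A lets the learned weight on that position act as a built-in bias term for the linear separator.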
8. A method of supervised learning for the purpose of text categorization, said method comprising the steps of:
selecting values for parameters which determine an empirical loss function F_{f,g,s}(D, w) of the form

F(D, w) = F_{f,g,s}(D, w) = (1/n) Σ_{i=1}^{n} f_mod(w^T x_i y_i) + s g(w),

thereby modifying by approximation a training error function for a set of n categorized training documents and then regularizing the function thereby approximated to create a regularized convex optimization problem, wherein the training error function to be thereby approximated is defined by

Training error = (1/n) Σ_{i=1}^{n} f(w^T x_i y_i),

where f is a step function (f(x) = 1 if x ≦ 0, and f(x) = 0 otherwise), w is a weight vector, D is a set {x_i, y_i} of vectors derived from categorized training documents such that x_i is a vector derived from an ith training document, y_i ∈ {+1, −1} is a label designating whether or not the ith document is in a particular category C, w^T x_i is a dot product of w and x_i, and wherein the training error function thereby modified by approximation for a set of n training documents is defined by

Modified training error = (1/n) Σ_{i=1}^{n} f_mod(w^T x_i y_i),

where f_mod is a convex approximation of the step function f, and wherein f_mod is regularized by the addition of a convex term g, g being a smooth function expressible as a sum of real-valued functions, each real-valued function being a function of exactly one component of the weight vector w, and s is a small positive real number establishing a balance in the resulting optimization problem between minimizing an error rate on the training data and ensuring weights are not too large;
assembling a set of tokenized representations of a set of training documents with a list L of selected features for the set of training documents for a selected category;
creating a set D of numerical vector representations of the set of training documents using the list L of selected features; and
determining a weight vector w that minimizes F(D, w) by using a relaxation algorithm that takes the set D as input, and providing the determined weight vector w by optimizing components of weight vector w one component at a time.
inputting a document d, where d is a member of the training set of documents D;
segmenting the document d into sections, each section being significant for categorization;
generating tokens for each section that contains text; and
providing a tokenized representation r of the document d from which a list of tokens in each section can be determined.
11. A method as recited in claim 10, further comprising the steps of:
assembling a set of tokenized representations including each document d of the training set of documents D; and
selecting a set of features L relevant to document membership in the selected category for the training set.
12. A method as recited in claim 10, further comprising the step of converting each generated token to a canonical form.
13. A method as recited in claim 10, further comprising the step of deleting tokens not useful for categorization.
14. A method as recited in claim 8, wherein said relaxation algorithm to determine the weight vector w comprises the steps of:
initializing w = 0 and r_i = w^T x_i;
picking a decreasing sequence 1 = c_1 ≧ c_2 ≧ . . . ≧ c_K = 0;
for k = 1, 2, . . . , K, defining a function C_k(x) = 1 if x ≦ 1, and C_k(x) = c_k otherwise; and
for j = 1, . . . , d, calculating an update to the jth component of w.
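The relaxation of claim 14 (initialize w = 0, maintain quantities r_i, and optimize the components of w one at a time) can be sketched as a Gauss-Seidel style coordinate descent. This sketch makes assumptions not fixed by the claim: the convex surrogate is taken to be f_mod(v) = (v − 1)^2, which gives a closed-form one-dimensional update; r_i here tracks w^T x_i y_i rather than w^T x_i; the C_k weighting sequence is omitted; and all names are hypothetical:

```python
def relax(X, y, s, sweeps=20):
    """Coordinate-wise minimization of (1/n) * sum_i (w^T x_i y_i - 1)^2 + s * ||w||^2."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    r = [0.0] * n                          # r_i = w^T x_i * y_i, kept up to date
    for _ in range(sweeps):
        for j in range(d):                 # optimize one component at a time
            num = sum((r[i] - 1.0) * X[i][j] * y[i] for i in range(n)) / n + s * w[j]
            den = sum(X[i][j] ** 2 for i in range(n)) / n + s
            delta = -num / den             # exact minimizer along coordinate j
            w[j] += delta
            for i in range(n):             # update the residuals incrementally
                r[i] += delta * X[i][j] * y[i]
    return w

X = [[1.0, 0.0], [0.0, 1.0]]
y = [+1, -1]
print(relax(X, y, s=0.01))  # first weight positive, second negative
```

Keeping the residuals r_i up to date is what makes each coordinate step cheap: no full recomputation of w^T x_i is needed per update.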
15. A method as recited in claim 14, further comprising the step of determining a threshold t to be used in conjunction with the weight vector w in the formulation of a rule for categorizing a document d with respect to membership in a category C, such that d is assigned to C if w^T x ≧ t, or d is not assigned to C if w^T x < t, wherein x is a vector derived from a document d, and wherein said determining is based on obtaining a desired trade-off between precision and recall.
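The threshold determination of claim 15 can be sketched as a sweep over candidate values of t, scoring each by precision and recall on the training documents. Maximizing the F1 measure is one possible trade-off criterion; the claim itself only requires some desired balance between precision and recall, and the names here are hypothetical:

```python
def choose_threshold(scores, labels):
    """Pick t maximizing F1 = 2PR/(P+R) over candidate thresholds drawn from w^T x_i."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for sc, yv in zip(scores, labels) if sc >= t and yv == +1)
        fp = sum(1 for sc, yv in zip(scores, labels) if sc >= t and yv == -1)
        fn = sum(1 for sc, yv in zip(scores, labels) if sc < t and yv == +1)
        p = tp / (tp + fp) if tp + fp else 0.0   # precision
        r = tp / (tp + fn) if tp + fn else 0.0   # recall
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

scores = [0.9, 0.8, 0.3, -0.2]   # w^T x for four training documents
labels = [+1, +1, -1, -1]
print(choose_threshold(scores, labels))  # 0.8 separates the two classes perfectly
```

A higher t trades recall for precision; sweeping only the observed scores suffices because precision and recall change only at those points.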
16. A method for predicting whether a document is a member of a category using a weight vector w, a threshold t, and a set L of selected features that were used in the determination of w, said method comprising the steps of:
inputting a document d;
creating a tokenized representation r for document d;
creating a vector x which represents d by using r and the selected features L;
determining whether the inner product of w and x is greater than or equal to t, and if so, assigning d to the category corresponding to this rule, but if not, refraining from assigning d to the category corresponding to this rule, wherein w is determined by:
modifying a training error function for a set of n categorized training documents, wherein the training error function to be modified is defined by

Training error = (1/n) Σ_{i=1}^{n} f(w^T x_i y_i),

where f is a step function (f(x) = 1 if x ≦ 0, and f(x) = 0 otherwise), w is a weight vector, x_i is a vector derived from an ith training document, y_i ∈ {+1, −1} is a label designating whether or not the ith document is in a particular category C, w^T x_i is a dot product of w and x_i, and wherein the modified training error function for a set of n training documents is defined by

Modified training error = (1/n) Σ_{i=1}^{n} f_mod(w^T x_i y_i),

where f_mod is a convex approximation of the step function f;
deriving a convex optimization problem from the modified training error function by finding a value of weight vector w that minimizes the modified training error function;
regularizing the convex optimization problem by adding a convex term, said convex term being dependent only on weight vector w; and
solving the problem by relaxation, thereby obtaining a weight vector w that minimizes the modified training error by optimizing components of weight vector w one component at a time.
inputting a tokenized representation of document d, and the list of selected features L for the training data D for the selected category;
creating a vector of counts of occurrences of each feature in the set of features L for the inputted document d;
augmenting the vector created in the creating step by adding a final component set to a fixed non-zero value A, and producing a new augmented vector x; and
providing the augmented vector x as a numerical representation of the document d.
19. A method as recited in claim 18, further comprising the step of transforming the vector x created in the creating step by normalizing or rounding down each component to a fixed minimum value.
20. A method as recited in claim 16, wherein the step of creating a tokenized representation r for each document d further comprises the steps of:
segmenting the document d into sections, each section being significant for categorization;
generating tokens for each section that contains text; and
providing a tokenized representation r of the document d from which a list of tokens in each section can be determined.
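The tokenization of claim 20 (segment the document into sections, then generate tokens for each section containing text) might look like the following sketch; the blank-line section delimiter and the lower-casing canonicalization are assumptions, not requirements of the claim:

```python
def tokenize(document):
    """Return a tokenized representation r: one list of tokens per section."""
    sections = document.split("\n\n")            # blank line taken as section delimiter
    r = []
    for section in sections:
        tokens = [tok.lower() for tok in section.split() if tok.strip()]
        if tokens:                               # keep only sections that contain text
            r.append(tokens)
    return r

doc = "Market Report\n\nStock prices rose."
print(tokenize(doc))  # [['market', 'report'], ['stock', 'prices', 'rose.']]
```

Keeping sections separate matters because a feature occurring in, say, a subject line can later be treated differently from one in the body.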
21. A system of supervised learning for the purpose of text categorization, comprising:
a computing device having memory and means for processing;
means for inputting a set of training documents;
means for modifying a training error for a set of categorized training documents by expressing the training error as a function which describes an ill-posed problem and replacing the ill-posed problem function with a convex function which approximates the ill-posed problem function;
means for deriving a convex optimization problem from the modified training error function by finding a weight vector w whose dot product with vectors derived from the training documents minimizes the modified training error function;
means for regularizing the convex optimization problem by adding a convex term, said convex term being dependent only on weight vector w; and
computer code in said memory implementing the step of solving the problem by relaxation, wherein components of the weight vector w are optimized one at a time, the solution to the problem being available for further processing using processing means in order to categorize text.
22. A system for predicting whether a document is a member of a category using a weight vector w, a threshold t, and a set L of selected features that were used in the determination of w, comprising:
a computing device having memory and means for processing;
means for inputting documents;
means for modifying a training error for a set of categorized training documents by expressing the training error as a function which describes an ill-posed problem and replacing the ill-posed problem function with a convex function which approximates the ill-posed problem function;
means for deriving a convex optimization problem from the modified training error function by finding a weight vector w whose dot product with vectors derived from the training documents minimizes the modified training error function;
means for regularizing the convex optimization problem by adding a convex term, said convex term being dependent only on weight vector w;
a first computer code section in said memory implementing the step of solving the convex optimization problem by relaxation, wherein components of the weight vector w are optimized one at a time, the solution to the problem being available for further processing using processing means in order to categorize text;
a second computer code section in said memory implementing the step of inputting a document d;
a third computer code section in said memory implementing the step of creating a tokenized representation r for document d;
a fourth computer code section in said memory implementing the step of creating a vector x which represents d by using r and the selected features L; and
a fifth computer code section in said memory implementing the step of determining whether the inner product of w and x is greater than or equal to t, and if so, assigning d to the category corresponding to this rule, but if not, refraining from assigning d to the category corresponding to this rule, wherein w is determined by processing the first code section in said memory to solve the problem by relaxation.
Specification