Text classification by weighted proximal support vector machine based on positive and negative sample sizes and weights
First Claim
1. A system for training a text classifier, the system comprising:
a memory storing computer-executable instructions that implement:
a text data preprocessor that preprocesses raw training text to produce an input matrix, the raw training text including documents and indications of whether each document is a positive or a negative training example of a classification; and
a module for solving a weighted proximal support vector machine equation, comprising:
a weighting module that generates a weighted matrix by re-weighting the input matrix based on how many training examples are positive and how many training examples are negative, wherein the weighting is based on satisfying the following equation:
N₊δ₊² = N₋δ₋²
where N₊ and N₋ denote the numbers of positive and negative training examples and δ₊ and δ₋ denote the weights of the positive and negative training examples; and
a model-vector generator that iteratively calculates a model vector based on the weighted matrix using a proximal support vector machine model; and
a processor for executing the computer-executable instructions stored in the memory.
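The weight constraint above fixes only the ratio of the two class weights, so one weight can be anchored and the other derived from the class counts. The following is a minimal sketch, not the patented method; the function name and the choice to anchor the larger class's weight at 1.0 are illustrative assumptions:

```python
import math

def estimate_class_weights(n_pos: int, n_neg: int) -> tuple[float, float]:
    """Return (delta_pos, delta_neg) satisfying N+ * delta+^2 == N- * delta-^2.

    The constraint determines only the ratio delta+/delta-, so we
    arbitrarily anchor the weight of the larger class at 1.0; the
    minority class then receives the larger weight.
    """
    if n_pos >= n_neg:
        delta_pos = 1.0
        delta_neg = math.sqrt(n_pos / n_neg)
    else:
        delta_neg = 1.0
        delta_pos = math.sqrt(n_neg / n_pos)
    return delta_pos, delta_neg

# 100 positive vs 900 negative examples: positives get the larger weight.
dp, dn = estimate_class_weights(100, 900)
```

With 100 positive and 900 negative examples this yields δ₊ = 3 and δ₋ = 1, since 100·3² = 900·1², so errors on the minority class cost more.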
Abstract
Embodiments of the invention relate to improvements to the support vector machine (SVM) classification model. When text data is significantly unbalanced (i.e., positive and negative labeled data are in disproportion), the classification quality of a standard SVM deteriorates. Embodiments of the invention are directed to a weighted proximal SVM (WPSVM) model that achieves substantially the same accuracy as the traditional SVM model while requiring significantly less computational time. A WPSVM model in accordance with embodiments of the invention may include a weight for each training error and a method for estimating the weights, which automatically solves the unbalanced-data problem. Instead of solving the optimization problem via the KKT (Karush-Kuhn-Tucker) conditions and the Sherman-Morrison-Woodbury formula, embodiments of the invention use an iterative algorithm to solve an unconstrained optimization problem, which makes WPSVM suitable for classifying relatively high-dimensional data.
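The abstract's iterative, unconstrained formulation can be illustrated with a toy solver. The sketch below minimizes a weighted proximal-SVM objective, ½‖w‖² + C/2 · Σᵢ δᵢ²(1 − yᵢxᵢᵀw)², by plain gradient descent. The exact objective form, the NumPy dependency, and gradient descent standing in for the patent's (unspecified here) iterative algorithm are all assumptions:

```python
import numpy as np

def wpsvm_fit(X, y, delta, C=1.0, lr=0.01, iters=500):
    """Gradient-descent sketch of a weighted proximal SVM.

    Minimizes  1/2 * ||w||^2 + C/2 * sum_i delta_i^2 * (1 - y_i * x_i.w)^2,
    i.e. squared (proximal) errors weighted per training example.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)               # y_i * (x_i . w)
        resid = delta**2 * (margins - 1.0)  # weighted proximal residuals
        grad = w + C * (X.T @ (resid * y))  # gradient of the objective
        w -= lr * grad
    return w
```

Because the objective is an unconstrained convex quadratic, any first-order or conjugate-gradient method converges; the squared penalty is what distinguishes the proximal variant from a standard hinge-loss SVM.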
20 Citations
20 Claims
1. A system for training a text classifier, the system comprising:
a memory storing computer-executable instructions that implement:
a text data preprocessor that preprocesses raw training text to produce an input matrix, the raw training text including documents and indications of whether each document is a positive or a negative training example of a classification; and
a module for solving a weighted proximal support vector machine equation, comprising:
a weighting module that generates a weighted matrix by re-weighting the input matrix based on how many training examples are positive and how many training examples are negative, wherein the weighting is based on satisfying the following equation:
N₊δ₊² = N₋δ₋²
where N₊ and N₋ denote the numbers of positive and negative training examples and δ₊ and δ₋ denote the weights of the positive and negative training examples; and
a model-vector generator that iteratively calculates a model vector based on the weighted matrix using a proximal support vector machine model; and
a processor for executing the computer-executable instructions stored in the memory.
(Dependent claims: 2, 3, 4)
5. A system for classifying text, the system comprising:
a memory storing computer-executable instructions that implement:
a text data preprocessor that preprocesses raw text to be classified to produce a vector representation of the text, the raw training text including documents and indications of whether each document is a positive or a negative training example of a classification;
a model-vector reader that reads a model vector, the model vector being generated by solving a weighted proximal support vector machine equation by generating a weighted matrix by re-weighting the input matrix based on how many training examples are positive and how many training examples are negative, wherein the weighting is based on satisfying the following equation:
N₊δ₊² = N₋δ₋²
where N₊ and N₋ denote the numbers of positive and negative training examples and δ₊ and δ₋ denote the weights of the positive and negative training examples, and iteratively calculating the model vector based on the weighted matrix using a proximal support vector machine model;
a classifier that generates a classification result based on the vector representation of the text and based on the read model vector; and
a processor for executing the computer-executable instructions stored in the memory.
(Dependent claims: 6, 7, 8, 9)
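At prediction time, the classifier of claim 5 reduces to a dot product between the document's vector representation and the read model vector, with the sign giving the class. A minimal sketch; the explicit `bias` parameter is an assumption, since some proximal-SVM formulations instead fold the bias into w via a constant feature:

```python
import numpy as np

def classify(x_vec, model_w, bias=0.0):
    """Label a document's feature vector using a learned model vector.

    The document is classified as positive when its signed distance to
    the separating plane, x.w + b, is non-negative, else negative.
    """
    score = float(np.dot(x_vec, model_w) + bias)
    return 1 if score >= 0 else -1
```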
10. A computer-readable medium containing computer-executable instructions for training a text classifier and classifying text by performing steps comprising:
representing input training text as a sparse matrix, the raw training text including documents and indications of whether each document is a positive or a negative training example of a classification;
setting a plurality of classifier-training parameters;
iteratively solving a weighted proximal support vector machine equation by generating a weighted matrix by re-weighting the input matrix based on how many training examples are positive and how many training examples are negative, wherein the weighting is based on satisfying the following equation:
N₊δ₊² = N₋δ₋²
where N₊ and N₋ denote the numbers of positive and negative training examples and δ₊ and δ₋ denote the weights of the positive and negative training examples, and iteratively calculating a model vector based on the weighted matrix using a proximal support vector machine model; and
predicting respective classes for a plurality of test examples.
(Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
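The first step of claim 10, representing input training text as a sparse matrix, can be sketched as a bag-of-words mapping from term indices to counts; only nonzero entries are stored per row. This is an illustrative preprocessing stand-in, not the patented method, and the function and variable names are assumptions:

```python
def to_sparse_rows(docs):
    """Represent whitespace-tokenized documents as sparse bag-of-words rows.

    Each row is a dict mapping a term index to that term's count in the
    document, mirroring a sparse matrix; the vocabulary is built on the
    fly so the column space grows with the corpus.
    """
    vocab = {}
    rows = []
    for doc in docs:
        row = {}
        for term in doc.split():
            j = vocab.setdefault(term, len(vocab))  # assign next free column
            row[j] = row.get(j, 0) + 1              # count only nonzeros
        rows.append(row)
    return rows, vocab

rows, vocab = to_sparse_rows(["good great good", "bad awful"])
```

Text corpora are dominated by absent terms, which is why the claim's sparse representation keeps both memory and the iterative solver's matrix-vector products proportional to the number of nonzeros.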
Specification