Text classification by weighted proximal support vector machine based on positive and negative sample sizes and weights
First Claim
1. A system for training a text classifier, the system comprising:
a memory storing computer-executable instructions that implement:
a text data preprocessor that preprocesses raw training text to produce an input matrix, the raw training text including documents and indications of whether each document is a positive or a negative training example of a classification; and
a module for solving a weighted proximal support vector machine equation, comprising:
a weighting module that generates a weighted matrix by re-weighting the input matrix based on how many training examples are positive and how many training examples are negative, wherein the weighting is based on satisfying the following equation:
N₊δ₊² = N₋δ₋²
where N₊ and N₋ denote the numbers of positive and negative training examples and δ₊ and δ₋ denote the weights of the positive and negative training examples; and
a model-vector generator that iteratively calculates a model vector based on the weighted matrix using a proximal support vector machine model; and
a processor for executing the computer-executable instructions stored in the memory.
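The weight constraint above fixes only the ratio of the two class weights, so one weight can be anchored and the other derived from the class counts. The following is a minimal sketch, not the patented method; the function name and the choice to anchor the larger class's weight at 1.0 are illustrative assumptions:

```python
import math

def estimate_class_weights(n_pos: int, n_neg: int) -> tuple[float, float]:
    """Return (delta_pos, delta_neg) satisfying N+ * delta+^2 == N- * delta-^2.

    The constraint determines only the ratio delta+/delta-, so we
    arbitrarily anchor the weight of the larger class at 1.0; the
    minority class then receives the larger weight.
    """
    if n_pos >= n_neg:
        delta_pos = 1.0
        delta_neg = math.sqrt(n_pos / n_neg)
    else:
        delta_neg = 1.0
        delta_pos = math.sqrt(n_neg / n_pos)
    return delta_pos, delta_neg

# 100 positive vs 900 negative examples: positives get the larger weight.
dp, dn = estimate_class_weights(100, 900)
```

With 100 positive and 900 negative examples this yields δ₊ = 3 and δ₋ = 1, since 100·3² = 900·1², so errors on the minority class cost more.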
Abstract
Embodiments of the invention relate to improvements to the support vector machine (SVM) classification model. When text data is significantly unbalanced (i.e., positive and negative labeled data are in disproportion), the classification quality of a standard SVM deteriorates. Embodiments of the invention are directed to a weighted proximal SVM (WPSVM) model that achieves substantially the same accuracy as the traditional SVM model while requiring significantly less computational time. A WPSVM model in accordance with embodiments of the invention may include a weight for each training error and a method for estimating the weights, which automatically solves the unbalanced-data problem. Instead of solving the optimization problem via the KKT (Karush-Kuhn-Tucker) conditions and the Sherman-Morrison-Woodbury formula, embodiments of the invention use an iterative algorithm to solve an unconstrained optimization problem, which makes WPSVM suitable for classifying relatively high-dimensional data.
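The abstract's iterative, unconstrained formulation can be illustrated with a toy solver. The sketch below minimizes a weighted proximal-SVM objective, ½‖w‖² + C/2 · Σᵢ δᵢ²(1 − yᵢxᵢᵀw)², by plain gradient descent. The exact objective form, the NumPy dependency, and gradient descent standing in for the patent's (unspecified here) iterative algorithm are all assumptions:

```python
import numpy as np

def wpsvm_fit(X, y, delta, C=1.0, lr=0.01, iters=500):
    """Gradient-descent sketch of a weighted proximal SVM.

    Minimizes  1/2 * ||w||^2 + C/2 * sum_i delta_i^2 * (1 - y_i * x_i.w)^2,
    i.e. squared (proximal) errors weighted per training example.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)               # y_i * (x_i . w)
        resid = delta**2 * (margins - 1.0)  # weighted proximal residuals
        grad = w + C * (X.T @ (resid * y))  # gradient of the objective
        w -= lr * grad
    return w
```

Because the objective is an unconstrained convex quadratic, any first-order or conjugate-gradient method converges; the squared penalty is what distinguishes the proximal variant from a standard hinge-loss SVM.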
20 Citations
20 Claims
1. A system for training a text classifier, the system comprising:
a memory storing computer-executable instructions that implement:
a text data preprocessor that preprocesses raw training text to produce an input matrix, the raw training text including documents and indications of whether each document is a positive or a negative training example of a classification; and
a module for solving a weighted proximal support vector machine equation, comprising:
a weighting module that generates a weighted matrix by re-weighting the input matrix based on how many training examples are positive and how many training examples are negative, wherein the weighting is based on satisfying the following equation:
N₊δ₊² = N₋δ₋²
where N₊ and N₋ denote the numbers of positive and negative training examples and δ₊ and δ₋ denote the weights of the positive and negative training examples; and
a model-vector generator that iteratively calculates a model vector based on the weighted matrix using a proximal support vector machine model; and
a processor for executing the computer-executable instructions stored in the memory.
(Dependent claims: 2, 3, 4)
5. A system for classifying text, the system comprising:
a memory storing computer-executable instructions that implement:
a text data preprocessor that preprocesses raw text to be classified to produce a vector representation of the text, the raw training text including documents and indications of whether each document is a positive or a negative training example of a classification;
a model-vector reader that reads a model vector, the model vector being generated by solving a weighted proximal support vector machine equation by generating a weighted matrix by re-weighting the input matrix based on how many training examples are positive and how many training examples are negative, wherein the weighting is based on satisfying the following equation:
N₊δ₊² = N₋δ₋²
where N₊ and N₋ denote the numbers of positive and negative training examples and δ₊ and δ₋ denote the weights of the positive and negative training examples, and iteratively calculating the model vector based on the weighted matrix using a proximal support vector machine model;
a classifier that generates a classification result based on the vector representation of the text and based on the read model vector; and
a processor for executing the computer-executable instructions stored in the memory.
(Dependent claims: 6, 7, 8, 9)
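At prediction time, the classifier of claim 5 reduces to a dot product between the document's vector representation and the read model vector, with the sign giving the class. A minimal sketch; the explicit `bias` parameter is an assumption, since some proximal-SVM formulations instead fold the bias into w via a constant feature:

```python
import numpy as np

def classify(x_vec, model_w, bias=0.0):
    """Label a document's feature vector using a learned model vector.

    The document is classified as positive when its signed distance to
    the separating plane, x.w + b, is non-negative, else negative.
    """
    score = float(np.dot(x_vec, model_w) + bias)
    return 1 if score >= 0 else -1
```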
10. A computer-readable medium containing computer-executable instructions for training a text classifier and classifying text by performing steps comprising:
representing input training text as a sparse matrix, the raw training text including documents and indications of whether each document is a positive or a negative training example of a classification;
setting a plurality of classifier-training parameters;
iteratively solving a weighted proximal support vector machine equation by generating a weighted matrix by re-weighting the input matrix based on how many training examples are positive and how many training examples are negative, wherein the weighting is based on satisfying the following equation:
N₊δ₊² = N₋δ₋²
where N₊ and N₋ denote the numbers of positive and negative training examples and δ₊ and δ₋ denote the weights of the positive and negative training examples, and iteratively calculating a model vector based on the weighted matrix using a proximal support vector machine model; and
predicting respective classes for a plurality of test examples.
(Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
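The first step of claim 10, representing input training text as a sparse matrix, can be sketched as a bag-of-words mapping from term indices to counts; only nonzero entries are stored per row. This is an illustrative preprocessing stand-in, not the patented method, and the function and variable names are assumptions:

```python
def to_sparse_rows(docs):
    """Represent whitespace-tokenized documents as sparse bag-of-words rows.

    Each row is a dict mapping a term index to that term's count in the
    document, mirroring a sparse matrix; the vocabulary is built on the
    fly so the column space grows with the corpus.
    """
    vocab = {}
    rows = []
    for doc in docs:
        row = {}
        for term in doc.split():
            j = vocab.setdefault(term, len(vocab))  # assign next free column
            row[j] = row.get(j, 0) + 1              # count only nonzeros
        rows.append(row)
    return rows, vocab

rows, vocab = to_sparse_rows(["good great good", "bad awful"])
```

Text corpora are dominated by absent terms, which is why the claim's sparse representation keeps both memory and the iterative solver's matrix-vector products proportional to the number of nonzeros.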
Specification