System and method of feature selection for text classification using subspace sampling
Abstract
An improved system and method is provided for feature selection for text classification using subspace sampling. A text classifier generator may be provided for selecting a small set of features using subspace sampling from the corpus of training data to train a text classifier for using the small set of features for classification of texts. To select the small set of features, a subspace of features from the corpus of training data may be randomly sampled according to a probability distribution over the set of features where a probability may be assigned to each of the features that is proportional to the square of the Euclidean norms of the rows of left singular vectors of a matrix of the features representing the corpus of training texts. The small set of features may classify texts using only the relevant features among a very large number of training features.
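The sampling distribution described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation. It assumes a term-document matrix `A` with one row per feature and one column per training text, so that the rows of the left singular vectors index features. The rank parameter `k`, the number of draws `r`, the choice to sample with replacement, and the function names are all assumptions introduced here.

```python
import numpy as np


def subspace_sampling_probabilities(A, k):
    """Return one sampling probability per feature (row of A).

    A is assumed to be (n_features, n_texts). Each probability is
    proportional to the squared Euclidean norm of the matching row of
    the top-k left singular vectors of A.
    """
    # Thin SVD: the columns of U are the left singular vectors.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    # Squared Euclidean norm of each row of U_k.
    row_norms_sq = np.sum(U_k ** 2, axis=1)
    # Normalize so the values form a probability distribution.
    return row_norms_sq / row_norms_sq.sum()


def sample_features(A, k, r, seed=0):
    """Draw r feature indices (with replacement) and return the distinct ones."""
    rng = np.random.default_rng(seed)
    p = subspace_sampling_probabilities(A, k)
    drawn = rng.choice(A.shape[0], size=r, replace=True, p=p)
    return np.unique(drawn)
```

A feature whose row in `A` is entirely zero contributes nothing to the leading singular vectors, so it receives probability (numerically) zero and is never selected, while features that dominate the top-`k` subspace are drawn most often.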
12 Claims
1. A computer system for classification, comprising:
- a processor device configured to operate as a text classifier using a plurality of features selected by subspace sampling from a corpus of training data for classification of a document;
- selecting a subset from the plurality of features by subspace sampling comprises using a probability distribution over a plurality of features from a corpus of training texts, the probability distribution having a probability assigned to each of the plurality of features that is proportional to a square of Euclidean norms of a plurality of rows of a plurality of left singular vectors of a matrix of the plurality of features representing the corpus of training texts; and
- a storage operably coupled to the text classifier for storing a plurality of texts classified using the plurality of features selected by subspace sampling into a plurality of classes.
Dependent claims: 2, 3
4. A computer-implemented method for classification, comprising:
- using an input/output device receiving a text represented by a plurality of features for classification; and
- using a processor device configured to perform;
- selecting a subset from the plurality of features by subspace sampling using a probability distribution over a plurality of features from a corpus of training texts, the probability distribution having a probability assigned to each of the plurality of features that is proportional to a square of Euclidean norms of a plurality of rows of a plurality of left singular vectors of a matrix of the plurality of features representing the corpus of training texts;
- classifying the text using the subset of the plurality of features; and
- outputting the classification of the text classified using the subset of the plurality of features selected by subspace sampling.
Dependent claims: 5, 6, 7, 8, 9, 10, 11
12. A non-transitory computer-readable medium having computer executable instructions for performing steps of:
- receiving a text represented by a plurality of features for classification;
- selecting a subset from the plurality of features by subspace sampling using a probability distribution over a plurality of features from a corpus of training texts, the probability distribution having a probability assigned to each of the plurality of features that is proportional to a square of Euclidean norms of a plurality of rows of a plurality of left singular vectors of a matrix of the plurality of features representing the corpus of training texts;
- classifying the text using the subset of the plurality of features;
- storing a plurality of texts classified using the plurality of features selected by subspace sampling into a plurality of classes; and
- outputting the classification of the text classified using the subset of the plurality of features selected by subspace sampling.
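The classifying and outputting steps recited in the method and medium claims can be illustrated once a feature subset has been chosen. The claims do not name a particular classifier, so the sketch below substitutes a simple nearest-centroid rule purely for illustration. The function names, the `selected` index array (taken as given here, produced by whatever subspace sampling step precedes it), and the features-by-texts matrix layout are assumptions, not details from the patent.

```python
import numpy as np


def train_centroids(A, labels, selected):
    """Compute one centroid per class using only the selected features.

    A is assumed to be (n_features, n_texts); labels has one entry per
    training text; selected holds the indices of the sampled features.
    """
    A_sel = A[selected, :]
    classes = np.unique(labels)
    # Mean reduced feature vector of the training texts in each class.
    centroids = np.stack([A_sel[:, labels == c].mean(axis=1) for c in classes])
    return classes, centroids


def classify(text_vector, selected, classes, centroids):
    """Return the class whose centroid is nearest in the reduced feature space."""
    x_sel = text_vector[selected]
    distances = np.linalg.norm(centroids - x_sel, axis=1)
    return classes[np.argmin(distances)]
```

A new text is represented over the full feature set, reduced to the selected indices, and assigned to the class whose centroid is closest; the returned label corresponds to the output of the classification.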
Specification