System and method of feature selection for text classification using subspace sampling

US 8,046,317 B2
Filed: 12/31/2007
Issued: 10/25/2011
Est. Priority Date: 12/31/2007
Status: Active Grant

Abstract

An improved system and method is provided for feature selection for text classification using subspace sampling. A text classifier generator may be provided for selecting a small set of features using subspace sampling from the corpus of training data to train a text classifier for using the small set of features for classification of texts. To select the small set of features, a subspace of features from the corpus of training data may be randomly sampled according to a probability distribution over the set of features where a probability may be assigned to each of the features that is proportional to the square of the Euclidean norms of the rows of left singular vectors of a matrix of the features representing the corpus of training texts. The small set of features may classify texts using only the relevant features among a very large number of training features.
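
The sampling step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation. It assumes the training corpus is a dense matrix `A` with one row per feature and one column per document. The function names `subspace_sampling_probs` and `sample_features`, the rank parameter `k`, and the target size `r` are hypothetical. The abstract fixes only the probabilities; keeping each feature independently with probability `min(1, r * p_i)` is one assumed way of turning them into a subset.

```python
import numpy as np

def subspace_sampling_probs(A, k=None):
    """Return one sampling probability per feature (row of A).

    A is assumed to be (n_features, n_docs). Each probability is
    proportional to the squared Euclidean norm of the matching row of
    the matrix of top-k left singular vectors of A.
    """
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    if k is None:
        # Assumption: use the numerical rank when no rank is given.
        k = int(np.sum(s > 1e-10 * s[0]))
    U_k = U[:, :k]
    scores = np.sum(U_k ** 2, axis=1)   # squared row norms
    return scores / scores.sum()        # normalize to a distribution

def sample_features(A, r, k=None, seed=0):
    """Randomly keep roughly r features according to the probabilities."""
    rng = np.random.default_rng(seed)
    p = subspace_sampling_probs(A, k)
    # Assumed scheme: independent coin flip per feature.
    keep = rng.random(p.shape[0]) < np.minimum(1.0, r * p)
    return np.flatnonzero(keep)

if __name__ == "__main__":
    A = np.random.default_rng(1).random((6, 4))
    A[2] = 0.0                          # a feature that never occurs
    print(subspace_sampling_probs(A))
    print(sample_features(A, r=3))
```

A feature whose row in `A` is entirely zero receives probability zero under this scheme, so it is effectively never selected. The number of features kept is random, with an expected value of at most `r`.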

12 Claims

1. A computer system for classification, comprising:
- a processor device configured to operate as a text classifier using a plurality of features selected by subspace sampling from a corpus of training data for classification of a document;
  
  selecting a subset from the plurality of features by subspace sampling comprises using a probability distribution over a plurality of features from a corpus of training texts, the probability distribution having a probability assigned to each of the plurality of features that is proportional to a square of Euclidean norms of a plurality of rows of a plurality of left singular vectors of a matrix of the plurality of features representing the corpus of training texts; and
  
  a storage operably coupled to the text classifier for storing a plurality of texts classified using the plurality of features selected by subspace sampling into a plurality of classes.
  - 2. The system of claim 1 further comprising a text classifier generator operably coupled to the storage for learning a classification function for each of the plurality of classes to train the text classifier for using the plurality of features selected by subspace sampling for classification of the document.
  - 3. The system of claim 1 further comprising a feature selector using subspace sampling operably coupled to the text classifier generator for selecting the plurality of features using subspace sampling from the corpus of training data for classification of the document.

4. A computer-implemented method for classification, comprising:
- using an input/output device receiving a text represented by a plurality of features for classification; and
  
  using a processor device configured to perform:
  
  selecting a subset from the plurality of features by subspace sampling using a probability distribution over a plurality of features from a corpus of training texts, the probability distribution having a probability assigned to each of the plurality of features that is proportional to a square of Euclidean norms of a plurality of rows of a plurality of left singular vectors of a matrix of the plurality of features representing the corpus of training texts;
  
  classifying the text using the subset of the plurality of features; and
  
  outputting the classification of the text classified using the subset of the plurality of features selected by subspace sampling.
  - 5. The method of claim 4 further comprising receiving a corpus of classified training texts represented by a plurality of features for classification of a document.
  - 6. The method of claim 4 further comprising outputting the subset of the plurality of features.
  - 7. The method of claim 4 wherein selecting the subset of the plurality of features from the plurality of features by subspace sampling comprises randomly sampling a subspace of the plurality of features using a probability distribution with a probability assigned to each of the plurality of features that is proportional to a square of Euclidean norms of a plurality of rows of a plurality of left singular vectors of a matrix of the plurality of features representing the corpus of training texts.
  - 8. The method of claim 4 wherein selecting the subset of the plurality of features from the plurality of features by subspace sampling comprises selecting the subset of the features from the randomly sampled subspace of the plurality of features using a probability distribution with a probability assigned to each of the plurality of features that is proportional to a square of Euclidean norms of a plurality of rows of a plurality of left singular vectors of a matrix of the plurality of features representing the corpus of training texts.
  - 9. The method of claim 4 wherein selecting the subset of the plurality of features from the plurality of features by subspace sampling further comprises defining a kernel matrix over the subset of the plurality of features selected from a sampled subspace of the plurality of features.
  - 10. The method of claim 4 wherein selecting the subset of the plurality of features from the plurality of features by subspace sampling further comprises determining an optimal vector representing the subset of the plurality of features that characterize a classification function using a kernel matrix defined over the subset of the plurality of features.
  - 11. The method of claim 4 further comprising storing an association of the text and a class.

12. A non-transitory computer-readable medium having computer executable instructions for performing steps of:
- receiving a text represented by a plurality of features for classification;
  
  selecting a subset from the plurality of features by subspace sampling using a probability distribution over a plurality of features from a corpus of training texts, the probability distribution having a probability assigned to each of the plurality of features that is proportional to a square of Euclidean norms of a plurality of rows of a plurality of left singular vectors of a matrix of the plurality of features representing the corpus of training texts;
  
  classifying the text using the subset of the plurality of features;
  
  storing a plurality of texts classified using the plurality of features selected by subspace sampling into a plurality of classes; and
  
  outputting the classification of the text classified using the subset of the plurality of features selected by subspace sampling.
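
Claims 9 and 10 refer to a kernel matrix defined over the selected features and to an optimal vector that characterizes the classification function, without fixing a particular learner. The sketch below assumes one common choice, regularized least-squares classification with a linear kernel. The function names `train_rlsc` and `classify`, the regularization parameter `lam`, and the +1/-1 label encoding are illustrative assumptions, not details taken from the claims.

```python
import numpy as np

def train_rlsc(A_sel, y, lam=1.0):
    """Fit a regularized least-squares classifier (an assumed learner).

    A_sel: (n_selected_features, n_docs) training matrix restricted to
           the sampled features.
    y:     (n_docs,) labels encoded as +1 / -1.
    Returns the vector x solving (K + lam * I) x = y, where K is the
    linear kernel matrix over the training documents.
    """
    K = A_sel.T @ A_sel                      # kernel over selected features
    n_docs = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n_docs), y)

def classify(A_sel, x, q_sel):
    """Label a new document q_sel given on the same selected features."""
    score = (A_sel.T @ q_sel) @ x            # kernel values times weights
    return 1 if score >= 0 else -1

if __name__ == "__main__":
    # Two selected features, four training documents, two classes.
    A_sel = np.array([[1.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    x = train_rlsc(A_sel, y)
    print(classify(A_sel, x, np.array([1.0, 0.0])))
    print(classify(A_sel, x, np.array([0.0, 1.0])))
```

Because the kernel is computed only from the sampled rows, the linear system has one unknown per training document regardless of how many features the full corpus contains.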

Current Assignee: R2 Solutions LLC (Acacia Research Corporation)
Original Assignee: Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors: Boulos Harb, Michael William Mahoney, Petros Drineas, Vanja Josifovski, Anirban Dasgupta
Primary Examiner(s): Jeffrey A Gaffin
Assistant Examiner(s): Lut Wong
Application Number: US12/006,178
Publication Number: US 20090171870A1
Time in Patent Office: 1,394 Days
Field of Search: 706/12, 706/45
US Class Current: 706/45
CPC Class Codes:
G06F 18/2115   by evaluating different sub...
G06F 18/214   Generating training pattern...
G06F 18/2411   based on the proximity to a...
