Text categorization with knowledge transfer from heterogeneous datasets
Abstract
The present invention provides a method for incorporating features from heterogeneous auxiliary datasets into input text data for use in classification. Heterogeneous auxiliary datasets, such as labeled datasets and unlabeled datasets, are accessed after receiving input text data. Features are extracted from each of the heterogeneous auxiliary datasets. The features are combined with the input text data to generate a set of features which may potentially be used to classify the input text data. Classification features are then extracted from the set of features and used to classify the input text data. In one embodiment, the classification features are extracted by calculating a mutual information value associated with each feature in the set of features and identifying features having a mutual information value exceeding a threshold value.
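The threshold-based selection described in the last sentence of the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each document is a set of feature strings, that mutual information is measured between a feature's presence and a class label on a small labeled sample, and that the function names (`mutual_information`, `select_features`) and the threshold of 0.5 bits are hypothetical choices made for the example.

```python
import math
from collections import Counter


def mutual_information(docs, labels, feature):
    """Mutual information (in bits) between the presence of `feature`
    in a document and the document's label.

    Assumption for this sketch: each doc is a set of feature strings.
    """
    n = len(docs)
    # Joint counts of (feature present?, label) and the two marginals.
    joint = Counter((feature in doc, label) for doc, label in zip(docs, labels))
    present = Counter(feature in doc for doc in docs)
    label_counts = Counter(labels)

    mi = 0.0
    for (is_present, label), count in joint.items():
        p_joint = count / n
        p_feature = present[is_present] / n
        p_label = label_counts[label] / n
        mi += p_joint * math.log2(p_joint / (p_feature * p_label))
    return mi


def select_features(docs, labels, candidates, threshold):
    """Keep the candidate features whose mutual information exceeds `threshold`."""
    return [f for f in candidates
            if mutual_information(docs, labels, f) > threshold]


# Toy labeled sample (hypothetical data).
docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"vote", "law"},
    {"vote", "team"},
]
labels = ["sports", "sports", "politics", "politics"]

selected = select_features(docs, labels, ["goal", "team", "vote", "match"],
                           threshold=0.5)
print(selected)
```

In this toy sample, "goal" and "vote" each determine the label completely, while "team" appears once in each class and carries no information about the label, so only the first two pass the threshold.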
24 Claims
1. A computer-implemented method for classifying input text data by obtaining information from multiple datasets, the method comprising the steps of:
receiving, at a computer, the input text data;
accessing, at the computer, a plurality of heterogeneous datasets, the plurality of heterogeneous datasets each including text data;
generating, at the computer, a set of features from the plurality of heterogeneous datasets, the set of features including one or more features from each of the plurality of heterogeneous datasets;
selecting, at the computer, one or more classification features from the set of features; and
generating, at the computer, an augmented input text data by combining the input text data and the one or more classification features; and
applying, at the computer, a classifier to the augmented input text data to associate the input text data with a category.
Dependent claims: 2-12.
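The sequence of steps recited in claim 1 can be sketched end to end as below. This is only a toy illustration under stated assumptions: the claim does not say how features are generated, selected, or classified, so the nearest-entry lookup by token overlap, the `LABEL=` and `CTX=` feature prefixes, the `keep` predicate, and the vote-counting `toy_classifier` are all hypothetical stand-ins, and the datasets are invented for the example.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer returning a set of tokens."""
    return set(text.lower().split())


def features_from_labeled(tokens, labeled):
    """Hypothetical feature generator for a labeled dataset: emit the label
    of the entry sharing the most tokens with the input."""
    best_text, best_label = max(labeled,
                                key=lambda item: len(tokens & tokenize(item[0])))
    return {"LABEL=" + best_label} if tokens & tokenize(best_text) else set()


def features_from_unlabeled(tokens, unlabeled):
    """Hypothetical feature generator for an unlabeled dataset: emit the
    extra words of the entry sharing the most tokens with the input."""
    best = max(unlabeled, key=lambda text: len(tokens & tokenize(text)))
    best_tokens = tokenize(best)
    if not tokens & best_tokens:
        return set()
    return {"CTX=" + word for word in best_tokens - tokens}


def classify(input_text, labeled, unlabeled, keep, classifier):
    tokens = tokenize(input_text)                          # receive the input text
    candidates = (features_from_labeled(tokens, labeled)   # generate features from
                  | features_from_unlabeled(tokens, unlabeled))  # each dataset
    selected = {f for f in candidates if keep(f)}          # select classification features
    augmented = tokens | selected                          # augment the input text
    return classifier(augmented)                           # apply the classifier


def toy_classifier(augmented):
    """Stand-in classifier: count matches against hand-written cue sets."""
    cues = {
        "cooking": {"LABEL=cooking", "CTX=cake", "oven"},
        "electronics": {"LABEL=electronics", "battery"},
    }
    return max(cues, key=lambda category: len(augmented & cues[category]))


# Two heterogeneous auxiliary datasets (invented data).
labeled = [("heat the oven before baking", "cooking"),
           ("charge the battery overnight", "electronics")]
unlabeled = ["the oven timer rings when the cake is done",
             "the phone battery drains quickly"]

category = classify(
    "preheat oven", labeled, unlabeled,
    keep=lambda f: f.startswith("LABEL=") or f == "CTX=cake",
    classifier=toy_classifier,
)
print(category)
```

The short input "preheat oven" carries little evidence on its own; the borrowed label and context features are what give the stand-in classifier additional cues, which is the role the augmented input text data plays in the claim.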
13. A computer program product, comprising a computer readable storage medium storing computer executable code for classifying input text data by obtaining information from multiple datasets, the computer executable code when executed by a processor causing a computer to perform the steps of:
receiving the input text data;
accessing a plurality of heterogeneous datasets, the plurality of heterogeneous datasets each including text data;
generating a set of features from the plurality of heterogeneous datasets, the set of features including one or more features from each of the plurality of heterogeneous datasets;
selecting one or more classification features from the set of features;
generating an augmented input text data by combining the input text data and the one or more classification features; and
applying a classifier to the augmented input text data to associate the input text data with a category.
Dependent claims: 14-24.
Specification