Text categorization toolkit
Abstract
A module information extraction system capable of extracting information from natural language documents. The system includes a plurality of interchangeable modules including a data preparation module for preparing a first set of raw data having class labels to be tested, the data preparation module being selected from a first type of the interchangeable modules. The system further includes a feature extraction module for extracting features from the raw data received from the data preparation module and storing the features in a vector format, the feature extraction module being selected from a second type of the interchangeable modules. A core classification module is also provided for applying a learning algorithm to the stored vector format and producing therefrom a resulting classifier, the core classification module being selected from a third type of the interchangeable modules. A testing module compares the resulting classifier to a set of preassigned classes, where the testing module is selected from a fourth type of the interchangeable modules, and where the testing module tests a second set of raw data having class labels received by the data preparation module to determine the degree to which the class labels of the second set of raw data approximately correspond to the resulting classifier.
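As an illustration only, the four module types named in the abstract can be pictured as four swappable callables chained by a small driver. The sketch below is not taken from the patent: the function names, the every-other-document hold-out, the whitespace tokenizer, and the toy count-based learner are all assumptions chosen to keep the example short.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Document = Tuple[str, str]            # (text, class label)
Vector = Dict[str, int]               # sparse feature vector: feature -> count
Classifier = Callable[[Vector], str]  # maps a vector to a predicted label


def prepare(raw: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Data preparation (assumed): hold out every other document for testing."""
    return raw[0::2], raw[1::2]


def extract(text: str) -> Vector:
    """Feature extraction (assumed): lowercase word counts in vector form."""
    return dict(Counter(text.lower().split()))


def train(examples: List[Tuple[Vector, str]]) -> Classifier:
    """Core classification (assumed): a toy learner that sums counts per class."""
    totals: Dict[str, Counter] = {}
    for vector, label in examples:
        totals.setdefault(label, Counter()).update(vector)

    def classify(vector: Vector) -> str:
        # Score each class by how often it has seen the document's features.
        return max(
            totals,
            key=lambda label: sum(totals[label][f] * n for f, n in vector.items()),
        )

    return classify


def evaluate(classifier: Classifier, examples: List[Tuple[Vector, str]]) -> float:
    """Testing: fraction of held-out labels that the classifier reproduces."""
    hits = sum(classifier(vector) == label for vector, label in examples)
    return hits / len(examples)


def run_pipeline(raw, prepare_fn, extract_fn, train_fn, evaluate_fn) -> float:
    """Chain one module of each type; any stage can be replaced independently."""
    train_docs, test_docs = prepare_fn(raw)
    vectorize = lambda docs: [(extract_fn(text), label) for text, label in docs]
    classifier = train_fn(vectorize(train_docs))
    return evaluate_fn(classifier, vectorize(test_docs))
```

Because the driver only depends on each stage's inputs and outputs, a different tokenizer or learner could be passed in without touching the other three stages.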
16 Claims
1. A module information extraction system capable of extracting information from natural language documents, the system including a plurality of interchangeable modules, the system comprising:
a data preparation module for preparing a first set of raw data having class labels to be tested, the data preparation module being selected from a first type of the interchangeable modules;
a feature extraction module for extracting features from the raw data received from the data preparation module and storing the features in a vector format, the feature extraction module being selected from a second type of the interchangeable modules;
a core classification module for applying a learning algorithm to the stored vector format and producing therefrom a resulting classifier, the core classification module being selected from a third type of the interchangeable modules; and
a testing module for comparing the resulting classifier to a set of preassigned classes, the testing module being selected from a fourth type of the interchangeable modules, wherein the testing module tests a second set of raw data having class labels received by the data preparation module to determine whether the class labels of the second set of raw data correspond to the resulting classifier.
the data preparation module splits the data randomly according to a user specification into the first set of raw data and the second set of raw data, and the first set of raw data is submitted to the feature extraction module and the second set of data is submitted to the testing module.
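A minimal sketch of such a random split follows, assuming the user specification is a single training fraction. The name `random_split`, the `train_fraction` parameter, and the `seed` parameter are hypothetical and do not come from the claims.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    data: Sequence[T], train_fraction: float, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the data and cut it into (first set, second set)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(data)
    # A seeded generator makes the split repeatable (an assumption, not a claim element).
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

The first returned list would go to feature extraction and the second to testing; every input item lands in exactly one of the two.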
4. The module information extraction system of claim 1, further comprising tool means for assisting with the preparation of the raw data for training, wherein the tool means is selected from a fifth type of the interchangeable modules and includes at least (i) filtering means for filtering out documents from the raw data before training and (ii) random splitting means for randomly splitting the raw data into a training set and a testing set of raw data.
5. The module information extraction system of claim 1, wherein the feature extraction module includes:
a feature definition module for extracting the first set of raw data received from the data preparation module and preparing a feature count table in vector format; and
a feature selection module for reducing or altering the feature count table in vector format based on user specified selection criteria, wherein the feature definition module and the feature selection module are individually driven.
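One way to picture the two individually driven steps is a pair of functions: one builds the feature count table from labeled documents, the other derives a reduced table from it without mutating the original. The sketch below is an assumption-laden illustration: the table layout (class label to feature counts), the whitespace tokenizer, and the `min_count` criterion are not specified by the claim.

```python
from collections import Counter
from typing import Dict, List, Tuple

CountTable = Dict[str, Counter]  # class label -> feature -> count (assumed layout)


def define_features(docs: List[Tuple[str, str]]) -> CountTable:
    """Feature definition: build the count table from (text, label) pairs."""
    table: CountTable = {}
    for text, label in docs:
        table.setdefault(label, Counter()).update(text.lower().split())
    return table


def select_features(table: CountTable, min_count: int) -> CountTable:
    """Feature selection: keep features seen at least min_count times overall."""
    totals: Counter = Counter()
    for counts in table.values():
        totals.update(counts)
    keep = {feature for feature, n in totals.items() if n >= min_count}
    # Build a new table so the full count table stays available for later runs.
    return {
        label: Counter({f: n for f, n in counts.items() if f in keep})
        for label, counts in table.items()
    }
```

Since `select_features` reads only the count table, it can be rerun with different criteria without going back to the documents.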
6. The module information extraction system of claim 5, wherein:
the feature selection module consults the feature count table in vector format when features of the first set of raw data are selected by the user and prepares the reduced feature count table in vector format based on the user specified selection criteria, and the core classification module consults the reduced feature count table in vector format when applying the learning algorithm.
7. The module information extraction system of claim 5, wherein:
the testing module tests the second set of raw data having class labels received by the data preparation module to determine whether the class labels of the second set of raw data correspond to the resulting classifier, and when the resulting classifier does not approximately coincide with the set of preassigned classes in the second set of raw data, the reduced feature count table is updated by user specified selection criteria and the core classification module reapplies the learning algorithm to the updated reduced feature count table in vector format in order for the resulting classifier to approximately coincide with the set of preassigned classes in the second set of data.
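The test-and-update cycle described here can be sketched as a loop over successive selection criteria. Everything in the sketch is an assumption made for illustration: the criteria are modeled as a sequence of per-class minimum counts, "approximately coincide" is modeled as an accuracy target, and the learner is a toy whose model is simply the reduced count table, so reapplying it amounts to scoring with the new table.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

CountTable = Dict[str, Counter]  # class label -> feature -> count (assumed layout)


def reduce_table(table: CountTable, min_count: int) -> CountTable:
    """Drop, within each class, features counted fewer than min_count times."""
    return {
        label: Counter({f: n for f, n in counts.items() if n >= min_count})
        for label, counts in table.items()
    }


def classify(table: CountTable, text: str) -> str:
    """Toy learner: pick the class whose counts best cover the document's words."""
    words = text.lower().split()
    return max(table, key=lambda label: sum(table[label][w] for w in words))


def accuracy(table: CountTable, test_docs: List[Tuple[str, str]]) -> float:
    """Share of held-out documents whose preassigned label is reproduced."""
    hits = sum(classify(table, text) == label for text, label in test_docs)
    return hits / len(test_docs)


def retrain_until_acceptable(
    table: CountTable,
    test_docs: List[Tuple[str, str]],
    criteria: Iterable[int],
    target: float,
) -> Optional[Tuple[int, float]]:
    """Try each criterion in turn; return the first (criterion, accuracy) meeting target."""
    for min_count in criteria:
        reduced = reduce_table(table, min_count)  # derived from the stored counts
        score = accuracy(reduced, test_docs)
        if score >= target:
            return min_count, score
    return None  # no criterion in the sequence reached the target
```

Each pass starts again from the full count table, so the raw training documents are never reread.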
8. The module information extraction system of claim 5, wherein the feature definition module further performs stemming, abbreviation expansion, names or term extraction, thereby defining the features to be used.
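For illustration, a feature definition step with abbreviation expansion and stemming might look like the sketch below. The abbreviation table and the suffix list are hypothetical, and the suffix stripper is a deliberately crude stand-in for a real stemming algorithm; names or term extraction is not shown.

```python
import re
from typing import Dict, List

# Hypothetical abbreviation table; a real system would load its own.
ABBREVIATIONS: Dict[str, str] = {"dept": "department", "mgr": "manager"}
# Crude suffix list, checked in order; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def define_features(text: str, abbreviations: Dict[str, str] = ABBREVIATIONS) -> List[str]:
    """Tokenize, expand known abbreviations, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(abbreviations.get(token, token)) for token in tokens]
```

Expanding abbreviations before stemming means that "dept" and "departments" would map to the same feature.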
9. The module information extraction system of claim 1, wherein the feature extraction module further counts the frequencies of features in the raw data and provides the frequencies of the features in a feature count table.
10. The module information extraction system of claim 9, wherein the feature count table provides information so that successive training does not have to revisit the raw data.
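One plausible way to avoid revisiting the raw data is to persist the feature count table once and reload it for each later training run. The sketch below assumes a JSON file and a class-label-to-counts layout; neither the file format nor the function names come from the patent.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict

CountTable = Dict[str, Counter]  # class label -> feature -> count (assumed layout)


def save_count_table(table: CountTable, path: Path) -> None:
    """Write the table once, after the raw data has been counted."""
    serializable = {label: dict(counts) for label, counts in table.items()}
    path.write_text(json.dumps(serializable, sort_keys=True), encoding="utf-8")


def load_count_table(path: Path) -> CountTable:
    """Reload the table for a later training run; the raw data is not needed."""
    stored = json.loads(path.read_text(encoding="utf-8"))
    return {label: Counter(counts) for label, counts in stored.items()}
```

A later run would call `load_count_table` and apply new selection criteria to the result, leaving the original documents untouched.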
11. The module information extraction system of claim 1, further comprising configuration means for storing documentation on how the classifiers are created by the core classification module in order to generate and retrieve a feature count table file.
12. A method for extracting information from natural language documents, comprising:
preparing a first set of raw data having class labels to be tested;
extracting features from the raw data and storing the features in a vector format;
applying a learning algorithm to the stored vector format and producing therefrom a resulting classifier;
comparing the resulting classifier to a set of preassigned classes;
testing a second set of raw data having class labels to determine whether the class labels of the second set of raw data correspond to the resulting classifier in the second set of raw data; and
when the resulting classifier does not coincide with the set of preassigned classes in the second set of raw data, updating the reduced feature count table by user specified selection criteria and reapplying the learning algorithm to the updated reduced feature count table in vector format in order for the resulting classifier to approximately coincide with the set of preassigned classes in the second set of data.
splitting the data randomly according to a user specification into the first set of raw data and the second set of raw data, submitting the first set of raw data to a feature extraction module for feature extraction, and submitting the second set of data to a testing module for the testing.
14. The method of claim 12, wherein preparing the vector format includes:
extracting the first set of raw data received from the data preparation module and preparing a feature count table in vector format; and
preparing a reduced feature count table in vector format based on the user specified selection criteria, wherein the extracting and preparing are individually driven.
15. The method of claim 14, further comprising consulting the reduced feature count table in vector format when applying the learning algorithm.
16. The method of claim 14, wherein the feature count table in vector format is prepared one time and separate reduced feature count tables in vector format are prepared based on the feature count table in vector format when the resulting classifier does not approximately coincide with the set of preassigned classes.
Specification