Text categorization toolkit

US 6,212,532 B1
Filed: 10/22/1998
Issued: 04/03/2001
Est. Priority Date: 10/22/1998
Status: Expired due to Fees

Abstract

A module information extraction system capable of extracting information from natural language documents. The system includes a plurality of interchangeable modules including a data preparation module for preparing a first set of raw data having class labels to be tested, the data preparation module being selected from a first type of the interchangeable modules. The system further includes a feature extraction module for extracting features from the raw data received from the data preparation module and storing the features in a vector format, the feature extraction module being selected from a second type of the interchangeable modules. A core classification module is also provided for applying a learning algorithm to the stored vector format and producing therefrom a resulting classifier, the core classification module being selected from a third type of the interchangeable modules. A testing module compares the resulting classifier to a set of preassigned classes, where the testing module is selected from a fourth type of the interchangeable modules, and where the testing module tests a second set of raw data having class labels received by the data preparation module to determine the degree to which the class labels of the second set of raw data approximately correspond to the resulting classifier.
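
The four module types named in the abstract can be pictured as functions with fixed interfaces, so that any one stage can be swapped without touching the others. The following is a minimal Python sketch under that assumption, not the patented implementation: the names `prepare`, `extract`, `train`, `evaluate`, and `run_pipeline`, the word-count features, and the overlap-based learner are all hypothetical stand-ins chosen for illustration.

```python
from collections import Counter
from typing import Callable

# Hypothetical type aliases for illustration; these names are not from the patent.
Document = tuple[str, str]             # (raw text, class label)
Vector = dict[str, int]                # sparse feature vector: feature -> count
Classifier = Callable[[Vector], str]   # maps a feature vector to a class label


def prepare(raw: list[Document]) -> list[Document]:
    """Data preparation: lowercase, collapse whitespace, drop empty documents."""
    return [(" ".join(text.lower().split()), label)
            for text, label in raw if text.strip()]


def extract(docs: list[Document]) -> list[tuple[Vector, str]]:
    """Feature extraction: word counts stored in a sparse vector format."""
    return [(dict(Counter(text.split())), label) for text, label in docs]


def train(vectors: list[tuple[Vector, str]]) -> Classifier:
    """Core classification (toy learner): sum the counts per class and
    predict the class whose profile overlaps most with the input vector."""
    profiles: dict[str, Counter] = {}
    for vec, label in vectors:
        profiles.setdefault(label, Counter()).update(vec)

    def classify(vec: Vector) -> str:
        # sorted() makes tie-breaking deterministic (alphabetical by class).
        return max(sorted(profiles),
                   key=lambda c: sum(min(n, profiles[c][w]) for w, n in vec.items()))

    return classify


def evaluate(classifier: Classifier, vectors: list[tuple[Vector, str]]) -> float:
    """Testing: fraction of held-out documents whose preassigned class label
    matches the classifier's prediction."""
    return sum(classifier(vec) == label for vec, label in vectors) / len(vectors)


def run_pipeline(train_raw, test_raw,
                 prepare=prepare, extract=extract, train=train, evaluate=evaluate):
    """Wire the four stages together; each stage is a replaceable argument."""
    classifier = train(extract(prepare(train_raw)))
    return evaluate(classifier, extract(prepare(test_raw)))


# Made-up sample data for the sketch.
train_raw = [
    ("Stocks rally on earnings", "finance"),
    ("Bank raises interest rates", "finance"),
    ("Team wins the final match", "sports"),
    ("Striker scores in the match", "sports"),
]
test_raw = [
    ("Interest rates and stocks", "finance"),
    ("The match was a final", "sports"),
]
score = run_pipeline(train_raw, test_raw)
```

Replacing one stage, for example passing a different `extract` function that emits word pairs instead of single words, leaves the other three stages unchanged, which is the sense of "interchangeable" assumed here.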

16 Claims

1. A module information extraction system capable of extracting information from natural language documents, the system including a plurality of interchangeable modules, the system comprising:
- a data preparation module for preparing a first set of raw data having class labels to be tested, the data preparation module being selected from a first type of the interchangeable modules;
  
  a feature extraction module for extracting features from the raw data received from the data preparation module and storing the features in a vector format, the feature extraction module being selected from a second type of the interchangeable modules;
  
  a core classification module for applying a learning algorithm to the stored vector format and producing therefrom a resulting classifier, the core classification module being selected from a third type of the interchangeable modules; and
  
  a testing module for comparing the resulting classifier to a set of preassigned classes, the testing module being selected from a fourth type of the interchangeable modules, wherein the testing module tests a second set of raw data having class labels received by the data preparation module to determine whether the class labels of the second set of raw data correspond to the resulting classifier.
2. The module information extraction system of claim 1, further comprising a module for storing the data in vector format after applying the learning algorithm to the vector format.
3. The module information extraction system of claim 1, wherein:
4. The module information extraction system of claim 1, further comprising tool means for assisting with the preparation of the raw data for training, wherein the tool means is selected from a fifth type of the interchangeable modules and includes at least (i) filtering means for filtering out documents from the raw data before training and (ii) random splitting means for randomly splitting the raw data into a training set and a testing set of raw data.
5. The module information extraction system of claim 1, wherein the feature extraction module includes:
- a feature definition module for extracting the first set of raw data received from the data preparation module and preparing a feature count table in vector format; and
  
  a feature selection module for reducing or altering the feature count table in vector format based on user specified selection criteria, wherein the feature definition module and the feature selection module are individually driven.
6. The module information extraction system of claim 5, wherein:
- the feature selection module consults the feature count table in vector format when features of the first set of raw data are selected by the user and prepares the reduced feature count table in vector format based on the user specified selection criteria, and the core classification module consults the reduced feature count table in vector format when applying the learning algorithm.
7. The module information extraction system of claim 5, wherein:
- the testing module tests the second set of raw data having class labels received by the data preparation module to determine whether the class labels of the second set of raw data correspond to the resulting classifier, and when the resulting classifier does not approximately coincide with the set of preassigned classes in the second set of raw data, the reduced feature count table is updated by user specified selection criteria and the core classification module reapplies the learning algorithm to the updated reduced feature count table in vector format in order for the resulting classifier to approximately coincide with the set of preassigned classes in the second set of data.
8. The module information extraction system of claim 5, wherein the feature definition module further performs stemming, abbreviation expansion, names or term extraction, thereby defining the features to be used.
9. The module information extraction system of claim 1, wherein the feature extraction module further counts the frequencies of features in the raw data and provides the frequencies of the features in a feature count table.
10. The module information extraction system of claim 9, wherein the feature count table provides information so that successive training does not have to revisit the raw data.
11. The module information extraction system of claim 1, further comprising configuration means for storing documentation on how the classifiers are created by the core classification module in order to generate and retrieve a feature count table file.
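
Claims 4, 5, 9 and 10 together describe random splitting, a feature count table, user-driven feature selection, and reuse of the table so that later training need not revisit the raw data. One way to picture this, offered as a sketch rather than the claimed design, is shown below. The function names, the JSON file layout, and the minimum-document-frequency criterion are assumptions made for the example; the claims do not specify any of them.

```python
import json
import random
import tempfile
from collections import Counter
from pathlib import Path


def random_split(docs, test_fraction=0.25, seed=0):
    """Randomly split labelled documents into a training set and a testing set."""
    shuffled = docs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def build_count_table(docs):
    """Feature definition: one row per document holding its label and the
    frequency of each word feature."""
    return [{"label": label, "counts": dict(Counter(text.lower().split()))}
            for text, label in docs]


def select_features(table, min_doc_freq=2):
    """Feature selection (assumed criterion): keep only features that occur
    in at least `min_doc_freq` documents."""
    doc_freq = Counter(f for row in table for f in row["counts"])
    keep = {f for f, n in doc_freq.items() if n >= min_doc_freq}
    return [{"label": row["label"],
             "counts": {f: c for f, c in row["counts"].items() if f in keep}}
            for row in table]


def save_table(table, path):
    """Persist the count table so later runs can skip the raw text."""
    Path(path).write_text(json.dumps(table), encoding="utf-8")


def load_table(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Made-up sample data for the sketch.
docs = [
    ("stocks rise as bank profits rise", "finance"),
    ("bank lifts rates as stocks fall", "finance"),
    ("rates fall as bank cuts", "finance"),
    ("stocks fall on bank fears", "finance"),
    ("team wins match as striker scores", "sports"),
    ("striker injured as team loses match", "sports"),
    ("team signs new striker", "sports"),
    ("match ends as team wins", "sports"),
]

train_docs, test_docs = random_split(docs)

# Build the table once, store it, then derive a reduced table from the stored
# copy without reading the raw documents again.
with tempfile.TemporaryDirectory() as tmp:
    table_path = Path(tmp) / "counts.json"
    save_table(build_count_table(train_docs), table_path)
    reduced = select_features(load_table(table_path), min_doc_freq=2)
```

Because `select_features` reads only the stored table, trying a different selection threshold costs one pass over the table rather than a second pass over the documents.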

12. A method for extracting information from natural language documents, comprising:
- preparing a first set of raw data having class labels to be tested;
  
  extracting features from the raw data and storing the features in a vector format;
  
  applying a learning algorithm to the stored vector format and producing therefrom a resulting classifier;
  
  comparing the resulting classifier to a set of preassigned classes;
  
  testing a second set of raw data having class labels to determine whether the class labels of the second set of raw data correspond to the resulting classifier in the second set of raw data; and
  
  when the resulting classifier does not coincide with the set of preassigned classes in the second set of raw data, updating the reduced feature count table by user specified selection criteria and reapplying the learning algorithm to the updated reduced feature count table in vector format in order for the resulting classifier to approximately coincide with the set of preassigned classes in the second set of data.
13. The method of claim 12, wherein:
14. The method of claim 12, wherein preparing the vector format includes:
- extracting the first set of raw data received from the data preparation module and preparing a feature count table in vector format; and
  
  preparing a reduced feature count table in vector format based on the user specified selection criteria, wherein the extracting and preparing are individually driven.
15. The method of claim 14, further comprising consulting the reduced feature count table in vector format when applying the learning algorithm.
16. The method of claim 14, wherein the feature count table in vector format is prepared one time and separate reduced feature count tables in vector format are prepared based on the feature count table in vector format when the resulting classifier does not approximately coincide with the set of preassigned classes.
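
The retraining loop of claims 12 and 16 (prepare the full feature count table one time, derive separate reduced tables from it, and reapply the learning algorithm until the classifier approximately agrees with the preassigned classes) can be sketched as follows. This is an illustration under stated assumptions, not the method as implemented by the inventors: the toy overlap learner, the `criteria` sequence standing in for user specified selection criteria, and the `target` accuracy standing in for "approximately coincide" are all hypothetical.

```python
from collections import Counter


def count_table(docs):
    """Prepared one time from the raw text: (feature counts, label) per document."""
    return [(Counter(text.lower().split()), label) for text, label in docs]


def reduce_table(table, min_doc_freq):
    """Derive a reduced table from the full table, never from the raw text."""
    doc_freq = Counter(f for counts, _ in table for f in counts)
    return [(Counter({f: n for f, n in counts.items() if doc_freq[f] >= min_doc_freq}),
             label)
            for counts, label in table]


def train(table):
    """Toy stand-in learner: per-class summed counts, predict by largest overlap."""
    profiles = {}
    for counts, label in table:
        profiles.setdefault(label, Counter()).update(counts)

    def classify(counts):
        # sorted() makes ties resolve alphabetically by class name.
        return max(sorted(profiles),
                   key=lambda c: sum(min(n, profiles[c][f]) for f, n in counts.items()))

    return classify


def accuracy(classifier, table):
    """Share of documents whose preassigned label matches the prediction."""
    return sum(classifier(counts) == label for counts, label in table) / len(table)


def fit_until_acceptable(train_docs, test_docs, criteria=(3, 2, 1), target=0.9):
    """Try each selection criterion in turn; stop once the target is reached.
    Returns (best score, criterion that produced it, classifier)."""
    full = count_table(train_docs)        # built once, reused for every attempt
    held_out = count_table(test_docs)
    best = None
    for min_doc_freq in criteria:
        classifier = train(reduce_table(full, min_doc_freq))
        score = accuracy(classifier, held_out)
        if best is None or score > best[0]:
            best = (score, min_doc_freq, classifier)
        if score >= target:
            break
    return best


# Made-up sample data for the sketch.
train_docs = [
    ("stocks rise as bank profits rise", "finance"),
    ("bank lifts rates as stocks fall", "finance"),
    ("team wins match as striker scores", "sports"),
    ("striker injured as team loses match", "sports"),
]
test_docs = [("bank rates fall", "finance"), ("team wins", "sports")]

score, chosen, classifier = fit_until_acceptable(train_docs, test_docs)
```

In this sample the strictest criterion keeps too few features to separate the classes, so the loop moves on to the next criterion and retrains from the same full table.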

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Johnson, David B.; Hampp-Bahamueller, Thomas
Primary Examiner(s): Hong, Stephen S.
Application Number: US09/176,322
Time in Patent Office: 894 Days
Field of Search: 707/500, 707/501, 707/530, 707/3, 707/4, 707/5, 382/176
US Class Current: 715/236
CPC Class Codes:
G06F 16/355 Class or cluster creation o...
Y10S 707/99933 Query processing, i.e. sear...
