Distributed method for integrating data mining and text categorization techniques
Abstract
A method for prediction analysis using text categorization is provided. The method includes the steps of: grouping a plurality of text documents into a plurality of classes; selecting a top m most discriminatory terms for each class of documents using statistical based measures; determining for each document the presence or absence of each of the discriminatory terms; learning rule-based models of each class of documents using a rule learning algorithm; determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document; creating a database of the rules associated with documents satisfying the rules; and performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents.
17 Claims
1. A method for prediction analysis using text categorization, the method comprising the steps of:
grouping a plurality of text documents into a plurality of classes;
selecting a top m most discriminatory terms for each class of documents using statistical based measures;
determining for each document the presence or absence of each of the discriminatory terms;
learning rule-based models of each class of documents using a rule learning algorithm;
determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document;
creating a database of the rules associated with documents satisfying the rules; and
performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents.
Dependent claims: 2, 3, 4, 5, 6
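The term-selection and presence/absence steps of claim 1 can be illustrated with a minimal sketch. The claim does not name the statistical measure, so the score used here (in-class document frequency rate minus out-of-class rate) is an assumption chosen for brevity; the function names `top_m_terms` and `presence_vector` and the whitespace tokenizer are likewise hypothetical.

```python
from collections import Counter


def top_m_terms(docs_by_class, m):
    """Return, per class, the m highest-scoring terms.

    Assumption: the score is the in-class document-frequency rate minus the
    out-of-class rate. The claim only says "statistical based measures".
    """
    # Document frequency of each term inside each class.
    df = {
        c: Counter(t for d in docs for t in set(d.lower().split()))
        for c, docs in docs_by_class.items()
    }
    sizes = {c: len(docs) for c, docs in docs_by_class.items()}
    selected = {}
    for c in docs_by_class:
        n_out = sum(sizes[o] for o in sizes if o != c)
        scores = {}
        for term in df[c]:
            in_rate = df[c][term] / sizes[c]
            out_count = sum(df[o][term] for o in df if o != c)
            out_rate = out_count / n_out if n_out else 0.0
            scores[term] = in_rate - out_rate
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected[c] = ranked[:m]
    return selected


def presence_vector(doc, terms):
    """Binary vector: 1 if the term occurs in the document, else 0."""
    tokens = set(doc.lower().split())
    return [1 if t in tokens else 0 for t in terms]


docs_by_class = {
    "billing": ["refund charge invoice", "invoice late charge"],
    "network": ["router outage packet", "packet loss outage"],
}
terms = top_m_terms(docs_by_class, 2)
vector = presence_vector("late invoice from vendor", terms["billing"])
```

A rule learner would then consume these binary vectors; any real deployment would likely substitute a measure such as chi-square or information gain for the toy score above.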
7. A method for prediction analysis using text categorization, the method comprising the steps of:
providing a structured data table having a plurality of class labels;
grouping a plurality of text documents into classes based on the class labels;
selecting a top m most discriminatory terms having the highest calculated fitness measure for each class of documents;
determining for each document the presence or absence of each of the discriminatory terms;
determining at least one concept for each class, the concept being associated with the respective class;
determining, for at least a portion of the plurality of documents, if a given concept is associated with each respective document;
forming a numeric vector for each document indicating if the document is associated with each respective concept;
creating a structured data table of the vectors; and
performing distributed data mining on the structured data table to form a predictive result.
Dependent claims: 8, 9, 10, 11
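The vector-forming and table-building steps of claim 7 might look like the following sketch. It assumes, purely for illustration, that a concept is a set of terms that must all appear in a document; the claim does not define how a concept is represented. The names `concept_vector`, `build_table`, and the record fields are hypothetical.

```python
def concept_vector(text, concepts):
    """One 0/1 entry per concept, in the dict's insertion order.

    Assumption: a document is associated with a concept when every term
    of the concept occurs in the document.
    """
    tokens = set(text.lower().split())
    return [1 if required <= tokens else 0 for required in concepts.values()]


def build_table(records, concepts):
    """Join each record's class label with its concept vector."""
    header = ["doc_id", "label"] + list(concepts)
    rows = [
        [r["doc_id"], r["label"]] + concept_vector(r["text"], concepts)
        for r in records
    ]
    return header, rows


# Hypothetical concepts and labeled records.
concepts = {
    "billing_dispute": {"charge", "refund"},
    "outage": {"outage"},
}
records = [
    {"doc_id": "d1", "label": "billing", "text": "refund the duplicate charge"},
    {"doc_id": "d2", "label": "network", "text": "outage in region east"},
]
header, rows = build_table(records, concepts)
```

The resulting rows are ordinary structured data, so any tabular mining tool could operate on them alongside the original class labels.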
12. A method for prediction analysis using text categorization, the method comprising the steps of:
providing a structured data table having a plurality of class labels;
grouping a plurality of text documents into classes based on the class labels;
selecting a top m most discriminatory terms having the highest calculated fitness measure for each class of documents;
determining for each document the presence or absence of each of the discriminatory terms;
determining a concept for each class, the concept being associated with the respective class;
determining, for at least a portion of the plurality of documents, if a given concept is associated with each respective document;
creating a database of the concepts and the associated documents; and
performing distributed data mining on the database to form a predictive result.
Dependent claims: 13, 14, 15, 16
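One possible shape for the "database of the concepts and the associated documents" in claim 12 is a two-column association table. The sketch below uses an in-memory SQLite database from the Python standard library; the table name `concept_doc`, its columns, and the helper names are assumptions, since the claim does not specify a schema or storage engine.

```python
import sqlite3


def build_concept_db(associations):
    """Store (concept, doc_id) pairs in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE concept_doc ("
        " concept TEXT NOT NULL,"
        " doc_id TEXT NOT NULL,"
        " PRIMARY KEY (concept, doc_id))"
    )
    conn.executemany(
        "INSERT INTO concept_doc (concept, doc_id) VALUES (?, ?)",
        associations,
    )
    conn.commit()
    return conn


def docs_for_concept(conn, concept):
    """Return the ids of documents associated with a concept, sorted."""
    cursor = conn.execute(
        "SELECT doc_id FROM concept_doc WHERE concept = ? ORDER BY doc_id",
        (concept,),
    )
    return [row[0] for row in cursor]


# Hypothetical associations produced by an earlier concept-matching step.
conn = build_concept_db(
    [("outage", "d3"), ("outage", "d2"), ("billing_dispute", "d1")]
)
outage_docs = docs_for_concept(conn, "outage")
```

A mining step could then query this table per concept or per document, depending on how the data is partitioned.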
17. A system for prediction analysis using text categorization comprising:
at least one memory unit; and
a plurality of processing units, the plurality of processing units
grouping a plurality of text documents into a plurality of classes,
selecting a top m most discriminatory terms for each class of documents using statistical based measures,
determining for each document the presence or absence of each of the discriminatory terms,
learning rule-based models of each class of documents using a rule learning algorithm,
determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document,
creating a database of the rules associated with documents satisfying the rules and
performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents.
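As a rough illustration of how several processing units could share the rule-checking and mining work of claim 17, the sketch below evaluates rules on separate document partitions and merges the per-partition counts. Everything specific here is an assumption: the rules in `RULES` are hypothetical conjunctions of terms rather than output of a real rule learner, threads stand in for processing units, and the majority-vote prediction is only one simple way to form a predictive result.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical learned rules: rule name -> terms that must all be present.
RULES = {
    "r_billing": {"charge", "invoice"},
    "r_network": {"outage"},
}


def satisfied_rules(text, rules=RULES):
    """Names of the rules whose required terms all occur in the text."""
    tokens = set(text.lower().split())
    return [name for name, required in rules.items() if required <= tokens]


def mine_partition(partition):
    """Local step: count (rule, label) co-occurrences in one partition."""
    counts = Counter()
    for label, text in partition:
        for rule in satisfied_rules(text):
            counts[(rule, label)] += 1
    return counts


def distributed_mine(partitions, workers=2):
    """Run the local step on each partition in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_counts = list(pool.map(mine_partition, partitions))
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return total


def predict(text, total):
    """Vote for the label most often seen with the rules the text satisfies."""
    votes = Counter()
    for rule in satisfied_rules(text):
        for (r, label), n in total.items():
            if r == rule:
                votes[label] += n
    return votes.most_common(1)[0][0] if votes else None


partitions = [
    [("billing", "invoice charge overdue"), ("network", "router outage")],
    [("network", "outage packet loss"), ("billing", "charge on invoice")],
]
total = distributed_mine(partitions)
label = predict("outage reported at noon", total)
```

Because the local counts are simply summed, the merge step gives the same totals regardless of how documents are split across partitions.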
Specification