Distributed method for integrating data mining and text categorization techniques

US 20080097937A1
Filed: 09/28/2007
Published: 04/24/2008
Est. Priority Date: 07/10/2003
Status: Abandoned Application

Abstract

A method for prediction analysis using text categorization is provided. The method includes the steps of: grouping a plurality of text documents into a plurality of classes; selecting a top m most discriminatory terms for each class of documents using statistical based measures; determining for each document the presence or absence of each of the discriminatory terms; learning rule-based models of each class of documents using a rule learning algorithm; determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document; creating a database of the rules associated with documents satisfying the rules; and performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents.
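
The term-selection and presence/absence steps can be illustrated with a minimal sketch. The abstract only says "statistical based measures", so the chi-square statistic used here is an assumption, as is keeping only terms that are positively associated with a class. The function names, the whitespace tokenizer, and the sample documents are all hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def top_m_terms(docs, labels, m):
    """Return {class: [m highest-scoring terms]} (chi-square is an assumed measure)."""
    doc_terms = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*doc_terms))
    selected = {}
    for cls in sorted(set(labels)):
        scores = {}
        for term in vocab:
            pairs = list(zip(doc_terms, labels))
            n11 = sum(1 for t, y in pairs if term in t and y == cls)
            n10 = sum(1 for t, y in pairs if term in t and y != cls)
            n01 = sum(1 for t, y in pairs if term not in t and y == cls)
            n00 = len(docs) - n11 - n10 - n01
            # Assumption: ignore terms that indicate the *absence* of the class.
            positive = n11 * n00 > n10 * n01
            scores[term] = chi_square(n11, n10, n01, n00) if positive else 0.0
        ranked = sorted(vocab, key=lambda t: (-scores[t], t))
        selected[cls] = ranked[:m]
    return selected


def presence_vector(doc, features):
    """Binary vector: 1 if the discriminatory term occurs in the document, else 0."""
    words = set(doc.lower().split())
    return [1 if f in words else 0 for f in features]


docs = [
    "engine failure reported",
    "engine overheating failure",
    "invoice payment overdue",
    "payment received invoice",
]
labels = ["fault", "fault", "billing", "billing"]

terms = top_m_terms(docs, labels, m=2)
features = sorted({t for ts in terms.values() for t in ts})
vectors = [presence_vector(d, features) for d in docs]
```

Here `features` is the union of the per-class term lists, so every document is described by the same columns regardless of its class.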

241 Citations

17 Claims

1. A method for prediction analysis using text categorization, the method comprising the steps of:
- grouping a plurality of text documents into a plurality of classes;
- selecting a top m most discriminatory terms for each class of documents using statistical based measures;
- determining for each document the presence or absence of each of the discriminatory terms;
- learning rule-based models of each class of documents using a rule learning algorithm;
- determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document;
- creating a database of the rules associated with documents satisfying the rules; and
- performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents.

2. The method of claim 1 further comprising the step of representing each document in terms of a numeric vector indicating the presence or absence of the discriminatory terms.

3. The method of claim 1 wherein the plurality of text documents are from an unstructured database.

4. The method of claim 1 further comprising the step of representing each document in terms of a numeric vector indicating whether a learned rule has been satisfied by the document.

5. The method of claim 1 wherein the step of performing data mining includes utilizing a decision tree to form the predictive result.

6. The method of claim 1 wherein the step of performing data mining includes the steps of:
- collecting candidate attributes by a mediator from a plurality of agents;
- selecting a winning agent;
- initiating data splitting by the winning agent;
- forwarding split data index information from the winning agent to the mediator;
- forwarding the split data index information from the mediator to each of the agents; and
- initiating data splitting by each of the agents other than the winning agent.
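
The mediator/agent exchange recited in claim 6 can be sketched for a single decision-tree node as follows. This is only an illustration under stated assumptions: the claims do not say how data is partitioned or how the winner is chosen, so the sketch assumes each agent holds different attribute columns for the same rows, that class labels are visible to every agent, and that the winner is the agent offering the highest information gain. The `Agent` and `Mediator` classes and all attribute names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


class Agent:
    """Holds some attribute columns for every row (assumed vertical partition)."""

    def __init__(self, name, columns, labels):
        self.name = name
        self.columns = columns    # {attribute: [value per row]}
        self.labels = labels      # class labels, assumed known to all agents
        self.partitions = None    # {branch value: [row indices]} after a split

    def propose(self, rows):
        """Return (gain, attribute) for the best local candidate attribute."""
        base = entropy([self.labels[i] for i in rows])
        best = (-1.0, None)
        for attr, values in sorted(self.columns.items()):
            groups = defaultdict(list)
            for i in rows:
                groups[values[i]].append(self.labels[i])
            remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
            best = max(best, (base - remainder, attr))
        return best

    def split(self, attr, rows):
        """Winning agent: split on its own attribute and return the index info."""
        index = defaultdict(list)
        for i in rows:
            index[self.columns[attr][i]].append(i)
        self.partitions = dict(index)
        return self.partitions

    def apply_split(self, index):
        """Other agents: partition local data using the forwarded row indices."""
        self.partitions = {branch: list(rows) for branch, rows in index.items()}


class Mediator:
    def __init__(self, agents):
        self.agents = agents

    def split_node(self, rows):
        # Collect candidate attributes from all agents.
        candidates = [(agent.propose(rows), agent) for agent in self.agents]
        # Select the winning agent (assumed criterion: highest gain).
        (gain, attr), winner = max(candidates, key=lambda c: c[0][0])
        # The winner splits its data and returns the split index information.
        index = winner.split(attr, rows)
        # Forward the index to the remaining agents, which then split as well.
        for agent in self.agents:
            if agent is not winner:
                agent.apply_split(index)
        return winner.name, attr, index


labels = ["yes", "yes", "no", "no"]
text_agent = Agent("text", {"rule_engine": [1, 1, 0, 0]}, labels)
table_agent = Agent("table", {"region": ["east", "west", "east", "west"]}, labels)
winner_name, attr, index = Mediator([text_agent, table_agent]).split_node([0, 1, 2, 3])
```

Only row indices cross the mediator in this sketch; attribute values stay with the agent that owns them, which is one plausible reading of "split data index information".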

7. A method for prediction analysis using text categorization, the method comprising the steps of:
- providing a structured data table having a plurality of class labels;
- grouping a plurality of text documents into classes based on the class labels;
- selecting a top m most discriminatory terms having the highest calculated fitness measure for each class of documents;
- determining for each document the presence or absence of each of the discriminatory terms;
- determining at least one concept for each class, the concept being associated with the respective class;
- determining, for at least a portion of the plurality of documents, if a given concept is associated with each respective document;
- forming a numeric vector for each document indicating if the document is associated with each respective concept;
- creating a structured data table of the vectors; and
- performing distributed data mining on the structured data table to form a predictive result.

8. The method of claim 7 further comprising the step of representing each document in terms of a numeric vector indicating the presence or absence of the discriminatory terms.

9. The method of claim 7 wherein the plurality of text documents are from an unstructured database.

10. The method of claim 7 wherein the step of performing data mining includes utilizing a decision tree to form the predictive result.

11. The method of claim 7 wherein the step of performing data mining includes the steps of:
- collecting candidate attributes by a mediator from a plurality of agents;
- selecting a winning agent;
- initiating data splitting by the winning agent;
- forwarding split data index information from the winning agent to the mediator;
- forwarding the split data index information from the mediator to each of the agents; and
- initiating data splitting by each of the agents other than the winning agent.
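
The concept-vector and structured-table steps of claim 7 can be sketched as below. The claims do not define how a concept is matched against a document, so this sketch assumes a concept is a named set of terms and that a document is associated with it when any of those terms occurs. The concept names, the table layout, and the sample documents are hypothetical.

```python
def concept_vector(doc, concepts):
    """One 0/1 entry per concept, in sorted concept-name order (assumed matching rule)."""
    words = set(doc.lower().split())
    return [1 if words & terms else 0 for _, terms in sorted(concepts.items())]


def build_table(docs, labels, concepts):
    """Structured table: one row per document with concept flags and the class label."""
    header = ["doc_id"] + sorted(concepts) + ["class"]
    rows = [
        [i] + concept_vector(doc, concepts) + [label]
        for i, (doc, label) in enumerate(zip(docs, labels))
    ]
    return header, rows


concepts = {
    "engine_fault": {"engine", "overheating"},
    "billing_issue": {"invoice", "payment"},
}
docs = ["engine failure reported", "payment received invoice"]
labels = ["fault", "billing"]

header, rows = build_table(docs, labels, concepts)
```

The resulting rows have a fixed set of numeric columns, which is what lets a conventional data-mining step such as a decision tree consume what started as free text.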

12. A method for prediction analysis using text categorization, the method comprising the steps of:
- providing a structured data table having a plurality of class labels;
- grouping a plurality of text documents into classes based on the class labels;
- selecting a top m most discriminatory terms having the highest calculated fitness measure for each class of documents;
- determining for each document the presence or absence of each of the discriminatory terms;
- determining a concept for each class, the concept being associated with the respective class;
- determining, for at least a portion of the plurality of documents, if a given concept is associated with each respective document;
- creating a database of the concepts and the associated documents; and
- performing distributed data mining on the database to form a predictive result.

13. The method of claim 12 further comprising the step of representing each document in terms of a numeric vector indicating the presence or absence of the discriminatory terms.

14. The method of claim 12 wherein the plurality of text documents are from an unstructured database.

15. The method of claim 12 wherein the step of performing data mining includes utilizing a decision tree to form the predictive result.

16. The method of claim 12 wherein the step of performing data mining includes the steps of:
- collecting candidate attributes by a mediator from a plurality of agents;
- selecting a winning agent;
- initiating data splitting by the winning agent;
- forwarding split data index information from the winning agent to the mediator;
- forwarding the split data index information from the mediator to each of the agents; and
- initiating data splitting by each of the agents other than the winning agent.
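
Claim 12 stores concept-to-document associations in a database rather than a vector table. One possible layout, offered purely as an assumption, is a two-column relation queried per concept. The sketch uses Python's standard `sqlite3` module with an in-memory database; the table name `concept_doc` and the helper functions are hypothetical.

```python
import sqlite3


def build_concept_db(associations):
    """Store (concept, doc_id) pairs in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE concept_doc ("
        "concept TEXT, doc_id INTEGER, PRIMARY KEY (concept, doc_id))"
    )
    conn.executemany("INSERT INTO concept_doc VALUES (?, ?)", associations)
    conn.commit()
    return conn


def docs_for(conn, concept):
    """Return the ids of all documents associated with the given concept."""
    cur = conn.execute(
        "SELECT doc_id FROM concept_doc WHERE concept = ? ORDER BY doc_id",
        (concept,),
    )
    return [row[0] for row in cur]


conn = build_concept_db(
    [("engine_fault", 0), ("engine_fault", 1), ("billing_issue", 2)]
)
engine_docs = docs_for(conn, "engine_fault")
```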

17. A system for prediction analysis using text categorization comprising:
- at least one memory unit; and
- a plurality of processing units, the plurality of processing units grouping a plurality of text documents into a plurality of classes, selecting a top m most discriminatory terms for each class of documents using statistical based measures, determining for each document the presence or absence of each of the discriminatory terms, learning rule-based models of each class of documents using a rule learning algorithm, determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document, creating a database of the rules associated with documents satisfying the rules and performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents.
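
The rule-learning, rule-satisfaction, and rule-database steps named in claims 1 and 17 can be sketched with a small greedy learner. The claims do not identify a specific rule learning algorithm, so the sequential-covering approach below is a stand-in chosen for brevity, not the applicant's method. Rules are assumed to be conjunctions of terms that must all be present; every function name and the sample data are hypothetical.

```python
def learn_rules(doc_terms, labels, cls, features):
    """Greedy sequential covering (assumed algorithm): each rule is a frozenset of terms."""
    rules = []
    uncovered = {i for i, y in enumerate(labels) if y == cls}
    while uncovered:
        rule, covered = frozenset(), set(range(len(labels)))
        # Grow the rule until it covers only documents of the target class.
        while any(labels[i] != cls for i in covered):
            def score(term):
                kept = {i for i in covered if term in doc_terms[i]}
                pos = len(kept & uncovered)
                return (pos / len(kept) if kept else 0.0, pos)

            candidates = [t for t in features if t not in rule]
            if not candidates:
                break
            best = max(candidates, key=score)
            if score(best)[1] == 0:
                break
            rule = rule | {best}
            covered = {i for i in covered if best in doc_terms[i]}
        newly = covered & uncovered
        if not newly or any(labels[i] != cls for i in covered):
            break  # no pure rule found for the remaining documents
        rules.append(rule)
        uncovered -= newly
    return rules


def rule_vector(terms, rules):
    """Binary vector: 1 if the document satisfies the rule, else 0."""
    return [1 if rule <= terms else 0 for rule in rules]


def build_rule_db(doc_terms, rules):
    """Map each rule to the ids of the documents that satisfy it."""
    return {rule: [i for i, t in enumerate(doc_terms) if rule <= t] for rule in rules}


doc_terms = [
    {"engine", "failure", "reported"},
    {"engine", "overheating", "failure"},
    {"invoice", "payment", "overdue"},
    {"payment", "received", "invoice"},
]
labels = ["fault", "fault", "billing", "billing"]
features = ["engine", "failure", "invoice", "payment"]

fault_rules = learn_rules(doc_terms, labels, "fault", features)
billing_rules = learn_rules(doc_terms, labels, "billing", features)
all_rules = fault_rules + billing_rules
rule_db = build_rule_db(doc_terms, all_rules)
```

The learner stops adding terms as soon as a rule covers only documents of the target class, so the sample data yields one single-term rule per class.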

Current Assignee: Inferx Corporation
Original Assignee: Inferx Corporation
Inventors: Hadjarian, Ali
Application Number: US11/904,674
Publication Number: US 20080097937A1
Field of Search
US Class Current: 706/12
CPC Class Codes:
- G06F 16/2465 Query processing support fo...
- G06N 20/00 Machine learning
