Using rule induction to identify emerging trends in unstructured text streams

US 8,712,926 B2
Filed: 05/23/2008
Issued: 04/29/2014
Est. Priority Date: 05/23/2008
Status: Active Grant



Abstract

A method for identifying emerging concepts in unstructured text streams comprises: selecting a subset V of documents from a set U of documents; generating at least one Boolean combination of terms that partitions the set U into a plurality of categories that represent a generalized, statistically based model of the selected subset V wherein the categories are disjoint inasmuch as each document of U is included in only one category of the partition; and generating a descriptive label for each of the disjoint categories from the Boolean combination of terms for that category.
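
The steps in the abstract can be illustrated with a minimal Python sketch. The toy corpus, the one-month window, and the helper names `select_recent` and `categorize` are hypothetical, and the Boolean term combinations are written by hand here rather than generated from a statistical model as the abstract describes.

```python
from datetime import date

# Hypothetical toy corpus U: (document text, date written).
U = [
    ("battery overheats while charging", date(2008, 5, 2)),
    ("battery drains fast", date(2008, 1, 10)),
    ("screen cracked after drop", date(2008, 5, 9)),
    ("battery overheats in the car", date(2008, 5, 15)),
    ("screen flickers", date(2008, 2, 20)),
]

def select_recent(docs, start, end):
    """Return the indices of documents written in [start, end] (the subset V)."""
    return {i for i, (_, written) in enumerate(docs) if start <= written <= end}

def categorize(text):
    """Return one descriptive label built from a Boolean combination of terms.

    The if/elif/else chain keeps the categories disjoint: every document
    falls into exactly one branch.
    """
    words = set(text.split())
    if "battery" in words and "overheats" in words:
        return "battery AND overheats"
    elif "battery" in words:
        return "battery AND NOT overheats"
    else:
        return "NOT battery"

# V: documents written in May 2008 (an assumed window).
V = select_recent(U, date(2008, 5, 1), date(2008, 5, 31))

# Partition U: label -> list of document indices.
partition = {}
for i, (text, _) in enumerate(U):
    partition.setdefault(categorize(text), []).append(i)
```

In this toy data every document labeled "battery AND overheats" also belongs to V, which is the kind of category the method would surface as an emerging concept.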

9 Claims

1. A system including a computer processor configured to operate a plurality of modules, said modules comprising:
- a decision module configured to use a decision tree to classify documents from a set U of documents into categories based on a subset V of U, wherein the subset V comprises documents of U that were written within a specific time period, and the subset V provides an indication of emerging trends in the set U of documents that occur at a higher frequency during the specific time period than outside the specific time period, wherein the decision module utilizes an entropy function that favors splitting the set U into categories, and wherein the decision module creates a separate category for the documents in V and also the documents in U that are not in V;
- a conversion module configured to convert the decision tree into a logically equivalent rule set, wherein each document of U is guaranteed to only be classified by one rule of the rule set, wherein the rule set is configured as a sortable table;
- a labeling module configured to label, for each one of the categories based on the subset V, a text event, wherein the labeling module is configured to label the text event with each of a plurality of antecedents including positive and negative antecedents on a path from a leaf node to the root node of the decision tree, wherein each antecedent corresponds to a particular leaf node on the path; and
- a display module configured to display a list of results based on the text event labels to a user.
  - 2. The system of claim 1, wherein each leaf node classifies documents for one of the categories based on the subset V.
  - 3. The system of claim 1, wherein the display module is configured to:
    - remove negative antecedents from a text event label; and
    - display positive antecedents of the text event label.
  - 4. The system of claim 1, wherein the display module is configured to:
    - remove negative antecedents from a text event label; and
    - display the text event as “Miscellaneous” if the category of the text event has no positive antecedents in the text event label.
  - 5. The system of claim 2, wherein:
    - a feature space is created over U;
    - the decision tree is applied to the feature space in classifying the documents of U; and
    - the plurality of antecedents are based on features of the feature space.
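
As a rough illustration of claims 1, 3 and 4, the sketch below converts a small hand-built decision tree into a rule set, checks that a document matches exactly one rule, and derives display labels. The `Node` class, the function names, and the example tree are assumptions made for illustration; they are not taken from the patent. The code walks from the root to each leaf, which collects the same antecedents as the leaf-to-root path named in the claim.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    term: Optional[str] = None         # term tested at an internal node
    present: Optional["Node"] = None   # branch taken when the term occurs
    absent: Optional["Node"] = None    # branch taken when the term is missing
    category: Optional[str] = None     # set only on leaf nodes

def tree_to_rules(node, path=()):
    """Return one rule per leaf: (antecedents on the path, leaf category).

    Each antecedent is (term, True) if positive or (term, False) if negative.
    """
    if node.category is not None:
        return [(list(path), node.category)]
    return (tree_to_rules(node.present, path + ((node.term, True),))
            + tree_to_rules(node.absent, path + ((node.term, False),)))

def matching_rules(rules, words):
    """Return the indices of every rule whose antecedents all hold for a document."""
    return [i for i, (antecedents, _) in enumerate(rules)
            if all((term in words) == positive for term, positive in antecedents)]

def display_label(antecedents):
    """Drop negative antecedents; use 'Miscellaneous' when no positive ones remain."""
    positives = [term for term, positive in antecedents if positive]
    return " AND ".join(positives) if positives else "Miscellaneous"

# Hypothetical tree separating documents in V from those not in V.
tree = Node(
    term="battery",
    present=Node(term="overheats",
                 present=Node(category="V"),
                 absent=Node(category="not V")),
    absent=Node(category="not V"),
)

rules = tree_to_rules(tree)
labels = [display_label(antecedents) for antecedents, _ in rules]
hits = matching_rules(rules, {"battery", "drains"})
```

Because every internal node sends a document down exactly one branch, the rules derived from the paths are mutually exclusive, so `hits` contains a single rule index. The list of `(antecedents, category)` pairs can be sorted like rows in a table.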

6. A computer program product comprising a non-transitory computer useable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
- identify a dictionary of frequently used terms in a text data set U, wherein identifying the dictionary comprises representing each document of U as a vector of weighted frequencies of the document features, the document features being words and phrases contained in the document, wherein the vector is normalized to have unit Euclidean norm;
- create a feature space that identifies the dictionary term occurrences in each document of U;
- apply a rule induction algorithm to the feature space over U to identify rules that classify documents into categories based on a subset V of U, wherein the rule induction algorithm utilizes an entropy function that favors splitting the set U into categories, and wherein the rule induction algorithm creates a separate category for the documents in V and also the documents in U that are not in V;
- use feature based antecedents of each rule to describe events; and
- display the events using positive and negative antecedents, wherein the subset V comprises documents of U that were written within a specific time period, and the subset V provides an indication of emerging trends in the set U of documents that occur at a higher frequency during the specific time period than outside the specific time period.
  - 7. The computer program product of claim 6, wherein:
    - the computer removes duplicates from the text data set U; and
    - the categories define emerging concepts in the text data set U.
  - 8. The computer program product of claim 6, wherein:
    - creating the feature space comprises indexing the documents of U by their feature occurrences using the vector of weighted frequencies of the document features.
  - 9. The computer program product of claim 6, wherein:
    - the rule induction algorithm is based on a decision tree; and
    - each event is described by labeling the event with the antecedents that occur on the path in the decision tree from the leaf node of the event to the root node.
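
Two steps of claim 6 can be sketched in Python: the unit-norm document vector and an entropy-based score for splitting on a term. The claims do not specify the weighting scheme or the exact entropy function, so this sketch assumes raw term counts as weights and standard Shannon information gain as a stand-in. All function names and the sample documents are hypothetical.

```python
import math
from collections import Counter

def unit_vector(text):
    """Term-count vector scaled to unit Euclidean norm (raw counts assumed as weights)."""
    counts = Counter(text.split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(docs, in_v, term):
    """Entropy reduction from splitting the documents on whether `term` occurs."""
    with_term = [v for d, v in zip(docs, in_v) if term in d.split()]
    without = [v for d, v in zip(docs, in_v) if term not in d.split()]
    remainder = sum(len(part) / len(in_v) * entropy(part)
                    for part in (with_term, without) if part)
    return entropy(in_v) - remainder

# Hypothetical documents and their membership in V (True) or U minus V (False).
docs = ["battery overheats", "battery drains",
        "screen cracked", "battery overheats again"]
in_v = [True, False, False, True]

vec = unit_vector("battery battery overheats")
gain_overheats = information_gain(docs, in_v, "overheats")
gain_battery = information_gain(docs, in_v, "battery")
```

In this toy data "overheats" separates V from the rest perfectly, so it scores a higher gain than "battery" and would be chosen as the first split.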


Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Behal, Amit; Chen, Ying; Spangler, William Scott
Primary Examiner(s): RIFKIN, BEN M
Application Number: US12/126,829
Publication Number: US 20090292660A1
Time in Patent Office: 2,167 Days
Field of Search: 706/12
US Class Current: 706/12
CPC Class Codes:
G06F 16/93 Document management systems
G06N 20/00 Machine learning
