Using rule induction to identify emerging trends in unstructured text streams
Abstract
A method for identifying emerging concepts in unstructured text streams comprises: selecting a subset V of documents from a set U of documents; generating at least one Boolean combination of terms that partitions the set U into a plurality of categories that represent a generalized, statistically based model of the selected subset V wherein the categories are disjoint inasmuch as each document of U is included in only one category of the partition; and generating a descriptive label for each of the disjoint categories from the Boolean combination of terms for that category.
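The partition described in the abstract can be illustrated with a minimal sketch. The rules, terms, and sample documents below are hypothetical and chosen only so that every document falls into exactly one category; they are not taken from the patent.

```python
# Hypothetical rules: each is (required terms, forbidden terms).
# Together they are written so that exactly one rule covers any document.
RULES = [
    ({"recall"}, set()),              # recall
    ({"battery"}, {"recall"}),        # battery AND NOT recall
    (set(), {"recall", "battery"}),   # NOT battery AND NOT recall
]

def matches(doc_terms, rule):
    """True if the document has every required term and no forbidden term."""
    required, forbidden = rule
    return required <= doc_terms and not (forbidden & doc_terms)

def label(rule):
    """Build a descriptive label from the Boolean combination of terms."""
    required, forbidden = rule
    parts = sorted(required) + ["NOT " + t for t in sorted(forbidden)]
    return " AND ".join(parts)

def categorize(docs):
    """Map each rule label to the indices of the documents it covers."""
    categories = {label(r): [] for r in RULES}
    for i, text in enumerate(docs):
        terms = set(text.lower().split())
        hits = [r for r in RULES if matches(terms, r)]
        # Disjointness check: each document belongs to exactly one category.
        assert len(hits) == 1, "rules must partition the document set"
        categories[label(hits[0])].append(i)
    return categories

docs = [
    "battery recall announced",
    "battery life is short",
    "great screen",
    "recall notice",
]
result = categorize(docs)
```

Here `result` maps each label, such as `battery AND NOT recall`, to the documents in that category, and the assertion inside `categorize` fails if the rules ever overlap or leave a document uncovered.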
9 Claims
1. A system including a computer processor configured to operate a plurality of modules, said modules comprising:
- a decision module configured to use a decision tree to classify documents from a set U of documents into categories based on a subset V of U, wherein the subset V comprises documents of U that were written within a specific time period, and the subset V provides an indication of emerging trends in the set U of documents that occur at a higher frequency during the specific time period than outside the specific time period, wherein the decision module utilizes an entropy function that favors splitting the set U into categories, and wherein the decision module creates a separate category for the documents in V and also the documents in U that are not in V;
- a conversion module configured to convert the decision tree into a logically equivalent rule set, wherein each document of U is guaranteed to only be classified by one rule of the rule set, wherein the rule set is configured as a sortable table;
- a labeling module configured to label, for each one of the categories based on the subset V, a text event, wherein the labeling module is configured to label the text event with each of a plurality of antecedents including positive and negative antecedents on a path from a leaf node to the root node of the decision tree, wherein each antecedent corresponds to a particular leaf node on the path; and
- a display module configured to display a list of results based on the text event labels to a user.
Dependent claims: 2, 3, 4, 5
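The decision, conversion, and labeling steps of claim 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: standard information gain stands in for the claim's entropy function (whose exact form is not given in this excerpt), documents are modeled as sets of terms, and all names and sample data are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy in bits of a list of booleans (True = document is in V)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def best_term(docs, labels, terms):
    """Pick the term whose presence/absence split has the largest information gain."""
    def gain(term):
        yes = [l for d, l in zip(docs, labels) if term in d]
        no = [l for d, l in zip(docs, labels) if term not in d]
        n = len(labels)
        return entropy(labels) - len(yes) / n * entropy(yes) - len(no) / n * entropy(no)
    return max(sorted(terms), key=gain)

def build_tree(docs, labels, terms, depth=2):
    """Return ('leaf', n_in_V, n_total) or (term, present_subtree, absent_subtree)."""
    if depth == 0 or not terms or entropy(labels) == 0.0:
        return ("leaf", sum(labels), len(labels))
    term = best_term(docs, labels, terms)
    yes = [(d, l) for d, l in zip(docs, labels) if term in d]
    no = [(d, l) for d, l in zip(docs, labels) if term not in d]
    if not yes or not no:
        return ("leaf", sum(labels), len(labels))
    rest = terms - {term}
    return (term,
            build_tree([d for d, _ in yes], [l for _, l in yes], rest, depth - 1),
            build_tree([d for d, _ in no], [l for _, l in no], rest, depth - 1))

def tree_to_rules(tree, path=()):
    """One rule per leaf; antecedents are the (term, present?) tests on its path."""
    if tree[0] == "leaf":
        return [{"antecedents": list(path), "in_V": tree[1], "total": tree[2]}]
    term, yes, no = tree
    return (tree_to_rules(yes, path + ((term, True),)) +
            tree_to_rules(no, path + ((term, False),)))

def describe(rule):
    """Label a rule with its positive and negative antecedents."""
    return " AND ".join(t if present else "NOT " + t
                        for t, present in rule["antecedents"])

# Hypothetical data: each document is a set of terms; in_V marks the recent ones.
docs = [{"battery", "recall"}, {"battery", "overheat"}, {"screen"},
        {"battery"}, {"price"}, {"recall", "price"}]
in_V = [True, True, False, False, False, True]
vocabulary = {"battery", "recall", "overheat", "screen", "price"}

rules = tree_to_rules(build_tree(docs, in_V, vocabulary))
# Sortable table: rules ordered by the share of their documents that lie in V.
table = sorted(rules, key=lambda r: r["in_V"] / r["total"], reverse=True)
labels = [describe(r) for r in rules]
```

Because every document follows exactly one root-to-leaf path, each document is covered by exactly one rule, which is the disjointness property the conversion module relies on.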
6. A computer program product comprising a non-transitory computer useable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to:
- identify a dictionary of frequently used terms in a text data set U, wherein identifying the dictionary comprises representing each document of U as a vector of weighted frequencies of the document features, the document features being words and phrases contained in the document, wherein the vector is normalized to have unit Euclidean norm;
- create a feature space that identifies the dictionary term occurrences in each document of U;
- apply a rule induction algorithm to the feature space over U to identify rules that classify documents into categories based on a subset V of U, wherein the rule induction algorithm utilizes an entropy function that favors splitting the set U into categories, and wherein the rule induction algorithm creates a separate category for the documents in V and also the documents in U that are not in V;
- use feature based antecedents of each rule to describe events; and
- display the events using positive and negative antecedents, wherein the subset V comprises documents of U that were written within a specific time period, and the subset V provides an indication of emerging trends in the set U of documents that occur at a higher frequency during the specific time period than outside the specific time period.
Dependent claims: 7, 8, 9
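The first two steps of claim 6 can be illustrated with a small sketch. Several details are assumptions, since this excerpt does not specify them: raw term counts serve as the "weighted frequencies", single whitespace-separated words stand in for "words and phrases", and a dictionary term is any word appearing in at least `min_docs` documents. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter

def unit_vector(text):
    """Term counts scaled so the resulting vector has Euclidean norm 1."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def build_dictionary(docs, min_docs=2):
    """Keep terms found in at least min_docs documents (threshold is an assumption)."""
    doc_freq = Counter(t for d in docs for t in set(d.lower().split()))
    return sorted(t for t, n in doc_freq.items() if n >= min_docs)

def feature_space(docs, dictionary):
    """One binary row per document: 1 where the dictionary term occurs."""
    rows = []
    for d in docs:
        terms = set(d.lower().split())
        rows.append([1 if t in terms else 0 for t in dictionary])
    return rows

docs = ["battery recall battery", "battery life", "recall notice"]
vec = unit_vector(docs[0])
dictionary = build_dictionary(docs)
features = feature_space(docs, dictionary)
```

The rows of `features` are the kind of input a rule induction step could then split on, with one column per dictionary term.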
Specification