Distributed hierarchical text classification framework
Abstract
A method and system for distributed training of a hierarchical classifier for classifying documents using a classification hierarchy is provided. A training system provides training data that includes the documents and classifications of the documents within the classification hierarchy. The training system distributes the training of the classifiers of the hierarchical classifier to various agents so that the classifiers can be trained in parallel. For each classifier, the training system identifies an agent that is to train the classifier. Each agent then trains its classifiers.
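The parallel training described in the abstract can be illustrated with a minimal sketch. It assumes the assignment of classifiers to agents has already been decided, and it uses threads in one process as stand-ins for agents. The names `train_classifier`, `agent_run`, and `train_in_parallel` are hypothetical and do not come from the patent, and the "training" step only records how many documents it saw.

```python
from concurrent.futures import ThreadPoolExecutor


def train_classifier(classification, documents):
    """Hypothetical stand-in for real training: returns a small model record."""
    return {"classification": classification, "num_documents": len(documents)}


def agent_run(assigned, training_data):
    """One agent trains each classifier assigned to it, one after another."""
    return [train_classifier(c, training_data[c]) for c in assigned]


def train_in_parallel(assignment, training_data):
    """Run every agent concurrently and collect the trained classifiers.

    assignment maps an agent name to the list of classifications it trains.
    Threads are an assumption made for illustration; the agents could equally
    be separate machines.
    """
    with ThreadPoolExecutor(max_workers=len(assignment)) as pool:
        futures = {
            agent: pool.submit(agent_run, classifications, training_data)
            for agent, classifications in assignment.items()
        }
        models = []
        for agent in assignment:
            models.extend(futures[agent].result())
    return models
```

The list returned by `train_in_parallel` plays the role of the hierarchical classifier: one trained classifier per classification, gathered from all agents.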
15 Claims
1. A method in a computing device with a processor for training a hierarchical classifier for classification of documents into a classification hierarchy, the method comprising:
providing the classification hierarchy in which classifications have sub-classifications except for leaf classifications;
providing training data for training the classifiers, the training data including documents and classifications of the documents within the classification hierarchy, the classification of a document indicating that the document is in that classification and ancestor classifications of that classification, each classification having a number of documents;
generating a classifier for each classification within the classification hierarchy by, for each classification within the classification hierarchy,
determining a complexity for the classifier for the classification, the complexity of the classifier varying nonlinearly based on the number of documents within the classification;
identifying by the processor one of a plurality of agents to train the classifier for that classification such that one agent is identified to train one classifier and some of the agents are identified to train multiple classifiers, the agents being identified to balance training load of the agents that is determined based on the determined complexity of the classifiers identified to be trained by each agent wherein the identifying of one of the agents includes;
when a classifier has not yet been assigned to an agent, assigning the classifier to that agent; and
when a classifier has already been assigned to each agent, assigning the classifier to an agent based on complexity of the classifier and complexities of classifiers assigned to each agent such that a classifier with the highest complexity is assigned to an agent that has been assigned classifiers with the smallest total complexity; and
under control of the identified agent, training the classifier for that classification using the documents of the training data that are classified within that classification of the classification hierarchy;
wherein each agent trains classifiers for a varying number of documents of the training data,
wherein the classifiers trained by the multiple agents form the hierarchical classifier, and
wherein the agent for a classifier is identified based on number of documents used.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
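Two steps of claim 1 lend themselves to a short sketch: counting the documents in each classification (a document counts toward its own classification and every ancestor), and turning that count into a nonlinear complexity. The claim does not state the complexity function, so the power law with exponent 2 below is purely an assumption for illustration. The function names and the `parent` / `doc_labels` data layout are likewise hypothetical.

```python
def documents_per_classification(parent, doc_labels):
    """Collect the documents that fall within each classification.

    parent maps each classification to its parent (None for the root).
    doc_labels maps a document id to its most specific classification.
    A document is added to that classification and to all of its ancestors.
    """
    docs = {classification: set() for classification in parent}
    for doc, label in doc_labels.items():
        node = label
        while node is not None:
            docs[node].add(doc)
            node = parent[node]
    return docs


def complexity(num_documents, exponent=2.0):
    """Assumed nonlinear cost model: complexity grows as n ** exponent.

    The exponent is an illustrative choice, not a value taken from the claim.
    """
    return float(num_documents) ** exponent
```

Under this assumed model, a classification with twice as many documents is four times as costly to train, which is why balancing agents by document count alone would leave the load uneven.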
10. A computer system with a processor and a memory for training a hierarchical classifier for classification into a classification hierarchy, comprising:
a classification hierarchy store containing a classification hierarchy in which classifications have sub-classifications except for leaf classifications;
a training data store having training data for training classifiers of the hierarchical classifier, the training data including documents and classifications of the documents within the classification hierarchy, the classification of a document indicating that the document is in that classification and ancestor classifications of that classification as specified by the classification hierarchy; and
a select features for classifier component, a controller, and a plurality of agents implemented as instructions stored in the memory for execution by the processor such that
the select features for classifier component that for each classification of the classification hierarchy, identifies features of the documents of the training data that are to be used for training a classifier for that classification;
the controller that, for each classifier of a classification within the classification hierarchy, identifies one of a plurality of agents to train the classifier and notifies the identified agent to train the classifier, wherein the controller identifies agents by assigning a classifier to each agent and then assigning unassigned classifiers to agents based on complexities of training the unassigned classifiers and complexities of training classifiers already assigned to each agent, wherein an unassigned classifier with the highest complexity is assigned to an agent that has been assigned classifiers with the smallest total complexity, and wherein the complexity of training a classifier varies nonlinearly based on number of documents in the training data for that classifier, each classifier for a classification being trained using documents classified into the classification of which the classification is a sub-classification; and
the plurality of agents executing on different computer devices that receive notifications to train classifiers and train the classifiers using the features of the documents identified for each classification from the training data wherein the classifiers trained by the multiple agents form the hierarchical classifier.
Dependent claims: 11, 12, 13
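The controller's assignment rule (give each agent one classifier, then repeatedly hand the most complex unassigned classifier to the agent with the smallest total complexity) can be sketched as a greedy loop over a min-heap. This is one possible reading of the claim, not the patented implementation. `assign_classifiers` and its arguments are hypothetical names, and the sketch assumes every complexity is positive so that each idle agent receives a classifier before any agent receives a second one.

```python
import heapq


def assign_classifiers(complexities, num_agents):
    """Greedily balance classifier training across agents.

    complexities maps a classifier name to its (positive) training complexity.
    Returns a dict mapping agent index to the list of classifiers assigned.
    """
    assignment = {agent: [] for agent in range(num_agents)}
    # Min-heap of (total complexity assigned so far, agent index).
    loads = [(0.0, agent) for agent in range(num_agents)]
    heapq.heapify(loads)
    # Visit classifiers from highest to lowest complexity (name breaks ties).
    for name in sorted(complexities, key=lambda n: (-complexities[n], n)):
        total, agent = heapq.heappop(loads)  # agent with the smallest load
        assignment[agent].append(name)
        heapq.heappush(loads, (total + complexities[name], agent))
    return assignment
```

For example, with complexities `{"a": 9, "b": 7, "c": 5, "d": 3, "e": 1}` and two agents, the loop assigns `a`, `d`, `e` to agent 0 (total 13) and `b`, `c` to agent 1 (total 12).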
14. A computer-readable storage medium containing instructions for controlling a computer to train a hierarchical classifier for classification into classification hierarchy, by a method comprising:
providing the classification hierarchy in which classifications have sub-classifications except for leaf classifications;
training data for training classifiers of the hierarchical classifier, the training data including documents and classifications of the documents within the classification hierarchy, the classification of a document indicating that the document is in that classification and ancestor classifications of that classification, each classification having a number of documents; and
for each classifier of a classification within the classification hierarchy,
determining a complexity for the classifier for the classification, the complexity of the classifier varying nonlinearly based on the number of documents within the classification;
identifying one of a plurality of agents to train the classifier based on complexities of classifiers assigned to the agent and the determined complexity of the classifier such that one agent is identified to train one classifier and some of the agents are identified to train multiple classifiers, the agents being identified to balance training load of the agents that is determined based on the determined complexity of training the classifiers identified to be trained by each agent wherein the identifying of one of the agents includes;
when a classifier has not yet been assigned to an agent, assigning the classifier to that agent; and
when a classifier has already been assigned to each agent, assigning the classifier to an agent based on complexity of the classifier and complexities of classifiers assigned to each agent such that a classifier with the highest complexity is assigned to an agent that has been assigned classifiers with the smallest total complexity; and
notifying the identified agent to train the classifier using training data that includes documents and classifications.
Dependent claims: 15
Specification