Reducing human overhead in text categorization
Abstract
A multi-stage classification system and method is provided that reduces the human resources or costs associated with text classification while still obtaining a desired level of accuracy. The multi-stage classification system and method involve a pattern-based classifier and a machine learning classifier. The pattern-based classifier is trained on discriminative patterns identified by humans rather than machines, which allows a smaller training set to be employed. Given humans' superior ability to reason over text, they can identify discriminative patterns more accurately and more readily. Unlabeled items can be initially processed by the pattern-based classifier and, if no pattern match exists, then processed by the machine learning classifier. By employing the classifiers in this manner, less human involvement is required in the classification process, while classification accuracy is maintained or improved.
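The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the patterns, labels, and the `toy_learned_classifier` stand-in are all hypothetical, and regular expressions are assumed here only as a convenient way to express human-chosen patterns.

```python
import re
from typing import Callable, Optional

# Hypothetical discriminative patterns picked by a human, each tied to a label.
PATTERNS = [
    (re.compile(r"\breset (my )?password\b", re.IGNORECASE), "account"),
    (re.compile(r"\brefund\b", re.IGNORECASE), "billing"),
]


def pattern_classify(text: str) -> Optional[str]:
    """Stage 1: return the label of the first matching pattern, else None."""
    for pattern, label in PATTERNS:
        if pattern.search(text):
            return label
    return None


def classify(text: str, fallback: Callable[[str], str]) -> str:
    """Run stage 2 (the learned classifier) only when stage 1 assigns no label."""
    label = pattern_classify(text)
    return label if label is not None else fallback(text)


def toy_learned_classifier(text: str) -> str:
    # Stand-in for a trained machine learning model (assumption for this sketch).
    return "other"
```

For example, `classify("Please reset my password", toy_learned_classifier)` is handled entirely by the pattern stage, while a text that matches no pattern falls through to the fallback.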
16 Claims
1. A computer-implemented multi-stage classification system that facilitates reducing human effort in text classification while obtaining a desired level of accuracy comprising:
one or more processors;
memory, accessible by the one or more processors;
a pattern-based classifier component stored in the memory and executable on the one or more processors to assign a label to the input, and to build one or more suffix arrays over a subset of text from the set of training items to determine correlation between each pattern of a plurality of patterns and the label, the pattern-based classifier component to classify the input as having a pattern of the plurality of patterns when a corresponding correlation satisfies a correlation threshold, the set of training items comprising at least one of text documents, messages, and files that are each labeled based on one or more text patterns; and
a learning-based classifier component stored in the memory and executable on the one or more processors to process the input for classification when no label is assigned to the input by the pattern-based classifier component.
Dependent claims: 2, 3, 4, 5, 6
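As a rough illustration of how a suffix array can support the correlation step in claim 1, the sketch below builds one word-level suffix array per label and uses binary search to count phrase occurrences. The correlation measure (the share of a phrase's occurrences that fall under a given label), the `"\x00"` document separator, and all function names are assumptions made for this example; the claim does not specify them.

```python
def build_suffix_array(tokens):
    """Return suffix start positions sorted by the suffix they begin (naive build)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_occurrences(tokens, suffix_array, phrase):
    """Count occurrences of `phrase` (a list of words) by binary search."""
    k = len(phrase)

    def bound(strict):
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = tokens[start:start + k]
            if prefix < phrase or (strict and prefix == phrase):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(True) - bound(False)


def label_correlation(phrase, docs_by_label, label):
    """Share of the phrase's occurrences that fall in documents with `label`."""
    counts = {}
    for lab, docs in docs_by_label.items():
        tokens = []
        for doc in docs:
            tokens.extend(doc.lower().split())
            tokens.append("\x00")  # separator so phrases cannot span documents
        counts[lab] = count_occurrences(tokens, build_suffix_array(tokens), phrase)
    total = sum(counts.values())
    return counts[label] / total if total else 0.0
```

A pattern would then be kept for a label when `label_correlation(...)` meets a chosen threshold. The naive sort used to build the array is fine for a sketch but would be replaced by a dedicated construction algorithm on large training sets.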
7. A computer-readable memory having computer-executable instructions that, when executed, cause one or more processors to perform a method of classification that facilitates reducing human effort in text classification while obtaining a desired level of accuracy, comprising:
training a pattern-based classifier, which comprises:
  providing a set of training texts;
  identifying patterns in the training texts, the patterns being an ordered string over words comprising at least one of optional words, disjunctions and gaps;
  presenting at least a subset of the patterns to a user;
  receiving a selection of at least one of the subset of the patterns as one or more discriminative patterns by the user; and
  assigning a corresponding label to each discriminative pattern;
receiving an unlabeled input document; and
classifying the unlabeled input document using a multi-stage classifier, the multi-stage classifier comprising the pattern-based classifier and a learning-based classifier, wherein the unlabeled input is classified using the learning-based classifier when the pattern-based classifier is unable to assign at least one corresponding label of the one or more discriminative patterns to the unlabeled input document as the unlabeled input document does not match the one or more discriminative patterns.
Dependent claims: 8, 9, 10, 11, 12, 13
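To make the pattern language in claim 7 concrete (an ordered string over words with optional words, disjunctions and gaps), here is one possible encoding. The `(kind, value)` element format and the translation to a regular expression are assumptions made for this sketch, not something the claim prescribes; punctuation handling is deliberately left out.

```python
import re


def compile_pattern(elements):
    """Translate a list of (kind, value) elements into a word-level regex."""
    parts = []
    for kind, value in elements:
        if kind == "word":    # a required word
            parts.append(re.escape(value) + " ")
        elif kind == "opt":   # an optional word
            parts.append("(?:" + re.escape(value) + " )?")
        elif kind == "alt":   # a disjunction: any one of several words
            parts.append("(?:" + "|".join(re.escape(w) for w in value) + ") ")
        elif kind == "gap":   # a gap of up to `value` arbitrary words
            parts.append(r"(?:\S+ ){0,%d}" % value)
        else:
            raise ValueError(f"unknown element kind: {kind}")
    # Anchor on a word boundary: start of text or a preceding space.
    return re.compile("(?:^| )" + "".join(parts))


def matches(pattern, text):
    """Lowercase, collapse whitespace, and add a trailing space before matching."""
    normalized = " ".join(text.lower().split()) + " "
    return pattern.search(normalized) is not None
```

For instance, `compile_pattern([("word", "reset"), ("opt", "my"), ("alt", ["password", "passcode"])])` matches "Please reset my passcode now" and "reset password", but not "reset the password", since that pattern contains no gap.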
14. A computer-readable memory having computer-executable instructions that, when executed, cause one or more processors to perform a method of classification that facilitates minimizing human effort in text classification while obtaining a desired level of accuracy comprising:
training a pattern-based classifier, the training comprising:
  providing a set of training documents;
  identifying patterns in the set of training documents, the patterns being an ordered string over words comprising at least one of optional words, disjunctions and gaps;
  selecting a subset of the identified patterns that satisfy at least one of a frequency threshold and a correlation threshold; and
  determining a discriminative pattern from the subset of patterns;
running an unlabeled text item through the pattern-based classifier to determine whether one or more portions of text in the unlabeled text item match the discriminative pattern;
when the one or more portions of the text match the discriminative pattern, assigning a label that corresponds to the discriminative pattern to the text item; and
when the one or more portions of the text do not match the discriminative pattern, running the unlabeled text item through a machine learning classifier.
Dependent claims: 15, 16
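The selection step in claim 14 can be sketched as a filter over candidate patterns. The example below makes several simplifying assumptions that are not in the claim: candidates are plain word n-grams (no optional words, disjunctions or gaps), frequency is counted per document, correlation is the fraction of containing documents that carry the pattern's most common label, and both thresholds must be met (the claim requires only at least one). The threshold defaults are arbitrary.

```python
from collections import Counter, defaultdict


def candidate_ngrams(text, n=2):
    """Return the set of word n-grams in a text (one count per document)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def select_patterns(labeled_docs, n=2, min_freq=2, min_corr=0.9):
    """Keep n-grams that are frequent and strongly tied to a single label."""
    doc_freq = Counter()
    label_freq = defaultdict(Counter)
    for text, label in labeled_docs:
        for gram in candidate_ngrams(text, n):
            doc_freq[gram] += 1
            label_freq[gram][label] += 1

    selected = {}
    for gram, freq in doc_freq.items():
        label, hits = label_freq[gram].most_common(1)[0]
        if freq >= min_freq and hits / freq >= min_corr:
            selected[gram] = label
    return selected
```

The returned mapping from pattern to label could then be shown to a human reviewer or used directly as the discriminative patterns of the first stage.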
Specification