Document classification and characterization
First Claim
1. A method comprising:
receiving, by at least one data processor, data characterizing each of a plurality of documents within a document set;
grouping, by at least one data processor, the plurality of documents into a plurality of stacks using one or more grouping algorithms, wherein key words are identified in each document and weights specified by a scorecard scoring model are assigned to variables corresponding to each key word, wherein a scoring algorithm, using corresponding variables and weights, provides a score for each document which is used by the grouping algorithm when grouping the documents;
identifying, by at least one data processor, a prime document for each stack, the prime document including attributes representative of the entire stack;
providing, by at least one data processor, data characterizing documents for each stack including at least the identified prime document to at least one human reviewer;
receiving, by at least one data processor, user-generated input from the human reviewer categorizing each provided document;
providing, by at least one data processor, data characterizing the user-generated input;
evaluating, by at least one data processor, identified grouping errors using at least one of Z-test techniques and multiple regression techniques;
determining, by at least one data processor based on the evaluating, a relative contribution of the variables used by the grouping algorithm to the grouping errors; and
modifying, by at least one data processor, the grouping algorithm so that at least one weight assigned by the grouping algorithm off-sets the relative contribution of the variables used by the grouping algorithm to an error rate, wherein subsequently received documents are grouped using the modified grouping algorithm.
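The scoring, grouping, and prime-document steps recited above can be illustrated with a minimal sketch. It is not the patented implementation: the scorecard values, the whitespace tokenization, the fixed-width score bins used as the grouping rule, and the "closest to the stack's mean score" rule for choosing a prime document are all assumptions made for illustration, and every function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical scorecard: one weight per key word variable (illustrative values).
SCORECARD = {"contract": 3.0, "invoice": 2.0, "confidential": 5.0}


def score_document(text, scorecard=SCORECARD):
    """Sum the scorecard weights of the key words present in the text."""
    words = set(text.lower().split())  # assumption: simple whitespace tokenization
    return sum(weight for key_word, weight in scorecard.items() if key_word in words)


def group_into_stacks(docs, bin_width=3.0):
    """Assumed grouping rule: documents in the same score bin share a stack."""
    stacks = defaultdict(list)
    for doc_id, text in docs.items():
        stacks[int(score_document(text) // bin_width)].append(doc_id)
    return dict(stacks)


def prime_document(stack, docs):
    """Assumed rule: the prime document is the one closest to the stack's mean score."""
    scores = {doc_id: score_document(docs[doc_id]) for doc_id in stack}
    mean = sum(scores.values()) / len(scores)
    # Ties are broken by document id so the choice is deterministic.
    return min(stack, key=lambda doc_id: (abs(scores[doc_id] - mean), doc_id))


docs = {
    "a": "confidential contract draft",
    "b": "invoice for services",
    "c": "contract and invoice",
    "d": "lunch menu",
}
stacks = group_into_stacks(docs)
primes = {stack_id: prime_document(members, docs) for stack_id, members in stacks.items()}
```

Only the prime document of each stack (plus any other sampled members) would then be sent to a human reviewer, which is what keeps the review workload smaller than the full document set.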
Abstract
Data is received that characterizes each of a plurality of documents within a document set. Based on this data, the plurality of documents are grouped into a plurality of stacks using one or more grouping algorithms. A prime document is identified for each stack that includes attributes representative of the entire stack. Subsequently, data characterizing documents for each stack, including at least the identified prime document, is provided to at least one human reviewer. User-generated input from the human reviewer that categorizes each provided document is later received, and data characterizing the user-generated input can then be provided. Related apparatus, systems, techniques and articles are also described.
34 Claims
Claim 1 is recited in full under First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
19. An article of manufacture comprising:
computer executable instructions stored on non-transitory computer readable media, which, when executed by a computer, causes the computer to perform operations comprising:
receiving data characterizing each of a plurality of documents within each of a plurality of document sets;
grouping the plurality of documents into a plurality of stacks using one or more grouping algorithms, wherein key words are identified in each document and weights specified by a scorecard scoring model are assigned to variables corresponding to each key word, wherein a scoring algorithm, using corresponding variables and weights, provides a score for each document which is used by the grouping algorithm when grouping the documents;
identifying a prime document for each stack, the prime document including attributes representative of the entire stack;
providing data characterizing documents for each stack including at least the identified prime document to at least one human reviewer;
receiving user-generated input from the human reviewer categorizing each provided document;
providing data characterizing the user-generated input;
evaluating, by at least one data processor, identified grouping errors using at least one of Z-test techniques and multiple regression techniques;
determining, by at least one data processor based on the evaluating, a relative contribution of the variables used by the grouping algorithm to the grouping errors; and
modifying, by at least one data processor, the grouping algorithm so that at least one weight assigned by the grouping algorithm off-sets the relative contribution of the variables used by the grouping algorithm to an error rate, wherein documents in subsequently received sets of documents are grouped using the modified grouping algorithm.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34
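The error-evaluation and weight-modification steps can be sketched as follows. The claim names Z-test techniques but does not say which test or which update rule is used, so this sketch assumes a two-proportion Z-test comparing grouping-error rates for documents with and without a given key word, defines "relative contribution" as each variable's share of the total absolute Z statistic, and shrinks significant weights by a hypothetical `step` factor. All function names and parameters are illustrative.

```python
import math


def two_proportion_z(errors_with, n_with, errors_without, n_without):
    """Z statistic comparing error rates of documents with and without a key word.

    Assumes both groups are non-empty.
    """
    p_with = errors_with / n_with
    p_without = errors_without / n_without
    pooled = (errors_with + errors_without) / (n_with + n_without)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_with + 1 / n_without))
    return 0.0 if std_err == 0 else (p_with - p_without) / std_err


def relative_contributions(z_scores):
    """Assumed definition: each variable's share of the total absolute Z statistic."""
    total = sum(abs(z) for z in z_scores.values())
    return {k: (abs(z) / total if total else 0.0) for k, z in z_scores.items()}


def adjust_weights(weights, z_scores, step=0.5, z_critical=1.96):
    """Shrink the weight of each variable whose link to grouping errors is significant."""
    shares = relative_contributions(z_scores)
    return {
        k: w * (1 - step * shares[k]) if abs(z_scores[k]) > z_critical else w
        for k, w in weights.items()
    }


# Illustrative reviewer feedback: 30 of 100 "contract" documents were misgrouped,
# against 10 of 100 documents without that key word.
z_scores = {
    "contract": two_proportion_z(30, 100, 10, 100),
    "invoice": 0.0,
}
new_weights = adjust_weights({"contract": 3.0, "invoice": 2.0}, z_scores)
```

Subsequently received documents would then be scored with `new_weights` in place of the original scorecard weights. A multiple regression of the error indicator on the key word variables, the other technique the claim names, could replace the per-variable Z-tests to estimate contributions jointly.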
Specification