Document classification and characterization

US 8,396,871 B2
Filed: 01/26/2011
Issued: 03/12/2013
Est. Priority Date: 01/26/2011
Status: Active Grant

First Claim

1. A method comprising:

receiving, by at least one data processor, data characterizing each of a plurality of documents within a document set;

grouping, by at least one data processor, the plurality of documents into a plurality of stacks using one or more grouping algorithms, wherein key words are identified in each document and weights specified by a scorecard scoring model are assigned to variables corresponding to each key word, wherein a scoring algorithm, using corresponding variables and weights, provides a score for each document which is used by the grouping algorithm when grouping the documents;

identifying, by at least one data processor, a prime document for each stack, the prime document including attributes representative of the entire stack;

providing, by at least one data processor, data characterizing documents for each stack including at least the identified prime document to at least one human reviewer;

receiving, by at least one data processor, user-generated input from the human reviewer categorizing each provided document;

providing, by at least one data processor, data characterizing the user-generated input;

evaluating, by at least one data processor, identified grouping errors using at least one of Z-test techniques and multiple regression techniques;

determining, by at least one data processor based on the evaluating, a relative contribution of the variables used by the grouping algorithm to the grouping errors; and

modifying, by at least one data processor, the grouping algorithm so that at least one weight assigned by the grouping algorithm off-sets the relative contribution of the variables used by the grouping algorithm to an error rate, wherein subsequently received documents are grouped using the modified grouping algorithm.
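
The scoring, grouping, and prime-document steps of claim 1 can be illustrated with a minimal sketch. The claim does not specify the scoring formula, how scores map to stacks, or how the prime document is chosen. The weighted key word sum, the fixed-width score bands, and the closest-to-mean prime selection below are all assumptions made for illustration, as are the names `SCORECARD`, `score_document`, `group_into_stacks`, and `pick_prime`.

```python
from collections import defaultdict

# Hypothetical scorecard: key word -> weight (values are illustrative only).
SCORECARD = {"contract": 3.0, "invoice": 2.0, "privileged": 5.0}


def score_document(text, scorecard):
    """Weighted sum over key words: variable = occurrence count, weight from scorecard."""
    words = text.lower().split()
    return sum(weight * words.count(key) for key, weight in scorecard.items())


def group_into_stacks(docs, scorecard, band_width=5.0):
    """Assumed grouping rule: documents whose scores fall in the same band share a stack."""
    stacks = defaultdict(list)
    for doc_id, text in docs.items():
        score = score_document(text, scorecard)
        stacks[int(score // band_width)].append((doc_id, score))
    return dict(stacks)


def pick_prime(stack):
    """Assumed prime rule: the document whose score is closest to the stack mean."""
    mean = sum(score for _, score in stack) / len(stack)
    return min(stack, key=lambda item: abs(item[1] - mean))[0]


docs = {
    "d1": "contract invoice attached",
    "d2": "contract invoice invoice",
    "d3": "privileged contract contract memo",
    "d4": "contract contract",
}
stacks = group_into_stacks(docs, SCORECARD)
primes = {band: pick_prime(stack) for band, stack in stacks.items()}
```

Because each document lands in exactly one score band, the stacks in this sketch are disjoint, which matches the arrangement recited in claim 12. A real grouping algorithm would likely use a richer similarity measure than a single score band.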

Abstract

Data is received that characterizes each of a plurality of documents within a document set. Based on this data, the plurality of documents are grouped into a plurality of stacks using one or more grouping algorithms. A prime document is identified for each stack that includes attributes representative of the entire stack. Subsequently, data that characterizes documents for each stack, including at least the identified prime document, is provided to at least one human reviewer. User-generated input from the human reviewer that categorizes each provided document is later received, and data characterizing the user-generated input can then be provided. Related apparatus, systems, techniques and articles are also described.

34 Claims

1. A method comprising:
- receiving, by at least one data processor, data characterizing each of a plurality of documents within a document set;
  
  grouping, by at least one data processor, the plurality of documents into a plurality of stacks using one or more grouping algorithms, wherein key words are identified in each document and weights specified by a scorecard scoring model are assigned to variables corresponding to each key word, wherein a scoring algorithm, using corresponding variables and weights, provides a score for each document which is used by the grouping algorithm when grouping the documents;
  
  identifying, by at least one data processor, a prime document for each stack, the prime document including attributes representative of the entire stack;
  
  providing, by at least one data processor, data characterizing documents for each stack including at least the identified prime document to at least one human reviewer;
  
  receiving, by at least one data processor, user-generated input from the human reviewer categorizing each provided document;
  
  providing, by at least one data processor, data characterizing the user-generated input;
  
  evaluating, by at least one data processor, identified grouping errors using at least one of Z-test techniques and multiple regression techniques;
  
  determining, by at least one data processor based on the evaluating, a relative contribution of the variables used by the grouping algorithm to the grouping errors; and
  
  modifying, by at least one data processor, the grouping algorithm so that at least one weight assigned by the grouping algorithm off-sets the relative contribution of the variables used by the grouping algorithm to an error rate, wherein subsequently received documents are grouped using the modified grouping algorithm.
- - 2. A method as in claim 1, wherein categorization of each provided document by the user-generated input is propagated to all documents within the corresponding stack.
  - 3. A method as in claim 2, wherein documents are incrementally added to the document set after the grouping of the plurality of documents into the plurality of stacks, and wherein the method further comprises:
    - associating, by at least one data processor, the incrementally added documents to one of the plurality of stacks; and
      
      for each stack;
      
      if the stack has already been categorized, adding, by at least one data processor, the corresponding incrementally added documents to the stack and propagating the categorization to the incrementally added documents in such stack;
      
      or if the stack has not been categorized, adding, by at least one data processor, the incrementally added documents to the stack.
  - 4. A method as in claim 2, wherein at least one document is incrementally added to the document set after the grouping of the plurality of documents into the plurality of stacks, and wherein the method further comprises:
    - determining, by at least one data processor, that the at least one incrementally added document is not associated with a previously defined stack; and
      
      defining, by at least one data processor, a new stack including the at least one incrementally added document.
  - 5. A method as in claim 2, further comprising:
    - defining, by at least one data processor, hierarchical relationships among the plurality of documents within the set of documents; and
      
      wherein the grouping algorithms take into account the relationships between documents when grouping the plurality of documents into the plurality of stacks.
  - 6. A method as in claim 1, wherein the human reviewer categorizes each provided document in a group of document review categories.
  - 7. A method as in claim 6, wherein the document review categories are selected from a group comprising:
    - relevance, responsiveness, and privilege.
  - 8. A method as in claim 1, further comprising:
    - sending, by at least one data processor, data characterizing supplemental documents within a stack other than the provided documents to at least one human reviewer for quality control.
  - 9. A method as in claim 8, further comprising:
    - selecting, by at least one data processor, supplemental documents whose data is sent to at least one human reviewer for quality control based on an algorithm designed to select documents based on their likelihood to require remediation.
  - 10. A method as in claim 8, further comprising:
    - defining, by at least one data processor, tiers of documents within each stack, and wherein the supplemental documents comprise documents from one or more tiers.
  - 11. A method as in claim 10, wherein the tiers are based on one or more of:
    - document similarity relative to the corresponding prime document, document type, document author, document sender, document recipient.
  - 12. A method as in claim 1, wherein the documents in the stacks are disjoint.
  - 13. A method as in claim 1, wherein providing the data comprises one or more of:
    - displaying the data, transmitting the data to a remote computing system, and persisting the data.
  - 14. A method as in claim 1, wherein the data characterizing documents for each stack provided to the at least one human reviewer comprise reference numbers for such documents.
  - 15. A method as in claim 1, wherein the data characterizing documents for each stack provided to the at least one human reviewer comprise digitally scanned versions of such documents.
  - 16. A method as in claim 1, wherein the grouping algorithm further uses characteristics of each document for the grouping, the characteristics being selected from a group consisting of:
    - document metadata, location, and date.
  - 17. A method as in claim 1, further comprising:
    - identifying grouping errors using confidence level testing.
  - 18. A method as in claim 17, wherein the confidence level testing comprises a Z-test.
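
Claims 2 through 4 describe how a reviewer's categorization spreads through a stack and how later-arriving documents are handled. The sketch below is one possible reading of those claims, not the patented implementation. The `Stack` class, the `find_stack` matcher, the `new-` key prefix, and the string category labels are hypothetical; the claims do not say how an incrementally added document is matched to a stack.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Stack:
    prime: str
    doc_ids: List[str] = field(default_factory=list)
    category: Optional[str] = None  # None until a reviewer categorizes the stack


def propagate(stack: Stack, category: str, labels: Dict[str, str]) -> None:
    """Claim 2: the reviewer's category applies to every document in the stack."""
    stack.category = category
    for doc_id in stack.doc_ids:
        labels[doc_id] = category


def add_incremental(doc_id, stacks, labels, find_stack):
    """Claims 3 and 4: add to a matching stack, or define a new stack."""
    key = find_stack(doc_id)  # hypothetical matcher; returns None if no stack fits
    if key is None or key not in stacks:
        key = f"new-{doc_id}"
        stacks[key] = Stack(prime=doc_id, doc_ids=[doc_id])
        return key
    stack = stacks[key]
    stack.doc_ids.append(doc_id)
    if stack.category is not None:
        # Stack already categorized: the new document inherits the category.
        labels[doc_id] = stack.category
    return key


stacks = {"s1": Stack("d1", ["d1", "d2"]), "s2": Stack("d3", ["d3"])}
labels: Dict[str, str] = {}
propagate(stacks["s1"], "responsive", labels)

matcher = {"d4": "s1", "d5": "s2"}.get  # stand-in for a real similarity lookup
add_incremental("d4", stacks, labels, matcher)  # categorized stack: label propagates
add_incremental("d5", stacks, labels, matcher)  # uncategorized stack: no label yet
add_incremental("d6", stacks, labels, matcher)  # no matching stack: new stack defined
```

In this sketch a document added to an uncategorized stack simply waits; it would receive its label when `propagate` is later called on that stack.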

19. An article of manufacture comprising:
- computer executable instructions stored on non-transitory computer readable media, which, when executed by a computer, cause the computer to perform operations comprising:
  
  receiving data characterizing each of a plurality of documents within each of a plurality of document sets;
  
  grouping the plurality of documents into a plurality of stacks using one or more grouping algorithms, wherein key words are identified in each document and weights specified by a scorecard scoring model are assigned to variables corresponding to each key word, wherein a scoring algorithm, using corresponding variables and weights, provides a score for each document which is used by the grouping algorithm when grouping the documents;
  
  identifying a prime document for each stack, the prime document including attributes representative of the entire stack;
  
  providing data characterizing documents for each stack including at least the identified prime document to at least one human reviewer;
  
  receiving user-generated input from the human reviewer categorizing each provided document;
  
  providing data characterizing the user-generated input;
  
  evaluating, by at least one data processor, identified grouping errors using at least one of Z-test techniques and multiple regression techniques;
  
  determining, by at least one data processor based on the evaluating, a relative contribution of the variables used by the grouping algorithm to the grouping errors; and
  
  modifying, by at least one data processor, the grouping algorithm so that at least one weight assigned by the grouping algorithm off-sets the relative contribution of the variables used by the grouping algorithm to an error rate, wherein documents in subsequently received sets of documents are grouped using the modified grouping algorithm.
- View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34)
- - 20. An article as in claim 19, wherein categorization of each provided document by the user-generated input is propagated to all documents within the corresponding stack.
  - 21. An article as in claim 20, wherein documents are incrementally added to the document set after the grouping of the plurality of documents into the plurality of stacks, and wherein the operations further comprise:
    - associating the incrementally added documents to one of the plurality of stacks; and
      
      for each stack;
      
      if the stack has already been categorized, adding the corresponding incrementally added documents to the stack and propagating the categorization to the incrementally added documents in such stack;
      
      or if the stack has not been categorized, adding the incrementally added documents to the stack.
  - 22. An article as in claim 20, wherein at least one document is incrementally added to the document set after the grouping of the plurality of documents into the plurality of stacks, and wherein the operations further comprise:
    - determining that the at least one incrementally added document is not associated with a previously defined stack; and
      
      defining a new stack including the at least one incrementally added document.
  - 23. An article as in claim 20, wherein the operations further comprise:
    - defining hierarchical relationships among the plurality of documents within the set of documents; and
      
      wherein the grouping algorithms take into account the relationships between documents when grouping the plurality of documents into the plurality of stacks.
  - 24. An article as in claim 19, wherein the human reviewer categorizes each provided document in a group of document review categories.
  - 25. An article as in claim 24, wherein the document review categories are selected from a group comprising:
    - relevance, responsiveness, and privilege.
  - 26. An article as in claim 19, wherein the operations further comprise:
    - sending data characterizing supplemental documents within a stack other than the provided documents to at least one human reviewer for quality control.
  - 27. An article as in claim 26, wherein the operations further comprise:
    - selecting supplemental documents whose data is sent to at least one human reviewer for quality control based on an algorithm designed to select documents based on their likelihood to require remediation.
  - 28. An article as in claim 26, wherein the operations further comprise:
    - defining tiers of documents within each stack, and wherein the supplemental documents comprise documents from two or more tiers.
  - 29. An article as in claim 28, wherein the tiers are based on one or more of:
    - document similarity relative to the corresponding prime document, document type, document author, document sender, document recipient.
  - 30. An article as in claim 19, wherein the documents in the stacks are disjoint.
  - 31. An article as in claim 19, wherein providing the data comprises one or more of:
    - displaying the data, transmitting the data to a remote computing system, and persisting the data.
  - 32. An article as in claim 19, wherein the data characterizing documents for each stack provided to the at least one human reviewer comprise reference numbers for such documents.
  - 33. An article as in claim 19, wherein the data characterizing documents for each stack provided to the at least one human reviewer comprise digitally scanned versions of such documents.
  - 34. The article as in claim 19, further comprising:
    - identifying grouping errors using confidence level testing, wherein the confidence level testing comprises a Z-test.
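
Claims 17, 18, and 34 recite identifying grouping errors with confidence level testing that comprises a Z-test, without saying which form of Z-test is used. As a hedged illustration, the sketch below applies a one-sided, one-proportion Z-test to a quality-control sample from a stack: it asks whether the observed rate of miscategorized documents is significantly above a tolerated baseline. The function names, the 5% baseline rate, and the 1.645 critical value (roughly a 95% one-sided confidence level) are assumptions, not values from the patent.

```python
import math


def one_proportion_z(errors, sample_size, baseline_rate):
    """z statistic for the null hypothesis that the true error rate equals baseline_rate."""
    observed = errors / sample_size
    std_err = math.sqrt(baseline_rate * (1 - baseline_rate) / sample_size)
    return (observed - baseline_rate) / std_err


def flag_stack(errors, sample_size, baseline_rate=0.05, z_critical=1.645):
    """Flag a stack whose sampled error rate is significantly above the baseline."""
    return one_proportion_z(errors, sample_size, baseline_rate) > z_critical


# 12 miscategorized documents in a sample of 100 is well above a 5% baseline.
high_error_stack = flag_stack(errors=12, sample_size=100)
# 6 in 100 is above 5% but not significantly so at this confidence level.
borderline_stack = flag_stack(errors=6, sample_size=100)
```

A flagged stack would then be a candidate for the evaluating and weight-modifying steps of the independent claims; how the weights are adjusted is not sketched here because the claims leave that calculation unspecified.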

Current Assignee: Consilio LLC
Original Assignee: DiscoverReady LLC (Consilio LLC)
Inventors: Barsony, Stephen John; Messing, Yerachmiel Tzvi; Shub, David Matthew; Wagner, James Kenneth Jr.
Primary Examiner(s): PYO, MONICA M
Application Number: US13/014,643
Publication Number: US 20120191708A1
Time in Patent Office: 776 Days
Field of Search: 707/737-740, 707/752, 707/754, 704/9
US Class Current: 707/737
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
G06F 16/358   Browsing; Visualisation the...
G06Q 10/00   Administration; Management
