Document classification and characterization using human judgment, tiered similarity analysis and language/concept analysis

US 10,467,252 B1
Filed: 01/30/2013
Issued: 11/05/2019
Est. Priority Date: 01/30/2012
Status: Active Grant

Abstract

Systems, methods, and articles are provided for characterizing and defining groups within large corpuses of documents using a combination of one or more of human judgment, tiered similarity analysis techniques, and language/concept analysis. Related apparatus, systems, techniques and articles are also described.

14 Claims

1. A method comprising:
- receiving a corpus of documents;
  
  characterizing similarities among the corpus of documents using at least three similarity algorithms having different similarity criteria, the characterizing comprising:
  
  obtaining contextual characteristics for each of the corpus of documents and associating the contextual characteristics with the corresponding document, the contextual characteristics selected from a group consisting of: similarity score, type of similarity algorithm used to characterize the document, document family, document type, and metadata describing properties of the document;
  
  first removing a first portion of the corpus of documents based on applying a first similarity algorithm to the corpus of documents;
  
  second removing, after the first removing, a second portion of the corpus of documents based on applying a second similarity algorithm to the corpus of documents; and
  
  third removing, after the first removing and the second removing, a third portion of the corpus of documents based on applying a third similarity algorithm to the corpus of documents, the third similarity algorithm based on a criteria other than that implemented by the first similarity algorithm and the second similarity algorithm, wherein the third similarity algorithm identifies conceptually similar documents in the corpus of documents based on content of each respective document, and wherein the conceptually similar documents are neither exact duplicates nor substantial duplicates;
  
  defining stacks of documents based on pre-defined grouping criteria as applied to the characterized similarities among the corpus of documents, the characterized similarities based on the first removing, the second removing, or the third removing;
  
  identifying, within each stack, a prime document; and
  
  initiating provision of each prime document to at least one human reviewer via a computer-implemented document review and characterization system.
  - 2. The method as in claim 1, wherein at least one of the similarity algorithms identifies exact duplicates in the corpus of documents.
  - 3. The method as in claim 1, wherein at least one of the similarity algorithms identifies substantial duplicates in the corpus of documents.
  - 4. The method as in claim 1, wherein the pre-defined grouping criteria is adjusted based on quality control review of documents within the stack other than the corresponding prime document.
  - 5. The method as in claim 1, further comprising:
    - receiving data characterizing quality control review of at least a portion of the corpus of documents, the received data being used to modify the pre-defined grouping criteria to either increase or decrease one or more similarity metrics.
  - 6. The method as in claim 1, wherein at least a portion of the corpus of documents comprise families of documents, the families of documents having a pre-defined logical interrelation.
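
The three removal steps, the stacks, and the prime documents recited in claim 1 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the three scoring functions (a normalized-text hash, Jaccard overlap of word shingles, and cosine similarity of word counts as a crude stand-in for concept analysis), the thresholds, the names `build_stacks` and `TIERS`, and the rule that the first surviving document of a stack serves as its prime document.

```python
import hashlib
import math
from collections import Counter

def exact_score(a, b):
    """Tier 1 (assumed): 1.0 when whitespace-normalized texts hash equally."""
    digest = lambda t: hashlib.sha256(" ".join(t.split()).encode("utf-8")).hexdigest()
    return 1.0 if digest(a) == digest(b) else 0.0

def near_duplicate_score(a, b, k=3):
    """Tier 2 (assumed): Jaccard overlap of k-word shingles."""
    def shingles(text):
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def conceptual_score(a, b):
    """Tier 3 (assumed): cosine similarity of word-count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Hypothetical tier order and thresholds; each tier uses a different criterion.
TIERS = [
    ("exact_duplicate", exact_score, 1.0),
    ("near_duplicate", near_duplicate_score, 0.8),
    ("conceptual", conceptual_score, 0.5),
]

def build_stacks(docs, tiers=TIERS):
    """docs maps doc id -> text. Returns (stacks, context).

    stacks maps each prime document id to the ids in its stack;
    context records, per removed document, the matched document,
    the algorithm type, and the similarity score.
    """
    remaining = list(docs)
    context = {}
    for name, score_fn, threshold in tiers:
        survivors = []
        for doc_id in remaining:
            for kept in survivors:
                score = score_fn(docs[kept], docs[doc_id])
                if score >= threshold:
                    # Remove doc_id from further tiers and remember why.
                    context[doc_id] = {"matched": kept, "algorithm": name, "score": score}
                    break
            else:
                survivors.append(doc_id)
        remaining = survivors
    # Documents that survive every tier become the prime documents.
    stacks = {prime: [prime] for prime in remaining}
    for doc_id in docs:
        if doc_id in context:
            root = doc_id
            while root in context:  # follow matches back to a surviving prime
                root = context[root]["matched"]
            stacks[root].append(doc_id)
    return stacks, context
```

In this sketch only the keys of `stacks` would be routed to a human reviewer. The pairwise comparison is quadratic in the number of surviving documents, which is acceptable for illustration but would need indexing or hashing tricks on a large corpus.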

7. A method comprising:
- receiving a corpus of documents;
  
  obtaining contextual characteristics for each of the corpus of documents and associating the contextual characteristics with the corresponding document, the contextual characteristics selected from a group consisting of: similarity score, type of similarity algorithm used to characterize the document, document family, document type, and metadata describing properties of the document;
  
  generating a first subset of the corpus of documents by identifying and characterizing similarities among the corpus of documents based on applying a first similarity algorithm to the corpus of documents;
  
  generating a second subset of the corpus of documents by identifying and characterizing similarities among the first subset of the corpus of documents based on applying a second similarity algorithm to the first subset of the corpus of documents, the second similarity algorithm having a relaxed similarity standard as compared to the first similarity algorithm; and
  
  generating a third subset of the corpus of documents by identifying and characterizing similarities among the corpus of documents based on applying a third similarity algorithm to the second subset of the corpus of documents, the third similarity algorithm having a similarity standard as other than that implemented by the first similarity algorithm and the second similarity algorithm, wherein the third similarity algorithm identifies conceptually similar documents in the corpus of documents based on content of each respective document, and wherein the conceptually similar documents are neither exact duplicates nor substantial duplicates;
  
  defining stacks of documents based on pre-defined grouping criteria as applied to the second subset of the corpus of documents and the third subset of the corpus of documents;
  
  identifying, within each stack, a prime document; and
  
  initiating provision of each prime document to at least one human reviewer via a computer-implemented document review and characterization system.
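
Claim 7, like claim 1, recites identifying a prime document within each stack but does not say in this text how that document is chosen. One possible rule, offered purely as an assumption, is to pick the stack's medoid: the document with the highest average similarity to the other members. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure; the function names `similarity` and `pick_prime` are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio in [0, 1] (an assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()

def pick_prime(stack):
    """Return the id of the medoid of a stack.

    stack maps doc id -> text. The medoid is the document whose mean
    similarity to every other document in the stack is highest.
    """
    ids = sorted(stack)
    if not ids:
        raise ValueError("stack is empty")
    if len(ids) == 1:
        return ids[0]

    def mean_similarity(doc_id):
        others = [other for other in ids if other != doc_id]
        return sum(similarity(stack[doc_id], stack[o]) for o in others) / len(others)

    return max(ids, key=mean_similarity)
```

A medoid is a reasonable candidate because a reviewer's decision on the most central document is the one most likely to carry over to the rest of the stack, but other rules (earliest date, longest text, parent of a document family) would fit the claim language equally well.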

8. A non-transitory computer program product storing instructions, which when executed by at least one data processor of at least one computing system, result in operations comprising:
- receiving a corpus of documents;
  
  characterizing similarities among the corpus of documents using at least three similarity algorithms having different similarity criteria, the characterizing comprising:
  
  obtaining contextual characteristics for each of the corpus of documents and associating the contextual characteristics with the corresponding document, the contextual characteristics selected from a group consisting of: similarity score, type of similarity algorithm used to characterize the document, document family, document type, and metadata describing properties of the document;
  
  first removing a first portion of the corpus of documents based on applying a first similarity algorithm to the corpus of documents; and
  
  second removing, after the first removing, a second portion of the corpus of documents based on applying a second similarity algorithm to the corpus of documents; and
  
  third removing, after the first removing and the second removing, a third portion of the corpus of documents based on applying a third similarity algorithm to the corpus of documents, the third similarity algorithm based on a criteria other than that implemented by the first similarity algorithm and the second similarity algorithm, wherein the third similarity algorithm identifies conceptually similar documents in the corpus of documents based on content of each respective document, and wherein the conceptually similar documents are neither exact duplicates nor substantial duplicates;
  
  defining stacks of documents based on pre-defined grouping criteria as applied to the characterized similarities among the corpus of documents, the characterized similarities based on the first removing, the second removing, or the third removing;
  
  identifying, within each stack, a prime document; and
  
  initiating provision of each prime document to at least one human reviewer via a computer-implemented document review and characterization system.
  - 9. The non-transitory computer program product as in claim 8, wherein at least one of the similarity algorithms identifies exact duplicates in the corpus of documents.
  - 10. The non-transitory computer program product as in claim 8, wherein at least one of the similarity algorithms identifies substantial duplicates in the corpus of documents.
  - 11. The non-transitory computer program product as in claim 8, wherein the pre-defined grouping criteria is adjusted based on quality control review of documents within the stack other than the corresponding prime document.
  - 12. The non-transitory computer program product as in claim 8, wherein the operations further comprise:
    - receiving data characterizing quality control review of at least a portion of the corpus of documents, the received data being used to modify the pre-defined grouping criteria to either increase or decrease one or more similarity metrics.
  - 13. The non-transitory computer program product as in claim 8, wherein at least a portion of the corpus of documents comprise families of documents, the families of documents having a pre-defined logical interrelation.
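
Claims 5 and 12 describe quality control review data being used to raise or lower a similarity metric in the grouping criteria. A minimal sketch of such a feedback rule follows. The function name `adjust_threshold`, the error tolerance, the step size, and the rule itself are all assumptions for illustration: each QC result is taken to be `True` when a reviewed non-prime document agreed with its prime document's characterization and `False` otherwise.

```python
def adjust_threshold(threshold, qc_agreements, max_error=0.05, step=0.05):
    """Return an updated similarity threshold in [0, 1].

    qc_agreements: list of booleans from quality control review.
    If disagreements exceed max_error, the threshold rises so stacks
    become tighter; if they fall below half of max_error, it drops so
    stacks can grow; otherwise it is left unchanged.
    """
    if not qc_agreements:
        return threshold  # no QC data, nothing to adjust
    error_rate = qc_agreements.count(False) / len(qc_agreements)
    if error_rate > max_error:
        threshold += step
    elif error_rate < max_error / 2:
        threshold -= step
    # Clamp to the valid range and round away float noise.
    return round(min(1.0, max(0.0, threshold)), 4)
```

For example, under these assumed defaults, two disagreements among ten reviewed documents (a 20% error rate) would move a threshold of 0.5 up by one step, while a clean sample of one hundred documents would move it down by one step.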

14. A system comprising:
- at least one data processor; and
  
  memory storing instructions, which when executed by the at least one data processor, result in operations comprising:
  
  receiving a corpus of documents;
  
  characterizing similarities among the corpus of documents using at least three similarity algorithms having different similarity criteria, the characterizing comprising:
  
  obtaining contextual characteristics for each of the corpus of documents and associating the contextual characteristics with the corresponding document, the contextual characteristics selected from a group consisting of: similarity score, type of similarity algorithm used to characterize the document, document family, document type, and metadata describing properties of the document;
  
  first removing a first portion of the corpus of documents based on applying a first similarity algorithm to the corpus of documents;
  
  second removing, after the first removing, a second portion of the corpus of documents based on applying a second similarity algorithm to the corpus of documents; and
  
  third removing, after the first removing and the second removing, a third portion of the corpus of documents based on applying a third similarity algorithm to the corpus of documents, the third similarity algorithm based on a criteria other than that implemented by the first similarity algorithm and the second similarity algorithm, wherein the third similarity algorithm identifies conceptually similar documents in the corpus of documents based on content of each respective document, and wherein the conceptually similar documents are neither exact duplicates nor substantial duplicates;
  
  defining stacks of documents based on pre-defined grouping criteria as applied to the characterized similarities among the corpus of documents, the characterized similarities based on the first removing, the second removing, or the third removing;
  
  identifying, within each stack, a prime document; and
  
  initiating provision of each prime document to at least one human reviewer via a computer-implemented document review and characterization system.

Current Assignee: Consilio LLC
Original Assignee: DiscoverReady LLC (Consilio LLC)
Inventors: Barsony, Stephen John; Messing, Yerachmiel Tzvi; Shub, David Matthew; Richards, Philip L.; Schreiber, Stephen H.
Primary Examiner(s): Pham, Michael
Application Number: US13/754,780
Time in Patent Office: 2,470 Days
Field of Search: 707/737
US Class Current
CPC Class Codes:
G06F 16/285 Clustering or classification
G06F 16/35 Clustering; Classification
