Document classification and characterization using human judgment, tiered similarity analysis and language/concept analysis
First Claim
1. A method comprising:
receiving a corpus of documents;
characterizing similarities among the corpus of documents using at least three similarity algorithms having different similarity criteria, the characterizing comprising:
obtaining contextual characteristics for each of the corpus of documents and associating the contextual characteristics with the corresponding document, the contextual characteristics selected from a group consisting of:
similarity score, type of similarity algorithm used to characterize the document, document family, document type, and metadata describing properties of the document;
first removing a first portion of the corpus of documents based on applying a first similarity algorithm to the corpus of documents;
second removing, after the first removing, a second portion of the corpus of documents based on applying a second similarity algorithm to the corpus of documents; and
third removing, after the first removing and the second removing, a third portion of the corpus of documents based on applying a third similarity algorithm to the corpus of documents, the third similarity algorithm based on a criteria other than that implemented by the first similarity algorithm and the second similarity algorithm, wherein the third similarity algorithm identifies conceptually similar documents in the corpus of documents based on content of each respective document, and wherein the conceptually similar documents are neither exact duplicates nor substantial duplicates;
defining stacks of documents based on pre-defined grouping criteria as applied to the characterized similarities among the corpus of documents, the characterized similarities based on the first removing, the second removing, or the third removing;
identifying, within each stack, a prime document; and
initiating provision of each prime document to at least one human reviewer via a computer-implemented document review and characterization system.
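The tiered removal and stacking recited in claim 1 can be illustrated with a minimal Python sketch. The claim does not name any particular algorithm, so every choice here is an assumption made for illustration: a SHA-256 content hash for the first tier, Jaccard overlap of word 3-grams for the second, cosine similarity of content-word counts for the third, the thresholds, the stopword list, and the rule that the first-seen document in a stack serves as its prime. All function names are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter

STOPWORDS = {"a", "and", "in", "of", "the", "this", "to", "was"}  # illustrative only

def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def shingles(text, k=3):
    w = words(text)
    return {tuple(w[i:i + k]) for i in range(len(w) - k + 1)}

def exact_match(a, b):
    # Tier 1 (assumed): identical content hashes.
    same = hashlib.sha256(a.encode()).digest() == hashlib.sha256(b.encode()).digest()
    return 1.0 if same else None

def near_match(a, b, threshold=0.7):
    # Tier 2 (assumed): Jaccard overlap of word 3-grams.
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    score = len(sa & sb) / len(union) if union else 0.0
    return score if score >= threshold else None

def concept_match(a, b, threshold=0.5):
    # Tier 3 (assumed): cosine similarity of content-word counts.
    ca = Counter(w for w in words(a) if w not in STOPWORDS)
    cb = Counter(w for w in words(b) if w not in STOPWORDS)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    score = dot / norm if norm else 0.0
    return score if score >= threshold else None

TIERS = [("exact", exact_match), ("near", near_match), ("concept", concept_match)]

def build_stacks(corpus):
    """Map each prime document id to the (doc_id, tier, score) entries stacked under it."""
    stacks = {doc_id: [] for doc_id in corpus}
    remaining = list(corpus)
    for tier, match in TIERS:
        primes = []
        for doc_id in remaining:
            for prime in primes:
                score = match(corpus[doc_id], corpus[prime])
                if score is not None:
                    # Remove the document from the working set and record
                    # which tier matched it and with what score.
                    stacks[prime].append((doc_id, tier, round(score, 3)))
                    stacks[prime].extend(stacks.pop(doc_id))
                    break
            else:
                primes.append(doc_id)
        remaining = primes  # the next tier sees only the survivors
    return stacks

corpus = {
    "a": "The quarterly revenue report shows strong growth in Europe.",
    "b": "The quarterly revenue report shows strong growth in Europe.",
    "c": "The quarterly revenue report shows strong growth in Europe and Asia.",
    "d": "Revenue growth in Europe was strong this quarter, the report says.",
    "e": "Lunch menu: soup, salad and sandwiches.",
}
stacks = build_stacks(corpus)
review_queue = list(stacks)  # prime documents to route to a human reviewer
```

Because each tier runs only on the survivors of the previous one, any document matched by the third tier in this sketch is necessarily neither an exact nor a near duplicate of its prime, which mirrors the limitation in the claim. The tier label and score stored with each stacked document correspond to two of the recited contextual characteristics.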
Abstract
Systems, methods, and articles are provided for characterizing and defining groups within large corpuses of documents using a combination of one or more of human judgment, tiered similarity analysis techniques, and language/concept analysis. Related apparatus, systems, techniques and articles are also described.
14 Claims
7. A method comprising:
receiving a corpus of documents;
obtaining contextual characteristics for each of the corpus of documents and associating the contextual characteristics with the corresponding document, the contextual characteristics selected from a group consisting of:
similarity score, type of similarity algorithm used to characterize the document, document family, document type, and metadata describing properties of the document;
generating a first subset of the corpus of documents by identifying and characterizing similarities among the corpus of documents based on applying a first similarity algorithm to the corpus of documents;
generating a second subset of the corpus of documents by identifying and characterizing similarities among the first subset of the corpus of documents based on applying a second similarity algorithm to the first subset of the corpus of documents, the second similarity algorithm having a relaxed similarity standard as compared to the first similarity algorithm; and
generating a third subset of the corpus of documents by identifying and characterizing similarities among the corpus of documents based on applying a third similarity algorithm to the second subset of the corpus of documents, the third similarity algorithm having a similarity standard as other than that implemented by the first similarity algorithm and the second similarity algorithm, wherein the third similarity algorithm identifies conceptually similar documents in the corpus of documents based on content of each respective document, and wherein the conceptually similar documents are neither exact duplicates nor substantial duplicates;
defining stacks of documents based on pre-defined grouping criteria as applied to the second subset of the corpus of documents and the third subset of the corpus of documents;
identifying, within each stack, a prime document; and
initiating provision of each prime document to at least one human reviewer via a computer-implemented document review and characterization system.
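Claim 7 frames the same idea as a cascade of subsets, with the second pass using a relaxed standard and each document carrying contextual characteristics. The sketch below is one hedged reading of that structure, not the patented implementation. It assumes Python's difflib.SequenceMatcher ratio as the measure for the first two passes (thresholds 1.0 and 0.9) and a vocabulary-overlap measure for the third. The Characteristics fields, the matched_to field, and all function names are hypothetical.

```python
import re
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Optional

@dataclass
class Characteristics:
    """Contextual characteristics kept per document (field names are illustrative)."""
    doc_type: str = "unknown"
    family: Optional[str] = None
    algorithm: Optional[str] = None           # pass that matched the document
    similarity_score: Optional[float] = None
    matched_to: Optional[str] = None          # hypothetical: id of the surviving document
    metadata: dict = field(default_factory=dict)

def char_ratio(a, b):
    # Character-level similarity in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()

def vocab_overlap(a, b):
    # A different criterion: Jaccard overlap of words longer than three characters.
    wa = {w for w in re.findall(r"[a-z0-9]+", a.lower()) if len(w) > 3}
    wb = {w for w in re.findall(r"[a-z0-9]+", b.lower()) if len(w) > 3}
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def generate_subset(ids, texts, chars, name, threshold, similarity):
    """Return the ids that survive one pass; matched ids get their characteristics set."""
    kept = []
    for doc_id in ids:
        score, match = max(
            ((similarity(texts[doc_id], texts[k]), k) for k in kept),
            default=(0.0, None),
        )
        if match is not None and score >= threshold:
            c = chars[doc_id]
            c.algorithm, c.similarity_score, c.matched_to = name, score, match
        else:
            kept.append(doc_id)
    return kept

texts = {
    "a": "Invoice 1001 is due on Friday.",
    "b": "Invoice 1001 is due on Friday.",
    "c": "Invoice 1001 is due on Friday!",
    "d": "Payment for invoice 1001 falls due this Friday.",
    "e": "Team lunch moved to noon.",
}
chars = {doc_id: Characteristics() for doc_id in texts}

# Each pass is applied to the subset produced by the previous pass.
first = generate_subset(list(texts), texts, chars, "strict", 1.0, char_ratio)
second = generate_subset(first, texts, chars, "relaxed", 0.9, char_ratio)
third = generate_subset(second, texts, chars, "concept", 0.4, vocab_overlap)
```

Grouping documents by their matched_to value would then give one possible way to define stacks over the second and third subsets, with the surviving document acting as the prime.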
8. A non-transitory computer program product storing instructions, which when executed by at least one data processor of at least one computing system, result in operations comprising:
receiving a corpus of documents;
characterizing similarities among the corpus of documents using at least three similarity algorithms having different similarity criteria, the characterizing comprising:
obtaining contextual characteristics for each of the corpus of documents and associating the contextual characteristics with the corresponding document, the contextual characteristics selected from a group consisting of:
similarity score, type of similarity algorithm used to characterize the document, document family, document type, and metadata describing properties of the document;
first removing a first portion of the corpus of documents based on applying a first similarity algorithm to the corpus of documents; and
second removing, after the first removing, a second portion of the corpus of documents based on applying a second similarity algorithm to the corpus of documents; and
third removing, after the first removing and the second removing, a third portion of the corpus of documents based on applying a third similarity algorithm to the corpus of documents, the third similarity algorithm based on a criteria other than that implemented by the first similarity algorithm and the second similarity algorithm, wherein the third similarity algorithm identifies conceptually similar documents in the corpus of documents based on content of each respective document, and wherein the conceptually similar documents are neither exact duplicates nor substantial duplicates;
defining stacks of documents based on pre-defined grouping criteria as applied to the characterized similarities among the corpus of documents, the characterized similarities based on the first removing, the second removing, or the third removing;
identifying, within each stack, a prime document; and
initiating provision of each prime document to at least one human reviewer via a computer-implemented document review and characterization system.
14. A system comprising:
at least one data processor; and
memory storing instructions, which when executed by the at least one data processor, result in operations comprising:
receiving a corpus of documents;
characterizing similarities among the corpus of documents using at least three similarity algorithms having different similarity criteria, the characterizing comprising:
obtaining contextual characteristics for each of the corpus of documents and associating the contextual characteristics with the corresponding document, the contextual characteristics selected from a group consisting of:
similarity score, type of similarity algorithm used to characterize the document, document family, document type, and metadata describing properties of the document;
first removing a first portion of the corpus of documents based on applying a first similarity algorithm to the corpus of documents;
second removing, after the first removing, a second portion of the corpus of documents based on applying a second similarity algorithm to the corpus of documents; and
third removing, after the first removing and the second removing, a third portion of the corpus of documents based on applying a third similarity algorithm to the corpus of documents, the third similarity algorithm based on a criteria other than that implemented by the first similarity algorithm and the second similarity algorithm, wherein the third similarity algorithm identifies conceptually similar documents in the corpus of documents based on content of each respective document, and wherein the conceptually similar documents are neither exact duplicates nor substantial duplicates;
defining stacks of documents based on pre-defined grouping criteria as applied to the characterized similarities among the corpus of documents, the characterized similarities based on the first removing, the second removing, or the third removing;
identifying, within each stack, a prime document; and
initiating provision of each prime document to at least one human reviewer via a computer-implemented document review and characterization system.
Specification