Assisted learning for document classification
First Claim
1. A method comprising steps of:
querying a user to identify one or more true positive documents in relation to a sample document;
querying the user to identify a document repository that contains documents similar to the sample document;
implementing a document classification algorithm to:
analyze a collection of documents within the document repository to identify a set of multiple documents corresponding to the sample document, wherein said analyzing comprises parsing the collection of documents within the document repository to the set of documents based on one or more keywords present in the sample document and in each document of the set;
present at least a portion of the set of multiple documents to the user for user classification, wherein said user classification comprises manual classification of each document in the at least a portion of the set of multiple documents as one of (i) a true positive document in relation to the sample document and (ii) a false positive document in relation to the sample document;
calculating a confidence measure based on the user classification of the at least a portion of the set of multiple documents, wherein said confidence measure corresponds to a level of accuracy by which the document classification algorithm detects one or more documents related to the sample document as compared to the user classification;
querying the user as to whether the document classification algorithm is to be deployed based on sufficiency of the calculated confidence measure, as determined by the user; and
deploying the document classification algorithm upon an affirmative response from the user in response to said querying as to whether the document classification algorithm is to be deployed;
wherein the steps are carried out by at least one computer device.
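The keyword-based analysis step recited above can be illustrated with a minimal Python sketch. The claim does not say how keywords are extracted or matched, so the tokenizer, the stop-word list, the `min_shared` threshold, and the function names `extract_keywords` and `select_candidates` are all assumptions made for illustration, not the patented implementation.

```python
import re

# Hypothetical stop-word list; the claim does not define how keywords are chosen.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

def extract_keywords(text):
    """Return the set of lowercase word tokens in text, minus stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def select_candidates(sample, repository, min_shared=1):
    """Keep repository documents sharing at least min_shared keywords with sample.

    repository maps a document id to its text. min_shared is an assumed
    tuning knob, not something the claim specifies.
    """
    sample_keywords = extract_keywords(sample)
    candidates = {}
    for doc_id, text in repository.items():
        shared = sample_keywords & extract_keywords(text)
        if len(shared) >= min_shared:
            candidates[doc_id] = shared
    return candidates

# Toy repository (invented data) to exercise the filter.
repository = {
    "doc1": "Invoice for consulting services, payment due in 30 days",
    "doc2": "Meeting notes on the quarterly roadmap",
    "doc3": "Payment reminder: invoice 1042 is overdue",
}
candidates = select_candidates("Invoice payment terms", repository, min_shared=2)
print(sorted(candidates))  # ['doc1', 'doc3']
```

In this sketch the resulting candidate set is what would be presented to the user for manual classification as true positive or false positive.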
Abstract
Methods, apparatus and articles of manufacture for assisted learning for document classification are provided herein. A method includes analyzing a collection of documents within a document repository to identify a set of multiple documents corresponding to a sample document, presenting at least a portion of the set of multiple documents to a user for user classification, and calculating a confidence measure based on the user classification of the at least a portion of the set of multiple documents, wherein said confidence measure corresponds to a level of accuracy by which a document classification algorithm detects one or more documents related to the sample document.
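One plausible reading of the confidence measure is the fraction of algorithm-selected documents that the user marks as true positives. The sketch below assumes that reading. The source only states that the measure corresponds to the algorithm's accuracy relative to the user classification, so the formula, the function names `confidence_measure` and `should_deploy`, and the accepted response strings are hypothetical.

```python
def confidence_measure(user_labels):
    """Fraction of algorithm-selected documents the user marked as true positives.

    user_labels maps a document id to True (true positive) or False
    (false positive). Using this fraction (precision) is an assumption;
    the source only says the measure reflects accuracy versus the user.
    """
    if not user_labels:
        raise ValueError("at least one user-classified document is required")
    true_positives = sum(1 for is_match in user_labels.values() if is_match)
    return true_positives / len(user_labels)

def should_deploy(user_response):
    """Treat 'y' or 'yes' as the affirmative response (an assumed convention)."""
    return user_response.strip().lower() in {"y", "yes"}

# Invented labels: the user reviewed four documents and rejected one.
labels = {"doc1": True, "doc3": True, "doc7": False, "doc9": True}
score = confidence_measure(labels)
print(f"confidence: {score:.2f}")  # 3 of 4 marked true positive -> 0.75
```

The sketch leaves the sufficiency decision to the user rather than comparing the score against a fixed threshold, which mirrors the claims: the user judges whether the measure is good enough and the algorithm is deployed only on an affirmative answer.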
20 Claims
1. A method comprising steps of:
querying a user to identify one or more true positive documents in relation to a sample document;
querying the user to identify a document repository that contains documents similar to the sample document;
implementing a document classification algorithm to:
analyze a collection of documents within the document repository to identify a set of multiple documents corresponding to the sample document, wherein said analyzing comprises parsing the collection of documents within the document repository to the set of documents based on one or more keywords present in the sample document and in each document of the set;
present at least a portion of the set of multiple documents to the user for user classification, wherein said user classification comprises manual classification of each document in the at least a portion of the set of multiple documents as one of (i) a true positive document in relation to the sample document and (ii) a false positive document in relation to the sample document;
calculating a confidence measure based on the user classification of the at least a portion of the set of multiple documents, wherein said confidence measure corresponds to a level of accuracy by which the document classification algorithm detects one or more documents related to the sample document as compared to the user classification;
querying the user as to whether the document classification algorithm is to be deployed based on sufficiency of the calculated confidence measure, as determined by the user; and
deploying the document classification algorithm upon an affirmative response from the user in response to said querying as to whether the document classification algorithm is to be deployed;
wherein the steps are carried out by at least one computer device.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. An apparatus comprising:
a memory; and
at least one processor coupled to the memory and configured to:
query a user to identify one or more true positive documents in relation to a sample document;
query the user to identify a document repository that contains documents similar to the sample document;
implement a document classification algorithm to:
analyze a collection of documents within the document repository to identify a set of multiple documents corresponding to the sample document, wherein said analyzing comprises parsing the collection of documents within the document repository to the set of documents based on one or more keywords present in the sample document and in each document of the set;
present at least a portion of the set of multiple documents to the user for user classification, wherein said user classification comprises manual classification of each document in the at least a portion of the set of multiple documents as one of (i) a true positive document in relation to the sample document and (ii) a false positive document in relation to the sample document;
calculate a confidence measure based on the user classification of the at least a portion of the set of multiple documents, wherein said confidence measure corresponds to a level of accuracy by which the document classification algorithm detects one or more documents related to the sample document as compared to the user classification;
query the user as to whether the document classification algorithm is to be deployed based on sufficiency of the calculated confidence measure, as determined by the user; and
deploy the document classification algorithm upon an affirmative response from the user in response to said querying as to whether the document classification algorithm is to be deployed.
13. A method comprising steps of:
receiving a sample document from a user, wherein said sample document represents a type of document to be classified by a document classification algorithm;
querying the user to identify one or more true positive documents in relation to the sample document;
querying the user to identify a document repository that contains documents similar to the sample document;
implementing the document classification algorithm to:
analyze a collection of documents within the document repository identified by the user to identify a set of multiple documents corresponding to the sample document, wherein said analyzing comprises parsing the collection of documents within the document repository to the set of documents based on one or more keywords present in the sample document and in each document of the set;
present at least a portion of the set of multiple documents to the user for manual classification of each of the presented documents as either (i) related to the sample document or (ii) not related to the sample document;
calculating a confidence measure based on the manual classification of the at least a portion of the set of multiple documents, wherein said confidence measure corresponds to a level of accuracy by which the document classification algorithm detects one or more documents related to the sample document as compared to the user classification;
querying the user as to whether the document classification algorithm is to be deployed based on sufficiency of the calculated confidence measure, as determined by the user; and
deploying the document classification algorithm upon an affirmative response from the user in response to said querying as to whether the document classification algorithm is to be deployed;
wherein the steps are carried out by at least one computer device.
Dependent claims: 14, 15, 16, 17, 18, 19, 20
Specification