METHOD AND SYSTEM FOR ACTIVE LEARNING SCREENING PROCESS WITH DYNAMIC INFORMATION MODELING
Abstract
Systems and methods consistent with the present invention improve manual screening processes for a group of documents by implementing an active learning screening process with dynamic information modeling. A classification algorithm is trained to recognize the relationships between concept tags applied to a subset of the documents and true or correct utility or relevance ratings applied to the subset of documents. Once adequately trained using several subsets of documents, the classification algorithm may be applied to the entire group of documents, screening out documents that are not relevant or important.
20 Claims
1. A computer-assisted method for screening information in a set of documents, comprising:
a) creating a set of concepts representing subject matter in the documents;
b) creating a plurality of status labels representing relevancy of the documents to an end use for the set of documents;
c) selecting a first subset of the set of documents;
d) identifying occurrences of concepts from the set of concepts in each document in the first subset based on subject matter in each document;
e) assigning a status label from the plurality of status labels to each document in the first subset based on relevancy of the document to the end use for the document;
f) processing, using a computer, the first subset of the set of documents to learn relationships among the set of concepts and the labels and create a classification model of the relationships;
g) modifying the set of concepts based on the learned relationships among the set of concepts and the status labels;
h) selecting a second subset of the set of documents;
i) assigning a status label from the plurality of status labels to each document in the second subset based on relevancy of the document to the end use for the document;
j) processing, using the computer, the second subset to further learn relationships among the set of concepts and the labels and refine the classification model of the relationships; and
k) if the classification model of the relationships is satisfactory, then screening, using the computer, the set of documents based on the classification model of the relationships, such that a subset of relevant documents is identified.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
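Steps a) through k) of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not name a learning algorithm, so the smoothed log-odds model here is an assumption, as are the keyword-based concept matching, the rule that drops concepts whose learned weight is near zero, the 0.9 agreement threshold used as the "satisfactory" test, and every identifier (`find_concepts`, `ConceptModel`, `screen`, `review`). The `review` callable stands in for a human reviewer assigning status labels.

```python
import math
from collections import Counter

STATUS_LABELS = ("relevant", "not_relevant")  # step b; label names are assumed


def find_concepts(text, concepts):
    """Step d: return the concepts that occur in the text (plain keyword match)."""
    words = set(text.lower().split())
    return {c for c in concepts if c in words}


class ConceptModel:
    """Toy classification model: one smoothed log-odds weight per concept."""

    def __init__(self):
        self.counts = {label: Counter() for label in STATUS_LABELS}
        self.totals = Counter()

    def learn(self, examples):
        # examples: (concept_set, status_label) pairs. Counts accumulate,
        # so a later call refines the model instead of replacing it.
        for concepts, label in examples:
            self.totals[label] += 1
            self.counts[label].update(concepts)

    def weight(self, concept):
        rel = (self.counts["relevant"][concept] + 1) / (self.totals["relevant"] + 2)
        non = (self.counts["not_relevant"][concept] + 1) / (self.totals["not_relevant"] + 2)
        return math.log(rel / non)

    def predict(self, concepts):
        score = sum(self.weight(c) for c in concepts)
        return "relevant" if score > 0 else "not_relevant"


def screen(documents, concepts, review, first_ids, second_ids, min_agreement=0.9):
    model = ConceptModel()
    labels = {}
    # Steps c-f: label the first subset and build the initial model.
    for i in first_ids:
        labels[i] = review(documents[i])
    model.learn((find_concepts(documents[i], concepts), labels[i]) for i in first_ids)
    # Step g (assumed rule): drop concepts that carry almost no learned signal.
    concepts = {c for c in concepts if abs(model.weight(c)) > 0.1}
    # Steps h-j: label the second subset and refine the same model.
    for i in second_ids:
        labels[i] = review(documents[i])
    model.learn((find_concepts(documents[i], concepts), labels[i]) for i in second_ids)
    # Step k (assumed test): the model is satisfactory if it agrees with the reviewer.
    hits = sum(model.predict(find_concepts(documents[i], concepts)) == labels[i]
               for i in labels)
    if hits / len(labels) < min_agreement:
        return None, concepts  # a caller would continue with further subsets
    relevant = [i for i, text in enumerate(documents)
                if model.predict(find_concepts(text, concepts)) == "relevant"]
    return relevant, concepts


documents = [
    "memo merger contract signed by both parties",
    "memo lunch menu for friday",
    "contract dispute over merger terms",
    "office party friday lunch",
    "draft contract attached",
    "friday lunch cancelled",
    "merger announcement and contract review",
    "parking notice for the office",
]
initial_concepts = {"contract", "merger", "lunch", "friday", "office", "memo"}


def review(text):
    """Simulated reviewer for this toy data set only."""
    return "relevant" if "contract" in text.split() else "not_relevant"


relevant_ids, kept_concepts = screen(
    documents, initial_concepts, review, first_ids=[0, 1, 2, 3], second_ids=[4, 5]
)
```

In this toy run the concept "memo" appears equally often in relevant and non-relevant documents of the first subset, so the sketch removes it at step g before the second subset is processed.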
11. A computer-assisted method for identifying relevant documents among a set of documents, comprising:
a) tagging each document in the set of documents with a concept from a set of concepts representing subject matter found in each document;
b) selecting a subset of documents from the set of documents;
c) assigning a status label to each document in the subset, the status label representing a degree of importance of the document;
d) training a classification algorithm using the subset of documents, where the classification algorithm relates the status labels assigned to the documents to the concepts tagged to the documents, using a computer;
e) applying the classification algorithm to the subset of documents to create a classification label for each document in the subset, using the computer, the classification label representing a predicted degree of importance of the document;
f) comparing the classification labels to the status labels, using the computer;
g) if the classification labels substantially correspond to the status labels, then applying the classification algorithm to the set of documents to create a classification label for each document, using the computer; and
h) if the classification labels do not substantially correspond to the status labels, then modifying the set of concepts representing subject matter found in the documents and repeating a) through h) using the modified set of concepts.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
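The compare-and-repeat logic of steps e) through h) in claim 11 can be sketched as follows. This is an illustration under stated assumptions, not the claimed method itself: the claim does not define "substantially correspond", so a 0.9 agreement threshold is assumed; the majority-vote lookup classifier is a stand-in for whatever classification algorithm is used; and concept revision is reduced to adding the next entry from a hypothetical `candidates` list. All function names are invented for the example.

```python
from collections import Counter, defaultdict


def tag(text, concepts):
    """Step a: tag a document with the concepts found in it (keyword match)."""
    words = set(text.lower().split())
    return frozenset(c for c in concepts if c in words)


def train_majority(subset):
    """Step d: relate status labels to concept tags with a majority-vote lookup."""
    by_tags = defaultdict(Counter)
    overall = Counter()
    for tags, label in subset:
        by_tags[tags][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]

    def classify(tags):
        votes = by_tags.get(tags)
        return votes.most_common(1)[0][0] if votes else default

    return classify


def agreement(status_labels, classification_labels):
    """Step f: fraction of documents whose predicted label matches the status label."""
    matches = sum(s == c for s, c in zip(status_labels, classification_labels))
    return matches / len(status_labels)


def identify(documents, status, subset_ids, concepts, candidates, threshold=0.9):
    concepts = set(concepts)
    candidates = list(candidates)
    while True:
        tags = [tag(text, concepts) for text in documents]             # step a
        subset = [(tags[i], status[i]) for i in subset_ids]            # steps b, c
        classify = train_majority(subset)                              # step d
        predicted = [classify(tags[i]) for i in subset_ids]            # step e
        score = agreement([status[i] for i in subset_ids], predicted)  # step f
        if score >= threshold:                                         # step g
            return [classify(t) for t in tags], concepts
        if not candidates:
            raise RuntimeError("no further concept revisions available")
        concepts.add(candidates.pop(0))                                # step h


documents = [
    "invoice overdue payment",
    "invoice paid thanks",
    "team lunch friday",
    "overdue invoice reminder",
    "invoice overdue final notice",
    "lunch friday again",
]
# Reviewer-assigned degrees of importance for the selected subset (toy data).
status = {0: "high", 1: "low", 2: "low", 3: "high"}

all_labels, final_concepts = identify(
    documents, status, subset_ids=[0, 1, 2, 3],
    concepts={"invoice", "lunch"}, candidates=["overdue"],
)
```

With only "invoice" and "lunch" as concepts, documents 0, 1 and 3 receive identical tags, so the classifier cannot separate them and agreement is 3 out of 4. After "overdue" is added, the classification labels match the status labels on the subset and the classifier is applied to all six documents.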
20. A system for identifying relevant documents among a set of documents, comprising:
a memory device containing document data for each document in the set of documents, concept data associated with the document data, status label data associated with the document data and classification label data associated with the document data; and
a processor, communicatively connected to the memory device, that executes code for performing operations comprising:
a) for the document data for each document in the set of documents, storing in the memory device concept data corresponding to a set of concepts representing subject matter found in the document;
b) selecting a subset of documents from the set of documents;
c) for the document data for each document in the subset, storing in the memory device status label data for each document in the subset, the status label data representing a degree of importance of the document;
d) executing a classification algorithm that is trained using the subset of documents, where the classification algorithm relates the status label data associated with the document data for each document in the subset to the associated concept data;
e) for the document data for each document in the subset, storing in the memory device classification label data for each document in the subset, the classification label data representing a predicted degree of importance of the document;
f) for the document data for each document in the subset, comparing the classification label data to the status label data;
g) if the classification label data substantially corresponds to the status label data, then executing the classification algorithm using the set of documents as input to create classification label data associated with the document data for each document in the set of documents; and
h) if the classification label data does not substantially correspond to the status label data, then modifying the set of concepts representing subject matter found in the documents and repeating b) through h) using the modified set of concepts.
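The four kinds of data that claim 20 places in the memory device (document data, concept data, status label data, classification label data) could be organized as one record per document. The sketch below is only one possible layout; `DocumentRecord`, `DocumentStore` and their methods are hypothetical names, and an in-memory dictionary stands in for the memory device.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class DocumentRecord:
    text: str                                    # document data
    concepts: Set[str] = field(default_factory=set)   # concept data
    status_label: Optional[str] = None           # reviewer-assigned importance
    classification_label: Optional[str] = None   # predicted importance


class DocumentStore:
    """In-memory stand-in for the memory device of claim 20."""

    def __init__(self) -> None:
        self.records: Dict[str, DocumentRecord] = {}

    def add(self, doc_id: str, text: str) -> None:
        self.records[doc_id] = DocumentRecord(text)

    def store_concepts(self, doc_id: str, concepts) -> None:       # operation a
        self.records[doc_id].concepts = set(concepts)

    def store_status(self, doc_id: str, label: str) -> None:       # operation c
        self.records[doc_id].status_label = label

    def store_classification(self, doc_id: str, label: str) -> None:  # operation e
        self.records[doc_id].classification_label = label

    def subset(self) -> List[str]:
        """Ids of documents that carry a status label (the selected subset)."""
        return [i for i, r in self.records.items() if r.status_label is not None]

    def correspondence(self) -> float:                              # operation f
        """Fraction of the subset whose classification label equals its status label."""
        ids = self.subset()
        if not ids:
            return 0.0
        matches = sum(
            self.records[i].classification_label == self.records[i].status_label
            for i in ids
        )
        return matches / len(ids)


store = DocumentStore()
store.add("d1", "invoice overdue payment")
store.add("d2", "invoice paid thanks")
store.add("d3", "team lunch friday")
store.store_concepts("d1", ["invoice", "overdue"])
store.store_status("d1", "high")
store.store_status("d2", "low")
store.store_classification("d1", "high")
store.store_classification("d2", "high")
```

Keeping the status label and the classification label as separate fields on the same record makes the comparison in operation f) a simple per-record equality check, and leaves unlabeled documents such as "d3" out of the comparison.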
Specification