Systems and methods for classifying electronic information using advanced active learning techniques
First Claim
1. A system for classifying documents in a document collection as relevant or non-relevant in connection with conducting e-discovery in a legal proceeding, the system comprising:
a memory configured to store the document collection;
a computing device coupled to the memory, the computing device comprising:
a display;
a physical input interface;
a processor coupled to the display and the input interface, the processor being configured to:
generate a document information profile for the documents in the collection, each document information profile corresponding to a particular document and representing features and related metadata of that document and no other document;
select a document from the collection to present to a human reviewer;
display a portion of the selected document on the display;
receive, through the input interface, one or more user coding decisions associated with the selected document;
update a classifier using at least one received user coding decision and the document information profile for the document associated with the at least one received user coding decision, wherein the classifier is updated using an incremental learning technique;
compute a set of scores for the documents in the collection by applying the updated classifier to the document information profile associated with each document to be scored;
estimate a number of relevant documents in the document collection by (i) fitting scores computed for documents for which user coding decisions were received to a standard distribution curve, and (ii) calculating an area beneath the curve in order to determine whether review is complete by comparing the estimate to a number of documents in the document collection that the user coded as relevant and that were used to update the classifier;
indicate on the display statistics pertaining to the extent to which review is complete;
in response to determining that review is not complete, repeat the steps of selecting a document, displaying a portion of the selected document, receiving one or more user coding decisions associated with the selected document, updating a classifier, computing a set of scores, and estimating a number of relevant documents; and
classify documents in the document collection as relevant or non-relevant to the legal proceeding using the computed scores or the received user coding decisions.
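The review loop recited in claim 1 can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the patent: the document information profile is reduced to term counts, the incremental learning technique is a single stochastic-gradient step of logistic regression, the document selected for review is the highest-scoring uncoded one, and the "standard distribution curve" is read as a normal curve fitted to the scores of documents coded relevant. The function names (`profile`, `score`, `update`, `estimate_relevant`, `review`), the `oracle` callback standing in for the display and input interface, and the `target` parameter are all hypothetical.

```python
import math
from collections import Counter
from statistics import NormalDist


def profile(text):
    """Document information profile; here just term counts (an assumption)."""
    return Counter(text.lower().split())


def score(weights, prof):
    """Apply the classifier to one profile; returns a score in (0, 1)."""
    z = sum(weights.get(term, 0.0) * n for term, n in prof.items())
    z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def update(weights, prof, label, lr=0.5):
    """Incremental learning: one logistic-regression SGD step per coded document."""
    err = label - score(weights, prof)
    for term, n in prof.items():
        weights[term] = weights.get(term, 0.0) + lr * err * n


def estimate_relevant(scores, coded):
    """Hypothetical estimator (one reading of the claim, not the patent's method).

    Fits a normal curve to the scores of documents coded relevant, takes the
    area under it above the lowest score reviewed so far as the fraction of
    relevant documents already reached, and scales the found count by it.
    """
    rel = [scores[d] for d, y in coded.items() if y == 1]
    if len(rel) < 2:
        return None  # too few points to fit a curve
    dist = NormalDist.from_samples(rel)
    if dist.stdev == 0:
        return None  # degenerate fit; cdf is undefined
    cutoff = min(scores[d] for d in coded)
    area = 1.0 - dist.cdf(cutoff)
    return len(rel) / area if area > 0 else None


def review(docs, oracle, target=0.9):
    """Select, code, update, score, estimate; repeat until review looks complete."""
    profiles = {d: profile(text) for d, text in docs.items()}
    weights, coded, scores = {}, {}, {}
    while len(coded) < len(docs):
        scores = {d: score(weights, p) for d, p in profiles.items()}
        pick = max((d for d in docs if d not in coded), key=lambda d: scores[d])
        coded[pick] = oracle(pick)  # stands in for the human coding decision
        update(weights, profiles[pick], coded[pick])
        scores = {d: score(weights, p) for d, p in profiles.items()}
        est = estimate_relevant(scores, coded)
        found = sum(coded.values())
        if est is not None and found >= target * est:
            break  # found count is close to the estimated total
    # Final classification: coding decisions where available, scores otherwise.
    labels = {d: coded.get(d, int(scores[d] > 0.5)) for d in docs}
    return labels, coded
```

Calling `review(docs, oracle)` with `docs` as a dict of document id to text and `oracle` as a function returning 1 (relevant) or 0 (non-relevant) returns the final labels together with the coding decisions collected. On small collections the fitted curve is crude, so the stopping rule here should be treated as a placeholder rather than a validated completeness test.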
Abstract
Systems and methods for classifying electronic information or documents into a number of classes and subclasses are provided through an active learning algorithm. Such document classification systems are easily scalable for large document collections, require less manpower and can be employed on a single computer, thus requiring fewer resources. Furthermore, the classification systems and methods described can be used for any pattern recognition or classification effort in a wide variety of fields, including electronic discovery in legal proceedings.
151 Citations
20 Claims
Specification