Systems and methods for classifying electronic information using advanced active learning techniques
First Claim
1. A system for classifying documents in a document collection into one or more classes or subclasses using a continuous active learning process for the purpose of conducting e-discovery in legal proceedings, the system comprising:
a memory adapted to store the document collection;
a computing device coupled to the memory, the computing device comprising:
a display;
a physical input interface;
a processor coupled to the display and the input interface, the processor being adapted to:
generate a document information profile for the documents in the collection, each document information profile corresponding to a particular document and representing features of that document;
select a document from the collection to present to a human reviewer;
display a portion of the selected document on the display;
receive, through the input interface, one or more user coding decisions associated with the selected document;
for at least one class or subclass, incrementally update a classifier using at least one received user coding decision and the document information profile for the document associated with the at least one received user coding decision;
for at least one classifier, compute a set of scores for the documents in the collection by applying the at least one classifier to the document information profile associated with each document to be scored;
for at least one class or subclass, estimate the number of documents in that class or subclass by fitting the scores calculated using the classifier that corresponds to that class or subclass to a standard distribution;
validate at least one of the estimates using the received user coding decisions;
in response to determining that one of the estimates is valid, indicate, on the display or the input interface, that the review is complete for the class or subclass associated with that estimate;
classify documents in the document collection into the classes or subclasses using the scores and the received user coding decisions; and
repeat the steps of selecting a document, receiving user coding decisions associated with the selected document, calculating a classifier, computing a set of scores, estimating the number of documents in at least one class or subclass, and validating at least one estimate.
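The review loop recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything named here is an assumption made for illustration: the bag-of-words profile, the logistic-regression classifier updated by one gradient step per coding decision, the highest-score selection rule, and the function names `features`, `score`, `update`, `estimate_relevant`, and `review`. The claim says only that scores are fitted to "a standard distribution"; the sketch reads that as fitting one normal curve to the scores of documents coded into the class and another to those coded out of it, which is one possible reading among several. The validation and review-complete steps are omitted.

```python
import math
from statistics import NormalDist

def features(text):
    """Document information profile (assumed form): term -> count."""
    profile = {}
    for word in text.lower().split():
        profile[word] = profile.get(word, 0) + 1
    return profile

def score(weights, profile):
    """Logistic score in (0, 1) for one profile under the current classifier."""
    z = sum(weights.get(term, 0.0) * count for term, count in profile.items())
    return 1.0 / (1.0 + math.exp(-z))

def update(weights, profile, label, rate=0.5):
    """Incremental update: one gradient step on a single coding decision."""
    error = label - score(weights, profile)
    for term, count in profile.items():
        weights[term] = weights.get(term, 0.0) + rate * error * count

def estimate_relevant(scores, coded):
    """Assumed estimator: fit normal curves to coded scores, then sum the
    posterior class probability over the documents not yet reviewed."""
    pos = [scores[i] for i, label in coded.items() if label == 1]
    neg = [scores[i] for i, label in coded.items() if label == 0]
    if len(pos) < 2 or len(neg) < 2:
        return None  # too few coding decisions to fit both curves
    fit_pos, fit_neg = NormalDist.from_samples(pos), NormalDist.from_samples(neg)
    if fit_pos.variance == 0 or fit_neg.variance == 0:
        return None  # degenerate fit; the density is undefined
    prior = len(pos) / len(coded)
    total = float(len(pos))
    for i, s in enumerate(scores):
        if i in coded:
            continue
        p = prior * fit_pos.pdf(s)
        q = (1.0 - prior) * fit_neg.pdf(s)
        if p + q > 0:
            total += p / (p + q)
    return total

def review(docs, reviewer, rounds):
    """Select, code, update, rescore, and estimate, repeated for `rounds`."""
    profiles = [features(d) for d in docs]
    weights, coded = {}, {}
    scores = [score(weights, p) for p in profiles]
    for _ in range(rounds):
        remaining = [i for i in range(len(docs)) if i not in coded]
        if not remaining:
            break
        pick = max(remaining, key=lambda i: scores[i])  # top-scoring unreviewed document
        coded[pick] = reviewer(docs[pick])              # user coding decision (1 or 0)
        update(weights, profiles[pick], coded[pick])
        scores = [score(weights, p) for p in profiles]
    return weights, scores, coded, estimate_relevant(scores, coded)
```

In this sketch `reviewer` stands in for the human reviewer and the input interface: it is any callable that takes the document text and returns 1 or 0. No seed set is needed, because the first pick is made from the untrained classifier's uniform scores.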
Abstract
Systems and methods for classifying electronic information or documents into a number of classes and subclasses are provided through an active learning algorithm. In certain embodiments, seed sets may be eliminated by merging relevance feedback and machine learning phases. In certain embodiments, the active learning algorithm forks a number of classification paths corresponding to predicted user coding decisions for a selected document. The active learning algorithm determines an order in which the documents of the collection may be processed and scored by the forked classification paths. Such document classification systems are easily scalable for large document collections, require less manpower and can be employed on a single computer, thus requiring fewer resources. Furthermore, the classification systems and methods described can be used for any pattern recognition or classification effort in a wide variety of fields.
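The forking of classification paths described in the abstract can be illustrated with a short Python sketch. The abstract does not say how the paths are built, so everything here is an assumption for illustration: a two-label coding scheme, a logistic-regression classifier updated by one gradient step, and the hypothetical helpers `sgd_step`, `fork_paths`, and `resolve`. The idea shown is that one candidate classifier is precomputed for each coding decision the reviewer might make, so the matching one is ready when the actual decision arrives. The ordering of documents across forked paths is not modeled.

```python
import math

def sgd_step(weights, profile, label, rate=0.5):
    """Return a new weight dict after one logistic-regression gradient step.
    The input weights are left untouched so that paths stay independent."""
    z = sum(weights.get(term, 0.0) * count for term, count in profile.items())
    error = label - 1.0 / (1.0 + math.exp(-z))
    updated = dict(weights)
    for term, count in profile.items():
        updated[term] = updated.get(term, 0.0) + rate * error * count
    return updated

def fork_paths(weights, profile, possible_labels=(0, 1)):
    """Precompute one candidate classifier per predicted coding decision."""
    return {label: sgd_step(weights, profile, label) for label in possible_labels}

def resolve(forks, actual_label):
    """Keep the path that matches the reviewer's decision; drop the others."""
    return forks[actual_label]
```

Copying the weights for each path costs memory in proportion to the number of possible decisions, which is the price paid for having the next classifier ready while the reviewer is still reading.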
10 Claims
Specification