Systems and methods for classifying electronic information using advanced active learning techniques
Abstract
Systems and methods for classifying electronic information or documents into a number of classes and subclasses are provided through an active learning algorithm. In certain embodiments, seed sets may be eliminated by merging relevance feedback and machine learning phases. In certain embodiments, the active learning algorithm forks a number of classification paths corresponding to predicted user coding decisions for a selected document. The active learning algorithm determines an order in which the documents of the collection may be processed and scored by the forked classification paths. Such document classification systems are easily scalable for large document collections, require less manpower and can be employed on a single computer, thus requiring fewer resources. Furthermore, the classification systems and methods described can be used for any pattern recognition or classification effort in a wide variety of fields.
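The forking of classification paths described above can be sketched as follows. This is a minimal illustration only, assuming a linear classifier over sparse feature dictionaries and a simple additive update; the update rule, the learning rate, and every name here (`fork`, `predicted_classifier`, `score_in_order`, `resolve`) are hypothetical and are not taken from the patent.

```python
from typing import Dict, Iterable, List

Profile = Dict[str, float]   # document information profile: feature -> value
Weights = Dict[str, float]   # linear classifier: feature -> weight


def score(weights: Weights, profile: Profile) -> float:
    """Dot product of a classifier and a document information profile."""
    return sum(weights.get(f, 0.0) * v for f, v in profile.items())


def predicted_classifier(current: Weights, profile: Profile,
                         decision: int, lr: float = 0.5) -> Weights:
    """Classifier that would result if the user coded the document `decision`.

    decision is +1 (relevant) or -1 (not relevant). The additive update is an
    assumed stand-in for whatever learner an implementation actually uses.
    """
    updated = dict(current)  # leave the current classifier untouched
    for f, v in profile.items():
        updated[f] = updated.get(f, 0.0) + lr * decision * v
    return updated


def fork(current: Weights, selected: Profile,
         decisions: Iterable[int] = (+1, -1)) -> Dict[int, Weights]:
    """One predicted classifier per possible user coding decision."""
    return {d: predicted_classifier(current, selected, d) for d in decisions}


def score_in_order(forks: Dict[int, Weights], profiles: Dict[str, Profile],
                   order: List[str]) -> Dict[int, Dict[str, float]]:
    """Score documents with every fork, following the processing order."""
    return {d: {doc: score(w, profiles[doc]) for doc in order}
            for d, w in forks.items()}


def resolve(scores: Dict[int, Dict[str, float]], received: int) -> Dict[str, float]:
    """Keep only the scores from the fork matching the received decision."""
    return scores[received]
```

In this sketch the scores for both forks are computed before the reviewer responds, so `resolve` only has to select the precomputed set once the actual coding decision arrives.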
47 Claims
-
1. An active learning system for classifying documents in a document collection as a member of one or more classes or subclasses, the system comprising:
a processor being adapted to:
select a document from the document collection;
calculate at least two predicted classifiers for at least one of the one or more classes or subclasses, each predicted classifier being calculated using a document information profile for the selected document, a current classifier associated with at least one of the one or more classes or subclasses, and a different coding decision selected from a set of possible user coding decisions to be received from a user, thereby resulting in a plurality of predicted classifiers each one corresponding to a different user coding decision;
determine a processing order for a subset of documents in the document collection that indicates an order in which the documents of the subset are to be scored;
for each one of the predicted classifiers, calculate a set of scores for one or more documents in the document collection, at least in part, according to the processing order, wherein each score is generated for a document by utilizing the corresponding predicted classifier and a document information profile of the document to be scored;
receive a user coding decision;
determine whether one or more stopping criteria have been met using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision, wherein determining whether one or more stopping criteria have been met includes selecting and presenting documents from the document collection to a user and calculating an estimate of system effectiveness using the user coding decisions for the selected documents;
so long as the one or more stopping criteria have not been met, select a further document from the document collection and repeat the steps of calculating predicted classifiers, determining a processing order, calculating a set of scores, and classifying a set of documents based on the selected further documents; and
in response to determining whether one or more stopping criteria have been met, classify a set of documents in the document collection into one or more of the one or more classes or subclasses using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision.
Dependent claims: 9, 10, 11, 15, 16, 17, 18, 19, 20, 21, 22
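The stopping test in claim 1 (present selected documents to a user, then estimate system effectiveness from the resulting coding decisions) might look like the sketch below. It assumes that effectiveness is measured as recall on a random sample and that a fixed score threshold separates the classes; the claim does not specify either choice, and `estimate_recall`, `stopping_criterion_met`, `ask_user`, `threshold`, and `target` are hypothetical names.

```python
import random
from typing import Callable, Dict, List, Optional


def estimate_recall(doc_ids: List[str], scores: Dict[str, float],
                    ask_user: Callable[[str], bool],
                    threshold: float = 0.0, sample_size: int = 100,
                    seed: int = 0) -> Optional[float]:
    """Estimate recall from user coding decisions on a random sample.

    ask_user(doc_id) presents the document and returns True if the reviewer
    codes it relevant. Returns None when the sample has no relevant documents.
    """
    rng = random.Random(seed)
    sample = rng.sample(doc_ids, min(sample_size, len(doc_ids)))
    relevant = [d for d in sample if ask_user(d)]
    if not relevant:
        return None
    # Relevant sampled documents that the classifier would also have found.
    found = sum(1 for d in relevant if scores[d] >= threshold)
    return found / len(relevant)


def stopping_criterion_met(recall_estimate: Optional[float],
                           target: float = 0.75) -> bool:
    """Stop once the estimated recall reaches the (assumed) target."""
    return recall_estimate is not None and recall_estimate >= target
```

A larger sample narrows the uncertainty of the estimate at the cost of more reviewer effort; the default of 100 here is arbitrary.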
-
2. An active learning system for classifying documents in a document collection as a member of one or more classes or subclasses, the system comprising:
a processor being adapted to:
select a document from the document collection;
calculate at least two predicted classifiers for at least one of the one or more classes or subclasses, each predicted classifier being calculated using a document information profile for the selected document, a current classifier associated with at least one of the one or more classes or subclasses, and a different coding decision selected from a set of possible user coding decisions to be received from a user, thereby resulting in a plurality of predicted classifiers each one corresponding to a different user coding decision;
determine a processing order for a subset of documents in the document collection that indicates an order in which the documents of the subset are to be scored;
for each one of the predicted classifiers, calculate a set of scores for one or more documents in the document collection, at least in part, according to the processing order, wherein each score is generated for a document by utilizing the corresponding predicted classifier and a document information profile of the document to be scored;
receive a user coding decision;
determine whether one or more stopping criteria have been met using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision, wherein determining whether one or more stopping criteria have been met includes calculating an estimate of the number of documents in the document collection that are relevant to one of the classes or subclasses by fitting the subset of the set of scores to a standard distribution and validating the estimate using a sequence of user coding decisions;
so long as the one or more stopping criteria have not been met, select a further document from the document collection and repeat the steps of calculating predicted classifiers, determining a processing order, calculating a set of scores, and classifying a set of documents based on the selected further documents; and
in response to determining whether one or more stopping criteria have been met, classify a set of documents in the document collection into one or more of the one or more classes or subclasses using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision.
Dependent claims: 3
-
4. An active learning system for classifying documents in a document collection as a member of one or more classes or subclasses, the system comprising:
a processor being adapted to:
receive or provide relevance rankings, wherein the relevance rankings are generated by one or more keyword searching algorithms or by a comparison with one or more exemplary documents;
select a document from the document collection;
calculate at least two predicted classifiers for at least one of the one or more classes or subclasses, each predicted classifier being calculated using a document information profile for the selected document, a current classifier associated with at least one of the one or more classes or subclasses, and a different coding decision selected from a set of possible user coding decisions to be received from a user, thereby resulting in a plurality of predicted classifiers each one corresponding to a different user coding decision;
determine a processing order for a subset of documents in the document collection that indicates an order in which the documents of the subset are to be scored, wherein the processing order is derived, at least in part, from a set of scores calculated using the current classifier and the relevance rankings;
for each one of the predicted classifiers, calculate a set of scores for one or more documents in the document collection, at least in part, according to the processing order, wherein each score is generated for a document by utilizing the corresponding predicted classifier and a document information profile of the document to be scored; and
in response to receiving a user coding decision, classify a set of documents in the document collection into one or more of the one or more classes or subclasses using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision.
Dependent claims: 5, 6, 7, 8
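One way to derive a processing order from both the current classifier's scores and externally supplied relevance rankings, as claim 4 recites, is to blend the two as ranks. The sketch below assumes a weighted sum of rank positions; the blending rule, the `alpha` parameter, and the function name `processing_order` are illustrative assumptions, not something the claim prescribes.

```python
from typing import Dict, List


def processing_order(classifier_scores: Dict[str, float],
                     relevance_rankings: Dict[str, int],
                     alpha: float = 0.5) -> List[str]:
    """Order documents for scoring by blending two rankings.

    classifier_scores: doc_id -> score from the current classifier
                       (higher means more likely relevant).
    relevance_rankings: doc_id -> rank from keyword search or comparison with
                        exemplar documents (1 is best); missing documents are
                        treated as ranked last.
    alpha: weight on the classifier rank (1.0 ignores relevance_rankings).
    """
    by_score = sorted(classifier_scores, key=lambda d: -classifier_scores[d])
    score_rank = {d: i + 1 for i, d in enumerate(by_score)}
    worst = len(classifier_scores) + 1

    def blended(doc_id: str) -> float:
        external = relevance_rankings.get(doc_id, worst)
        return alpha * score_rank[doc_id] + (1.0 - alpha) * external

    # Ties are broken by document id so the order is deterministic.
    return sorted(classifier_scores, key=lambda d: (blended(d), d))
```

For example, with scores `{"a": 0.9, "b": 0.1, "c": 0.5}` and rankings `{"a": 3, "b": 1, "c": 2}`, setting `alpha=0.0` follows the keyword ranking alone, while `alpha=1.0` follows the classifier alone.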
-
12. An active learning system for classifying documents in a document collection as a member of one or more classes or subclasses, the system comprising:
a processor being adapted to:
pre-process documents from the document collection to reduce the dimensionality of document information profiles, wherein pre-processing the documents includes converting characters of the document to a common case or compressing strings of non-alphanumeric characters to a single character;
select a document from the document collection;
calculate at least two predicted classifiers for at least one of the one or more classes or subclasses, each predicted classifier being calculated using a document information profile for the selected document, a current classifier associated with at least one of the one or more classes or subclasses, and a different coding decision selected from a set of possible user coding decisions to be received from a user, thereby resulting in a plurality of predicted classifiers each one corresponding to a different user coding decision;
determine a processing order for a subset of documents in the document collection that indicates an order in which the documents of the subset are to be scored;
for each one of the predicted classifiers, calculate a set of scores for one or more documents in the document collection, at least in part, according to the processing order, wherein each score is generated for a document by utilizing the corresponding predicted classifier and a document information profile of the document to be scored; and
in response to receiving a user coding decision, classify a set of documents in the document collection into one or more of the one or more classes or subclasses using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision.
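The two pre-processing operations named in claim 12 are simple to illustrate. The sketch below applies both (the claim requires only one or the other), and it assumes lower case as the common case and a single space as the replacement character; the function name `preprocess` is hypothetical.

```python
import re

# One or more characters that are neither letters nor digits.
# \W excludes the underscore, so it is added explicitly.
_NON_ALNUM_RUN = re.compile(r"[\W_]+")


def preprocess(text: str) -> str:
    """Fold case and collapse each run of non-alphanumerics to one space."""
    return _NON_ALNUM_RUN.sub(" ", text.lower())
```

Under these assumptions, `preprocess("Hello,   WORLD!!! -- Re: Q3_report")` yields `"hello world re q3 report"`, so variants that differ only in case or punctuation map to the same features.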
-
13. An active learning system for classifying documents in a document collection as a member of one or more classes or subclasses, the system comprising:
a processor being adapted to:
extract document information profiles from the documents in the document collection using an N-gram technique and hash N-grams to reduce dimensionalities of document information profiles, wherein the N-grams are hashed to balance frequencies of N-grams assigned to hash values;
select a document from the document collection;
calculate at least two predicted classifiers for at least one of the one or more classes or subclasses, each predicted classifier being calculated using a document information profile for the selected document, a current classifier associated with at least one of the one or more classes or subclasses, and a different coding decision selected from a set of possible user coding decisions to be received from a user, thereby resulting in a plurality of predicted classifiers each one corresponding to a different user coding decision;
determine a processing order for a subset of documents in the document collection that indicates an order in which the documents of the subset are to be scored;
for each one of the predicted classifiers, calculate a set of scores for one or more documents in the document collection, at least in part, according to the processing order, wherein each score is generated for a document by utilizing the corresponding predicted classifier and a document information profile of the document to be scored; and
in response to receiving a user coding decision, classify a set of documents in the document collection into one or more of the one or more classes or subclasses using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision.
Dependent claims: 14
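Hashing N-grams into a fixed number of buckets, as in claim 13, bounds the dimensionality of each document information profile. The sketch below assumes overlapping character N-grams and uses CRC-32 as a deterministic hash; it does not implement the frequency-balancing step the claim mentions and simply relies on the hash spreading N-grams roughly evenly. The names `char_ngrams` and `hashed_profile` and the default sizes are hypothetical.

```python
import zlib
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int = 4) -> List[str]:
    """All overlapping character N-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hashed_profile(text: str, n: int = 4,
                   num_buckets: int = 2 ** 20) -> Dict[int, int]:
    """Map a document to bucket -> count of N-grams hashed to that bucket."""
    counts: Counter = Counter()
    for gram in char_ngrams(text, n):
        # zlib.crc32 is stable across runs, unlike the built-in hash().
        bucket = zlib.crc32(gram.encode("utf-8")) % num_buckets
        counts[bucket] += 1
    return dict(counts)
```

Distinct N-grams can collide in the same bucket; a smaller `num_buckets` trades more collisions for a more compact profile.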
-
23. A non-transitory computer storage medium comprising program instructions for classifying documents in a document collection as a member of one or more classes or subclasses, wherein the program instructions, when executed on a processor, cause the processor to:
select a document from the document collection;
calculate at least two predicted classifiers for at least one of the one or more classes or subclasses, each predicted classifier being calculated using a document information profile for the selected document, a current classifier associated with at least one of the one or more classes or subclasses, and a different coding decision selected from a set of possible user coding decisions to be received from a user, thereby resulting in a plurality of predicted classifiers each one corresponding to a different user coding decision;
determine a processing order for a subset of documents in the document collection that indicates an order in which the documents of the subset are to be scored;
for each one of the predicted classifiers, calculate a set of scores for one or more documents in the document collection, at least in part, according to the processing order, wherein each score is generated for a document by utilizing the corresponding predicted classifier and a document information profile of the document to be scored;
receive a user coding decision;
determine whether one or more stopping criteria have been met using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision, wherein determining whether one or more stopping criteria have been met includes selecting and presenting documents from the document collection to a user and calculating an estimate of system effectiveness using the user coding decisions for the selected documents;
so long as the one or more stopping criteria have not been met, select a further document from the document collection and repeat the steps of calculating predicted classifiers, determining a processing order, calculating a set of scores, and classifying a set of documents based on the selected further documents; and
in response to determining whether one or more stopping criteria have been met, classify a set of documents in the document collection into one or more of the one or more classes or subclasses using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision.
Dependent claims: 24, 25
-
-
26. A method for classifying documents in a document collection as a member of one or more classes or subclasses, the method comprising:
selecting a document from the document collection;
calculating at least two predicted classifiers for at least one of the one or more classes or subclasses, each predicted classifier being calculated using a document information profile for the selected document, a current classifier associated with at least one of the one or more classes or subclasses, and a different coding decision selected from a set of possible user coding decisions to be received from a user, thereby resulting in a plurality of predicted classifiers each one corresponding to a different user coding decision;
determining a processing order for a subset of documents in the document collection that indicates an order in which the documents of the subset are to be scored;
for each one of the predicted classifiers, calculating a set of scores for one or more documents in the document collection, at least in part, according to the processing order, wherein each score is generated for a document by utilizing the corresponding predicted classifier and a document information profile of the document to be scored;
receiving a user coding decision;
determining whether one or more stopping criteria have been met using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision, wherein determining whether one or more stopping criteria have been met includes calculating an estimate of the number of documents in the document collection that are relevant to one of the classes or subclasses by fitting the subset of the set of scores to a standard distribution and validating the estimate using a sequence of user coding decisions;
so long as the one or more stopping criteria have not been met, selecting a further document from the document collection and repeating the steps of calculating predicted classifiers, determining a processing order, calculating a set of scores, and classifying a set of documents based on the selected further documents; and
in response to determining whether one or more stopping criteria have been met, classifying a set of documents in the document collection into one or more of the one or more classes or subclasses using a subset of the set of scores based on the predicted classifier that corresponds to the received user coding decision.
-
-
27. A method for processing documents in a document collection using a continuous active learning algorithm, the method comprising:
generating or receiving a document information profile for one or more of the documents in the document collection, each document information profile corresponding to a particular document and representing features of that document;
selecting a first document from the document collection to present to a human reviewer;
receiving a user coding decision associated with the first document;
calculating a classifier based on the received user coding decision associated with the first document and the document information profile for the first document;
computing a first set of scores for a subset of documents in the collection by applying the classifier to the document information profile for each document in the subset for which the first set of scores are computed;
determining a first plurality of rankings for documents in the document collection by choosing between the first set of scores, relevance rankings derived from user input for the documents in the document collection or a combination of the first set of scores and the relevance rankings;
selecting a further document from the document collection to present to a human reviewer based on the first plurality of rankings;
receiving a user coding decision associated with the further document;
updating the classifier based on the received user coding decision associated with the further document and the document information profile for the further document;
computing a second set of scores for a subset of documents in the collection by applying the updated classifier to the document information profile for each document in the subset for which the second set of scores are computed;
determining a second plurality of rankings for documents in the document collection by choosing between the second set of scores, relevance rankings derived from user input for the documents in the document collection or a combination of the second set of scores and the relevance rankings;
determining whether a stopping criterion is met based on the user coding decisions associated with the further document or based on one or more thresholds set using the computed second set of scores; and
in response to determining that the stopping criterion has not been met, repeating the steps of selecting a further document, receiving a user coding decision associated with the further document, updating the classifier, computing a second set of scores, determining a second plurality of rankings, and determining whether a stopping criterion is met.
Dependent claims: 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45
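The continuous active learning loop of claim 27 can be sketched end to end under some simplifying assumptions: a linear classifier with an additive update, rankings taken from the classifier scores alone (the claim also allows relevance rankings or a combination), and a stopping criterion that fires after a run of consecutive non-relevant coding decisions. None of these specifics come from the claim, and the names `continuous_active_learning`, `review`, and `max_nonrelevant_run` are hypothetical.

```python
from typing import Callable, Dict, Optional

Profile = Dict[str, float]   # document information profile: feature -> value


def apply_classifier(weights: Dict[str, float], profile: Profile) -> float:
    """Score a document with a linear classifier."""
    return sum(weights.get(f, 0.0) * v for f, v in profile.items())


def update_classifier(weights: Dict[str, float], profile: Profile,
                      relevant: bool, lr: float = 1.0) -> None:
    """Assumed additive update from one user coding decision (in place)."""
    sign = 1.0 if relevant else -1.0
    for f, v in profile.items():
        weights[f] = weights.get(f, 0.0) + lr * sign * v


def continuous_active_learning(profiles: Dict[str, Profile], first_doc: str,
                               review: Callable[[str], bool],
                               max_nonrelevant_run: int = 3) -> Dict[str, bool]:
    """Run the review loop and return doc_id -> coding decision."""
    weights: Dict[str, float] = {}
    coded: Dict[str, bool] = {}
    doc: Optional[str] = first_doc
    run = 0  # consecutive non-relevant decisions so far
    while doc is not None:
        coded[doc] = review(doc)                  # present to the reviewer
        update_classifier(weights, profiles[doc], coded[doc])
        run = 0 if coded[doc] else run + 1
        if run >= max_nonrelevant_run:            # assumed stopping criterion
            break
        remaining = [d for d in profiles if d not in coded]
        scores = {d: apply_classifier(weights, profiles[d]) for d in remaining}
        ranking = sorted(remaining, key=lambda d: (-scores[d], d))
        doc = ranking[0] if ranking else None     # top-ranked goes next
    return coded
```

Because the classifier is updated after every single coding decision, each newly selected document reflects all feedback received so far, which is what distinguishes this loop from batch retraining.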
-
-
46. A continuous active learning system for processing documents in a document collection, the system comprising:
a processor being adapted to:
generate or receive a document information profile for one or more of the documents in the document collection, each document information profile corresponding to a particular document and representing features of that document;
select a first document from the document collection to present to a human reviewer;
receive a user coding decision associated with the first document;
calculate a classifier based on the received user coding decision associated with the first document and the document information profile for the first document;
compute a first set of scores for a subset of documents in the collection by applying the classifier to the document information profile for each document in the subset for which the first set of scores are computed;
determine a first plurality of rankings for documents in the document collection by choosing between the first set of scores, relevance rankings derived from user input for the documents in the document collection or a combination of the first set of scores and the relevance rankings;
select a further document from the document collection to present to a human reviewer based on the first plurality of rankings;
receive a user coding decision associated with the further document;
update the classifier based on the received user coding decision associated with the further document and the document information profile for the further document;
compute a second set of scores for a subset of documents in the collection by applying the updated classifier to the document information profile for each document in the subset for which the second set of scores are computed;
determine a second plurality of rankings for documents in the document collection by choosing between the second set of scores, relevance rankings derived from user input for the documents in the document collection or a combination of the second set of scores and the relevance rankings;
determine whether a stopping criterion is met based on the user coding decisions associated with the further document or based on one or more thresholds set using the computed second set of scores; and
in response to determining that the stopping criterion has not been met, repeat the steps of selecting a further document, receiving a user coding decision associated with the further document, updating the classifier, computing a second set of scores, determining a second plurality of rankings, and determining whether a stopping criterion is met.
-
47. A non-transitory computer storage medium comprising program instructions for processing documents in a document collection using a continuous active learning algorithm, wherein the program instructions, when executed on a processor, cause the processor to:
generate or receive a document information profile for one or more of the documents in the document collection, each document information profile corresponding to a particular document and representing features of that document;
select a first document from the document collection to present to a human reviewer;
calculate a classifier based on a received user coding decision associated with the first document and the document information profile for the first document;
compute a first set of scores for a subset of documents in the collection by applying the classifier to the document information profile for each document in the subset for which the first set of scores are computed;
determine a first plurality of rankings for documents in the document collection by choosing between the first set of scores, relevance rankings derived from user input for the documents in the document collection or a combination of the first set of scores and the relevance rankings;
select a further document from the document collection to present to a human reviewer based on the first plurality of rankings;
update the classifier based on a received user coding decision associated with the further document and the document information profile for the further document;
compute a second set of scores for a subset of documents in the collection by applying the updated classifier to the document information profile for each document in the subset for which the second set of scores are computed;
determine a second plurality of rankings for documents in the document collection by choosing between the second set of scores, relevance rankings derived from user input for the documents in the document collection or a combination of the second set of scores and the relevance rankings;
determine whether a stopping criterion is met based on the user coding decisions associated with the further document or based on one or more thresholds set using the computed second set of scores; and
in response to determining that the stopping criterion has not been met, repeat the steps of selecting a further document, receiving a user coding decision associated with the further document, updating the classifier, computing a second set of scores, determining a second plurality of rankings, and determining whether a stopping criterion is met.
-
Specification