Systems and methods for identifying and categorizing electronic documents through machine learning
First Claim
1. A system for categorizing electronic documents, comprising:
a memory device that stores a set of instructions;
at least one processor that executes the instructions to:
receive categorizations for electronic documents included in a first seed set, the electronic documents in the first seed set being selected among a corpus of electronic documents;
train a document categorizer on the categorizations using a machine learning algorithm;
categorize the remaining electronic documents in the corpus using the trained document categorizer;
compare one or more metrics associated with performance of the trained document categorizer to a first threshold associated with performance of the trained document categorizer, the one or more metrics being determined based on the categorizations;
in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
analyze a portion of electronic documents among the corpus different from the electronic documents included in the first seed set to identify one or more electronic documents of the portion that have been assigned respective categorization metrics satisfying or not satisfying a second threshold, wherein the second threshold is associated with categorizations metrics applicable to individual electronic documents;
designate the one or more electronic documents of the portion as a second seed set; and
provide the second seed set for categorization;
receive categorizations for the electronic documents included in the second seed set;
retrain the document categorizer on the categorized electronic documents included in the second seed set using the machine learning algorithm;
re-categorize the remaining electronic documents in the corpus using the retrained document categorizer;
compare one or more metrics associated with performance of the retrained document categorizer to the first threshold, the one or more metrics being determined based on the re-categorizations of the remaining electronic documents; and
iterate through generating seed sets, retraining the document categorizer, and re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer until the one or more metrics associated with performance of the retrained document categorizer are greater than the first threshold.
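The control flow recited in claim 1 can be sketched in Python as follows. This is an illustrative sketch only, not the claimed implementation: the function name `iterate_until_threshold`, its callable parameters, the `max_rounds` safety cap, and the toy integer corpus are all assumptions introduced here. The sketch retrains on every label received so far, whereas the claim recites retraining on the second seed set, and the toy `measure` scores predictions against a hidden ground truth that a real review would have to estimate some other way.

```python
from typing import Callable, Dict, List, Tuple

Labels = Dict[int, str]   # document id -> category
Model = float             # toy model: a numeric decision boundary (assumption)


def iterate_until_threshold(
    corpus: List[int],
    first_seed: List[int],
    get_labels: Callable[[List[int]], Labels],
    train: Callable[[Labels], Model],
    categorize: Callable[[Model, List[int]], Labels],
    measure: Callable[[Labels], float],
    pick_next_seed: Callable[[Model, List[int]], List[int]],
    first_threshold: float,
    max_rounds: int = 20,  # assumption: safety cap, not recited in the claim
) -> Tuple[Labels, float, int]:
    """Repeat seed -> train -> categorize -> compare until the metric exceeds the threshold."""
    labeled: Labels = {}
    predictions: Labels = {}
    metric = float("-inf")
    seed = list(first_seed)
    rounds = 0
    while seed and rounds < max_rounds:
        rounds += 1
        labeled.update(get_labels(seed))              # receive categorizations
        model = train(labeled)                        # train or retrain
        remaining = [d for d in corpus if d not in labeled]
        predictions = categorize(model, remaining)    # categorize the rest
        metric = measure(predictions)                 # performance metric
        if metric > first_threshold:                  # compare to first threshold
            break
        seed = pick_next_seed(model, remaining)       # designate next seed set
    return predictions, metric, rounds


# --- Toy setup (all hypothetical): documents are integers 0..19 ---
CORPUS = list(range(20))


def truth(doc: int) -> str:
    return "relevant" if doc >= 12 else "other"


def get_labels(seed: List[int]) -> Labels:
    # Stands in for a human reviewer categorizing the seed set.
    return {d: truth(d) for d in seed}


def train(labeled: Labels) -> Model:
    # Boundary halfway between the largest "other" and smallest "relevant" label.
    low = max(d for d, c in labeled.items() if c == "other")
    high = min(d for d, c in labeled.items() if c == "relevant")
    return (low + high) / 2


def categorize(boundary: Model, docs: List[int]) -> Labels:
    return {d: ("relevant" if d >= boundary else "other") for d in docs}


def measure(predictions: Labels) -> float:
    # Demo-only accuracy against the hidden ground truth.
    return sum(predictions[d] == truth(d) for d in predictions) / len(predictions)


def pick_next_seed(boundary: Model, remaining: List[int]) -> List[int]:
    # The two documents closest to the current boundary.
    return sorted(remaining, key=lambda d: abs(d - boundary))[:2]


predictions, metric, rounds = iterate_until_threshold(
    CORPUS, [0, 19], get_labels, train, categorize, measure, pick_next_seed,
    first_threshold=0.95,
)
print(rounds, metric)
```

Passing the steps in as callables keeps the loop independent of any particular machine learning algorithm, metric, or seed selection rule, none of which the claim fixes.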
Abstract
Computer-implemented systems and methods are disclosed for identifying and categorizing electronic documents through machine learning. In accordance with some embodiments, a seed set of categorized electronic documents may be used to train a document categorizer based on a machine learning algorithm. The trained document categorizer may categorize electronic documents in a large corpus of electronic documents. Performance metrics associated with the trained document categorizer may be tracked, and additional seed sets of categorized electronic documents may be used to improve the performance of the document categorizer by retraining it on subsequent seed sets. Additional seed sets and categorizations may be iterated through until a desired document categorization performance is reached.
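As one way to picture the tracked performance metrics, the sketch below computes precision, recall, and F1 for a single category from a reviewed sample and compares them with per-metric limits. The abstract does not say which metrics are used, so the choice of metrics, the function names `performance_metrics` and `meets_first_threshold`, and the sample data are assumptions made for illustration.

```python
from typing import Dict


def performance_metrics(
    predicted: Dict[str, str],
    reviewed: Dict[str, str],
    positive: str,
) -> Dict[str, float]:
    """Precision, recall, and F1 for one category over documents present in both maps."""
    shared = predicted.keys() & reviewed.keys()
    tp = sum(1 for d in shared if predicted[d] == positive and reviewed[d] == positive)
    fp = sum(1 for d in shared if predicted[d] == positive and reviewed[d] != positive)
    fn = sum(1 for d in shared if predicted[d] != positive and reviewed[d] == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def meets_first_threshold(metrics: Dict[str, float], limits: Dict[str, float]) -> bool:
    """True only if every named metric is greater than its limit."""
    return all(metrics[name] > limit for name, limit in limits.items())


# Hypothetical categorizer output and reviewer decisions for four documents.
predicted = {"a": "relevant", "b": "relevant", "c": "other", "d": "other"}
reviewed = {"a": "relevant", "b": "other", "c": "relevant", "d": "other"}

metrics = performance_metrics(predicted, reviewed, positive="relevant")
print(metrics)
print(meets_first_threshold(metrics, {"recall": 0.8}))
```

Here one true positive, one false positive, and one false negative give a precision and recall of 0.5 each, so a recall limit of 0.8 is not met and another seed set would be requested.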
20 Claims
1. A system for categorizing electronic documents, comprising:
a memory device that stores a set of instructions;
at least one processor that executes the instructions to:
receive categorizations for electronic documents included in a first seed set, the electronic documents in the first seed set being selected among a corpus of electronic documents;
train a document categorizer on the categorizations using a machine learning algorithm;
categorize the remaining electronic documents in the corpus using the trained document categorizer;
compare one or more metrics associated with performance of the trained document categorizer to a first threshold associated with performance of the trained document categorizer, the one or more metrics being determined based on the categorizations;
in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
analyze a portion of electronic documents among the corpus different from the electronic documents included in the first seed set to identify one or more electronic documents of the portion that have been assigned respective categorization metrics satisfying or not satisfying a second threshold, wherein the second threshold is associated with categorizations metrics applicable to individual electronic documents;
designate the one or more electronic documents of the portion as a second seed set; and
provide the second seed set for categorization;
receive categorizations for the electronic documents included in the second seed set;
retrain the document categorizer on the categorized electronic documents included in the second seed set using the machine learning algorithm;
re-categorize the remaining electronic documents in the corpus using the retrained document categorizer;
compare one or more metrics associated with performance of the retrained document categorizer to the first threshold, the one or more metrics being determined based on the re-categorizations of the remaining electronic documents; and
iterate through generating seed sets, retraining the document categorizer, and re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer until the one or more metrics associated with performance of the retrained document categorizer are greater than the first threshold.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A computer-implemented method for categorizing electronic documents, comprising:
receiving categorizations for electronic documents included in a first seed set, the electronic documents in the first seed set being selected among a corpus of electronic documents;
training a document categorizer on the categorizations using a machine learning algorithm;
categorizing the remaining electronic documents in the corpus using the trained document categorizer;
comparing one or more metrics associated with performance of the trained document categorizer to a first threshold associated with performance of the trained document categorizer, the one or more metrics being determined based on the categorizations;
in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
analyzing a portion of electronic documents among the corpus different from the electronic documents included in the first seed set to identify one or more electronic documents of the portion that have been assigned respective categorization metrics satisfying or not satisfying a second threshold, wherein the second threshold is associated with categorizations metrics applicable to individual electronic documents;
designating the one or more electronic documents of the portion as a second seed set; and
providing the second seed set for categorization;
receiving categorizations for the electronic documents included in the second seed set;
retraining the document categorizer on the categorized electronic documents included in the second seed set using the machine learning algorithm;
re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer;
comparing one or more metrics associated with performance of the retrained document categorizer to the first threshold, the one or more metrics being determined based on the re-categorizations of the remaining electronic documents; and
iterating through generating seed sets, retraining the document categorizer, and re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer until the one or more metrics associated with performance of the retrained document categorizer are greater than the first threshold.
Dependent claims: 11, 12, 13, 14, 15, 16, 17
18. A non-transitory computer-readable medium storing a set of instructions that, when executed by one or more processors, cause the one or more processors to perform a method of categorizing electronic documents, the method comprising:
receiving categorizations for electronic documents included in a first seed set, the electronic documents in the first seed set being selected among a corpus of electronic documents;
training a document categorizer on the categorizations using a machine learning algorithm;
categorizing the remaining electronic documents in the corpus using the trained document categorizer;
comparing one or more metrics associated with performance of the trained document categorizer to a first threshold associated with performance of the trained document categorizer, the one or more metrics being determined based on the categorizations;
in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
analyzing a portion of electronic documents among the corpus different from the electronic documents included in the first seed set to identify one or more electronic documents of the portion that have been assigned respective categorization metrics satisfying or not satisfying a second threshold, wherein the second threshold is associated with categorizations metrics applicable to individual electronic documents;
designating the one or more electronic documents of the portion as a second seed set; and
providing the second seed set for categorization;
receiving categorizations for the electronic documents included in the second seed set;
retraining the document categorizer on the categorized electronic documents included in the second seed set using the machine learning algorithm;
re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer;
comparing one or more metrics associated with performance of the retrained document categorizer to the first threshold, the one or more metrics being determined based on the re-categorizations of the remaining electronic documents; and
iterating through generating seed sets, retraining the document categorizer, and re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer until the one or more metrics associated with performance of the retrained document categorizer are greater than the first threshold.
Dependent claims: 19, 20
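The step shared by the independent claims of designating a second seed set from per-document categorization metrics might look like the sketch below. The claims do not define the per-document metric or what "satisfying" means, so this sketch assumes a confidence score between 0 and 1, treats a score at or above the second threshold as satisfying it, and uses a hypothetical function name `select_second_seed` with an optional size `limit`.

```python
from typing import Dict, Iterable, List, Optional


def select_second_seed(
    scores: Dict[str, float],
    first_seed: Iterable[str],
    second_threshold: float,
    pick_below: bool = True,
    limit: Optional[int] = None,
) -> List[str]:
    """Pick documents outside the first seed set by comparing scores to a threshold.

    pick_below=True selects documents NOT satisfying the threshold (score below it),
    ordered least confident first. pick_below=False selects documents satisfying it
    (score at or above it), ordered most confident first.
    """
    excluded = set(first_seed)
    candidates = [
        (doc, score)
        for doc, score in scores.items()
        if doc not in excluded
        and ((score < second_threshold) if pick_below else (score >= second_threshold))
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=not pick_below)
    chosen = [doc for doc, _ in candidates]
    return chosen if limit is None else chosen[:limit]


# Hypothetical confidence scores assigned by a trained categorizer.
scores = {"d1": 0.97, "d2": 0.52, "d3": 0.61, "d4": 0.88, "d5": 0.55}

uncertain = select_second_seed(scores, first_seed=["d5"], second_threshold=0.7)
confident = select_second_seed(
    scores, first_seed=["d5"], second_threshold=0.7, pick_below=False
)
print(uncertain)
print(confident)
```

The `pick_below` flag mirrors the claim wording "satisfying or not satisfying a second threshold": low-scoring documents are the ones the categorizer is least sure about, while high-scoring ones can be used to check confident predictions.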
Specification