Systems and methods for identifying and categorizing electronic documents through machine learning

US 9,514,414 B1
Filed: 04/01/2016
Issued: 12/06/2016
Est. Priority Date: 12/11/2015
Status: Active Grant

Abstract

Computer-implemented systems and methods are disclosed for identifying and categorizing electronic documents through machine learning. In accordance with some embodiments, a seed set of categorized electronic documents may be used to train a document categorizer based on a machine learning algorithm. The trained document categorizer may categorize electronic documents in a large corpus of electronic documents. Performance metrics associated with performance of the trained document categorizer may be tracked, and additional seed sets of categorized electronic documents may be used to improve the performance of the document categorizer by retraining the document categorizer on subsequent seed sets. Additional seed sets and categorizations may be iterated through until a desired document categorization performance is reached.
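The iterative loop summarized in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the patent: the word-vote scorer in `train` and `score` stands in for an unspecified machine learning algorithm, `reviewer` stands in for the human who supplies categorizations, and the performance metric is taken to be the fraction of remaining documents categorized with at least a threshold confidence.

```python
from collections import Counter

def train(labeled):
    """Toy categorizer (assumption): each word votes +1 per relevant, -1 per not-relevant document."""
    weights = Counter()
    for text, relevant in labeled.items():
        for word in set(text.split()):
            weights[word] += 1 if relevant else -1
    return weights

def score(weights, text):
    """Average word vote; the sign gives the category, the magnitude a rough confidence."""
    words = text.split()
    return sum(weights[w] for w in words) / max(len(words), 1)

def categorize_corpus(corpus, reviewer, seed_size=2, batch_size=2,
                      first_threshold=0.9, second_threshold=0.5, max_rounds=10):
    # First seed set: categorizations received from the (hypothetical) reviewer.
    labeled = {doc: reviewer(doc) for doc in corpus[:seed_size]}
    for _ in range(max_rounds):
        weights = train(labeled)                       # train / retrain
        remaining = [d for d in corpus if d not in labeled]
        scores = {d: score(weights, d) for d in remaining}
        # Per-document metric compared against the second threshold.
        uncertain = [d for d in remaining if abs(scores[d]) < second_threshold]
        # Performance metric (assumption): share of confidently categorized documents.
        metric = 1.0 - len(uncertain) / len(remaining) if remaining else 1.0
        if metric > first_threshold:
            break
        # Next seed set: the least confident documents go back to the reviewer.
        uncertain.sort(key=lambda d: abs(scores[d]))
        for doc in uncertain[:batch_size]:
            labeled[doc] = reviewer(doc)
    categories = {d: scores[d] > 0 for d in remaining if d not in labeled}
    return categories, labeled, metric

# Hypothetical corpus and reviewer, for demonstration only.
LEGAL_WORDS = {"contract", "breach", "damages", "claim"}

def reviewer(doc):
    return bool(LEGAL_WORDS & set(doc.split()))

corpus = ["contract breach damages", "lunch menu today", "breach of contract claim",
          "team lunch friday", "damages claim filed", "menu for friday"]
categories, labeled, metric = categorize_corpus(corpus, reviewer)
```

In this toy run the first seed set leaves three of the four remaining documents below the per-document threshold, so two of them are sent for review; after one retraining pass the remaining documents are all categorized confidently and the loop stops.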

20 Claims

1. A system for categorizing electronic documents, comprising:
- a memory device that stores a set of instructions;
  
  at least one processor that executes the instructions to:
  
  receive categorizations for electronic documents included in a first seed set, the electronic documents in the first seed set being selected among a corpus of electronic documents;
  
  train a document categorizer on the categorizations using a machine learning algorithm;
  
  categorize the remaining electronic documents in the corpus using the trained document categorizer;
  
  compare one or more metrics associated with performance of the trained document categorizer to a first threshold associated with performance of the trained document categorizer, the one or more metrics being determined based on the categorizations;
  
  in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
  
  analyze a portion of electronic documents among the corpus different from the electronic documents included in the first seed set to identify one or more electronic documents of the portion that have been assigned respective categorization metrics satisfying or not satisfying a second threshold, wherein the second threshold is associated with categorizations metrics applicable to individual electronic documents;
  
  designate the one or more electronic documents of the portion as a second seed set; and
  
  provide the second seed set for categorization;
  
  receive categorizations for the electronic documents included in the second seed set;
  
  retrain the document categorizer on the categorized electronic documents included in the second seed set using the machine learning algorithm;
  
  re-categorize the remaining electronic documents in the corpus using the retrained document categorizer;
  
  compare one or more metrics associated with performance of the retrained document categorizer to the first threshold, the one or more metrics being determined based on the re-categorizations of the remaining electronic documents; and
  
  iterate through generating seed sets, retraining the document categorizer, and re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer until the one or more metrics associated with performance of the retrained document categorizer are greater than the first threshold.
- - 2. The system of claim 1, wherein the categorizations include at least one of:
    - relevant or not relevant.
  - 3. The system of claim 1, wherein the categorizations include at least one of:
    - confidential, not confidential, privileged, or not privileged.
  - 4. The system of claim 1, wherein the second seed set is selected based on the one or more metrics including a number of electronic documents of the corpus assigned a categorization with a threshold level confidence.
  - 5. The system of claim 1, wherein a categorization metric comprises an importance weight, and wherein the at least one processor executes the instructions to further:
    - further in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
      
      assign respective importance weights to the electronic documents, the importance weights indicating an importance of the categorization of the respective electronic documents to the performance of the document categorizer.
  - 6. The system of claim 5, wherein the respective importance weights are determined based on a number of electronic documents in the corpus that share similar characteristics with the respective electronic documents.
  - 7. The system of claim 5, wherein a categorization metric comprises a confidence modifier, and wherein the at least one processor executes the instructions to further:
    - further in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
      
      assign respective confidence modifiers to the electronic documents, the confidence modifiers indicating a confidence that the categorization of the respective electronic documents assigned by the document categorizer is the correct categorization.
  - 8. The system of claim 1, wherein the machine learning algorithm includes an importance weighted active learning algorithm.
  - 9. The system of claim 1, wherein the first seed set includes an electronic document model created by a user.
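Claims 4 through 7 describe per-document categorization metrics (confidence and importance weights) that drive selection of the second seed set. The sketch below is one possible reading, not the patented implementation: it assumes that "similar characteristics" means Jaccard overlap of word sets, and the names `jaccard`, `importance_weights`, `select_seed_set`, and `min_similarity` are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two documents (assumed similarity measure)."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def importance_weights(candidates, corpus, min_similarity=0.3):
    """Importance weight = number of other corpus documents that look similar."""
    return {
        doc: sum(1 for other in corpus
                 if other != doc and jaccard(doc, other) >= min_similarity)
        for doc in candidates
    }

def select_seed_set(confidences, corpus, second_threshold=0.6, size=1):
    """Pick low-confidence documents, preferring those that represent many others."""
    uncertain = [d for d, c in confidences.items() if c < second_threshold]
    weights = importance_weights(uncertain, corpus)
    ranked = sorted(uncertain, key=lambda d: (-weights[d], confidences[d]))
    return ranked[:size], weights

# Hypothetical corpus and categorizer confidences, for demonstration only.
corpus = ["invoice march payment", "invoice april payment", "invoice may payment",
          "holiday party invite", "quarterly audit report"]
confidences = {"invoice march payment": 0.55,
               "holiday party invite": 0.40,
               "quarterly audit report": 0.95}
seed, weights = select_seed_set(confidences, corpus)
```

Here the invoice document is chosen over the lower-confidence party invite because two other documents in the corpus resemble it, so a reviewer's categorization of it is likely to inform more of the corpus.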

10. A computer-implemented method for categorizing electronic documents, comprising:
- receiving categorizations for electronic documents included in a first seed set, the electronic documents in the first seed set being selected among a corpus of electronic documents;
  
  training a document categorizer on the categorizations using a machine learning algorithm;
  
  categorizing the remaining electronic documents in the corpus using the trained document categorizer;
  
  comparing one or more metrics associated with performance of the trained document categorizer to a first threshold associated with performance of the trained document categorizer, the one or more metrics being determined based on the categorizations;
  
  in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
  
  analyzing a portion of electronic documents among the corpus different from the electronic documents included in the first seed set to identify one or more electronic documents of the portion that have been assigned respective categorization metrics satisfying or not satisfying a second threshold, wherein the second threshold is associated with categorizations metrics applicable to individual electronic documents;
  
  designating the one or more electronic documents of the portion as a second seed set; and
  
  providing the second seed set for categorization;
  
  receiving categorizations for the electronic documents included in the second seed set;
  
  retraining the document categorizer on the categorized electronic documents included in the second seed set using the machine learning algorithm;
  
  re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer;
  
  comparing one or more metrics associated with performance of the retrained document categorizer to the first threshold, the one or more metrics being determined based on the re-categorizations of the remaining electronic documents; and
  
  iterating through generating seed sets, retraining the document categorizer, and re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer until the one or more metrics associated with performance of the retrained document categorizer are greater than the first threshold.
- - 11. The method of claim 10, wherein the categorizations include at least one of:
    - relevant or not relevant.
  - 12. The method of claim 10, wherein the categorizations include at least one of:
    - confidential, not confidential, privileged, or not privileged.
  - 13. The method of claim 10, wherein the second seed set is selected based on the one or more metrics including a number of electronic documents of the corpus assigned a categorization with a threshold level confidence.
  - 14. The method of claim 10, wherein a categorization metric comprises an importance weight, and wherein the method further comprises:
    - further in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
      
      assigning respective importance weights to the electronic documents, the importance weights indicating an importance of the categorization of the respective electronic documents to the performance of the document categorizer.
  - 15. The method of claim 14, wherein the respective importance weights are determined based on a number of electronic documents in the corpus that share similar characteristics with the respective electronic documents.
  - 16. The method of claim 14, wherein a categorization metric comprises a confidence modifier, and wherein the method further comprises:
    - further in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
      
      assigning respective confidence modifiers to the electronic documents, the confidence modifiers indicating a confidence that the categorization of the respective electronic documents assigned by the document categorizer is the correct categorization.
  - 17. The method of claim 10, wherein the machine learning algorithm includes an importance weighted active learning algorithm.
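Claims 8 and 17 name an importance weighted active learning algorithm without detailing it. As a rough illustration of the general idea only, the sketch below queries each document for review with some probability p and, when queried, attaches a weight of 1/p so that the weighted sample remains representative of the corpus. The mapping from confidence to p, the floor `p_min`, and both function names are assumptions for illustration; published variants of the technique derive p differently.

```python
import random

def query_probability(confidence, p_min=0.1):
    """Assumed rule: confidence 0.5 (coin flip) always queries; confidence 1.0 queries rarely."""
    uncertainty = 1.0 - abs(2.0 * confidence - 1.0)
    return max(p_min, uncertainty)

def sample_for_labeling(confidences, rng, p_min=0.1):
    """Return {document: importance weight} for the documents chosen for review."""
    selected = {}
    for doc, confidence in confidences.items():
        p = query_probability(confidence, p_min)
        if rng.random() < p:           # query this document with probability p
            selected[doc] = 1.0 / p    # weight by 1/p to offset the sampling bias
    return selected

# Hypothetical confidences for a binary categorizer, values in [0.5, 1.0].
confidences = {"doc_a": 0.5, "doc_b": 0.75, "doc_c": 1.0}
selected = sample_for_labeling(confidences, random.Random(0))
```

A document the categorizer is unsure about is always sent for review with weight 1, while a confidently categorized document is reviewed only occasionally but counts ten times as much when it is.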

18. A non-transitory computer-readable medium storing a set of instructions that, when executed by one or more processors, cause the one or more processors to perform a method of categorizing electronic documents, the method comprising:
- receiving categorizations for electronic documents included in a first seed set, the electronic documents in the first seed set being selected among a corpus of electronic documents;
  
  training a document categorizer on the categorizations using a machine learning algorithm;
  
  categorizing the remaining electronic documents in the corpus using the trained document categorizer;
  
  comparing one or more metrics associated with performance of the trained document categorizer to a first threshold associated with performance of the trained document categorizer, the one or more metrics being determined based on the categorizations;
  
  in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
  
  analyzing a portion of electronic documents among the corpus different from the electronic documents included in the first seed set to identify one or more electronic documents of the portion that have been assigned respective categorization metrics satisfying or not satisfying a second threshold, wherein the second threshold is associated with categorizations metrics applicable to individual electronic documents;
  
  designating the one or more electronic documents of the portion as a second seed set; and
  
  providing the second seed set for categorization;
  
  receiving categorizations for the electronic documents included in the second seed set;
  
  retraining the document categorizer on the categorized electronic documents included in the second seed set using the machine learning algorithm;
  
  re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer;
  
  comparing one or more metrics associated with performance of the retrained document categorizer to the first threshold, the one or more metrics being determined based on the re-categorizations of the remaining electronic documents; and
  
  iterating through generating seed sets, retraining the document categorizer, and re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer until the one or more metrics associated with performance of the retrained document categorizer are greater than the first threshold.
- - 19. The non-transitory computer-readable medium of claim 18, wherein a categorization metric comprises an importance weight, and wherein the method further comprises:
    - further in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:
      
      assigning respective importance weights to the electronic documents, the importance weights indicating an importance of the categorization of the respective electronic documents to the performance of the document categorizer.
  - 20. The non-transitory computer-readable medium of claim 19, wherein the respective importance weights are determined based on a number of electronic documents in the corpus that share similar characteristics with the respective electronic documents.

Current Assignee: Palantir Technologies Incorporated
Original Assignee: Palantir Technologies Incorporated
Inventors: Rosswog, James; Gerhardt, Matthew; Raboin, Eric; Grossman, Jack; Simons, Kevin; Levan, Matthew; Klein, Nathaniel; Beiermeister, Ryan; O'Brien, Tim; Erenrich, Daniel; Bogomolov, Arseny; Bills, Cooper; Anderson, Eric
Primary Examiner(s): Chaki, Kakali
Assistant Examiner(s): Sitiriche, Luis
Application Number: US15/088,481
Time in Patent Office: 249 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/285   Clustering or classification
G06F 16/93   Document management systems
G06N 20/00   Machine learning
G06N 5/04   Inference or reasoning models
G06N 7/01   Probabilistic graphical mod...
