Systems and methods for identifying and categorizing electronic documents through machine learning

  • US 9,514,414 B1
  • Filed: 04/01/2016
  • Issued: 12/06/2016
  • Est. Priority Date: 12/11/2015
  • Status: Active Grant
First Claim

1. A system for categorizing electronic documents, comprising:

  • a memory device that stores a set of instructions;

    at least one processor that executes the instructions to:

    receive categorizations for electronic documents included in a first seed set, the electronic documents in the first seed set being selected among a corpus of electronic documents;

    train a document categorizer on the categorizations using a machine learning algorithm;

    categorize the remaining electronic documents in the corpus using the trained document categorizer;

    compare one or more metrics associated with performance of the trained document categorizer to a first threshold associated with performance of the trained document categorizer, the one or more metrics being determined based on the categorizations;

    in response to determining that the one or more metrics associated with performance of the trained document categorizer do not satisfy the first threshold, automatically:

    analyze a portion of electronic documents among the corpus different from the electronic documents included in the first seed set to identify one or more electronic documents of the portion that have been assigned respective categorization metrics satisfying or not satisfying a second threshold, wherein the second threshold is associated with categorization metrics applicable to individual electronic documents;

    designate the one or more electronic documents of the portion as a second seed set; and

    provide the second seed set for categorization;

    receive categorizations for the electronic documents included in the second seed set;

    retrain the document categorizer on the categorized electronic documents included in the second seed set using the machine learning algorithm;

    re-categorize the remaining electronic documents in the corpus using the retrained document categorizer;

    compare one or more metrics associated with performance of the retrained document categorizer to the first threshold, the one or more metrics being determined based on the re-categorizations of the remaining electronic documents; and

    iterate through generating seed sets, retraining the document categorizer, and re-categorizing the remaining electronic documents in the corpus using the retrained document categorizer until the one or more metrics associated with performance of the retrained document categorizer are greater than the first threshold.
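
The loop recited in claim 1 can be illustrated with a minimal sketch. This is not the patented implementation: the claim does not name a machine learning algorithm or specific metrics, so the sketch assumes a tiny naive Bayes categorizer, uses mean prediction confidence over the remaining documents as the performance metric, and builds each new seed set from the documents whose confidence falls below a second threshold. The names `train`, `predict`, `run_seed_loop`, `oracle`, `perf_threshold`, `doc_threshold`, `batch`, and `max_rounds` are all hypothetical.

```python
import math
from collections import Counter

def train(labeled):
    """Fit a tiny multinomial naive Bayes model from a {text: label} dict."""
    counts, priors, vocab = {}, Counter(), set()
    for text, label in labeled.items():
        words = text.lower().split()
        counts.setdefault(label, Counter()).update(words)
        priors[label] += 1
        vocab.update(words)
    return counts, priors, vocab

def predict(model, text):
    """Return (label, confidence); confidence is the top label's posterior."""
    counts, priors, vocab = model
    total = sum(priors.values())
    scores = {}
    for label, word_counts in counts.items():
        n = sum(word_counts.values())
        score = math.log(priors[label] / total)
        for w in text.lower().split():
            # Add-one smoothing, with one extra slot for unseen words.
            score += math.log((word_counts[w] + 1) / (n + len(vocab) + 1))
        scores[label] = score
    top = max(scores.values())
    exp = {k: math.exp(v - top) for k, v in scores.items()}
    best = max(exp, key=exp.get)
    return best, exp[best] / sum(exp.values())

def run_seed_loop(corpus, oracle, first_seed, perf_threshold=0.8,
                  doc_threshold=0.7, batch=2, max_rounds=10):
    """Iterate seed sets until the (assumed) metric exceeds perf_threshold.

    `oracle` stands in for whoever supplies categorizations (e.g. a human
    reviewer). Documents are assumed to be unique strings.
    """
    # Receive categorizations for the first seed set.
    labeled = {doc: oracle(doc) for doc in first_seed}
    for round_no in range(max_rounds + 1):
        # (Re)train, then (re)categorize the remaining documents.
        model = train(labeled)
        preds = {d: predict(model, d) for d in corpus if d not in labeled}
        # Assumed performance metric: mean confidence over remaining docs.
        metric = (sum(c for _, c in preds.values()) / len(preds)
                  if preds else 1.0)
        if metric > perf_threshold or round_no == max_rounds:
            break
        # Next seed set: least confident docs below the per-document threshold.
        uncertain = sorted((d for d in preds if preds[d][1] < doc_threshold),
                           key=lambda d: preds[d][1])[:batch]
        if not uncertain:
            break
        labeled.update({d: oracle(d) for d in uncertain})
    return labeled, preds, metric

corpus = [
    "invoice payment due", "payment received invoice",
    "meeting agenda notes", "notes from team meeting",
    "invoice overdue notice", "agenda for team offsite",
]
oracle = lambda d: "finance" if ("invoice" in d or "payment" in d) else "meeting"
labeled, preds, metric = run_seed_loop(corpus, oracle, [corpus[0], corpus[2]])
print(len(labeled), "labeled;", len(preds), "auto-categorized; metric:", metric)
```

The `max_rounds` cap is a safeguard added for the sketch; the claim itself iterates until the metrics exceed the first threshold. Selecting low-confidence documents is only one reading of "satisfying or not satisfying a second threshold"; a variant could equally pick documents above it.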
