Systems and methods for automatically configuring training data for training machine learning models of a machine learning-based dialogue system including seeding training samples or curating a corpus of training data based on instances of training data identified as anomalous

  • US 10,679,150 B1
  • Filed: 11/20/2019
  • Issued: 06/09/2020
  • Est. Priority Date: 12/13/2018
  • Status: Active Grant
First Claim

1. A system for identifying anomalous training data samples and intelligently forming a corpus of training data for improving a machine learning-based dialogue system, the system comprising:

  • a machine learning-based automated dialogue service implemented by one or more hardware computing servers that:

    sources a corpus of raw machine learning training data from one or more sources of training data based on a seeding sample set that includes a plurality of seed training samples;

    generates a vector representation for each instance of training data in the corpus of raw machine learning training data;

    identifies statistical characteristics of the corpus of raw machine learning training data based on a mapping of the vector representation for each instance of training data within the corpus of raw machine learning training data;

    identifies a density of the plurality of distinct instances of training data based on the mapping of the vector representation for each distinct instance of training data within the corpus, wherein the density of the plurality of distinct instances relates to a cluster or a grouping of distinct instances of training data of the corpus of raw machine learning training data in which each distinct instance of training data is within a predetermined distance of another distinct instance of training data within the cluster or the grouping;

    sets an anomaly threshold based on identifying an absolute distance value away from a centroid of the density of the plurality of distinct instances, wherein a distal end of the absolute distance falls along an area beyond the density of the plurality of distinct instances;

    identifies, as anomalous instances, each of one or more instances of training data of the corpus of raw machine learning training data based on the identified statistical characteristics, wherein identifying the one or more anomalous instances includes:

    identifying a given distinct instance as one of the one or more anomalous instances if a distance value for the given distinct instance away from a centroid of the density satisfies or exceeds the anomaly threshold; and

    curates the corpus of raw machine learning training data based on each of the one or more instances of training data identified as anomalous instances.
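
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation; it only shows one way the recited elements could fit together under several assumptions: the vector representations are already computed (the toy 2-D points below stand in for embeddings), distance is Euclidean, the "density" is approximated as the set of instances that have at least one neighbor within a predetermined distance, the anomaly threshold is the farthest dense member's distance from the centroid plus a fixed margin, and "curating" means removing the anomalous instances. The names `dense_group`, `centroid`, `curate`, `radius`, and `margin` are hypothetical and do not appear in the patent.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dense_group(vectors: List[Vector], radius: float) -> List[int]:
    """Indices of vectors with at least one other vector within `radius`.

    Assumption: a single dense grouping; a corpus with several clusters
    would need a real clustering step instead of this neighbor test.
    """
    return [
        i for i, v in enumerate(vectors)
        if any(i != j and math.dist(v, w) <= radius
               for j, w in enumerate(vectors))
    ]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of the given vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def curate(samples: List[str], vectors: List[Vector],
           radius: float, margin: float) -> Tuple[List[str], List[str]]:
    """Split samples into (kept, anomalous) by distance from the centroid."""
    group = dense_group(vectors, radius)
    if not group:
        raise ValueError("no dense grouping found for the given radius")
    center = centroid([vectors[i] for i in group])

    # Assumed threshold rule: an absolute distance whose far end lies
    # just beyond the dense grouping (farthest dense member + margin).
    threshold = max(math.dist(vectors[i], center) for i in group) + margin

    kept, anomalous = [], []
    for sample, vector in zip(samples, vectors):
        # "Satisfies or exceeds" the threshold is read here as >=.
        if math.dist(vector, center) >= threshold:
            anomalous.append(sample)
        else:
            kept.append(sample)
    return kept, anomalous


if __name__ == "__main__":
    samples = ["a", "b", "c", "d", "outlier"]
    vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
    kept, anomalous = curate(samples, vectors, radius=1.5, margin=0.5)
    print("kept:", kept)
    print("anomalous:", anomalous)
```

In this toy run the four points near the origin form the dense grouping, and the distant point falls beyond the threshold and is dropped from the curated set. The claim itself only says that the corpus is curated "based on" the anomalous instances, so removal is one possible reading rather than the only one.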
