Systems and methods for automatically configuring training data for training machine learning models of a machine learning-based dialogue system including seeding training samples or curating a corpus of training data based on instances of training data identified as anomalous
First Claim
1. A system for identifying anomalous training data samples and intelligently forming a corpus of training data for improving a machine learning-based dialogue system, the system comprising:
- a machine learning-based automated dialogue service implemented by one or more hardware computing servers that:
sources a corpus of raw machine learning training data from one or more sources of training data based on a seeding sample set that includes a plurality of seed training samples;
generates a vector representation for each instance of training data in the corpus of raw machine learning training data;
identifies statistical characteristics of the corpus of raw machine learning training data based on a mapping of the vector representation for each instance of training data within the corpus of raw machine learning training data;
identifies a density of the plurality of distinct instances of training data based on the mapping of the vector representation for each distinct instance of training data within the corpus, wherein the density of the plurality of distinct instances relates to a cluster or a grouping of distinct instances of training data of the corpus of raw machine learning training data in which each distinct instance of training data is within a predetermined distance of another distinct instance of training data within the cluster or the grouping;
sets an anomaly threshold based on identifying an absolute distance value away from a centroid of the density of the plurality of distinct instances, wherein a distal end of the absolute distance falls along an area beyond the density of the plurality of distinct instances;
identifies, as anomalous instances, each of one or more instances of training data of the corpus of raw machine learning training data based on the identified statistical characteristics, wherein identifying the one or more anomalous instances includes:
identifying a given distinct instance as one of the one or more anomalous instances if a distance value for the given distinct instance away from a centroid of the density satisfies or exceeds the anomaly threshold; and
curates the corpus of raw machine learning training data based on each of the one or more instances of training data identified as anomalous instances.
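The threshold step recited above can be illustrated with a minimal sketch. It is not the claimed implementation; it assumes Euclidean distance, assumes the dense group has already been found, and treats curation as simple removal, which is only one possible reading. The names `centroid`, `flag_anomalies`, `curate`, and `dense_indices` are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def flag_anomalies(vectors, dense_indices, threshold):
    """Return indices of instances whose distance from the centroid of the
    dense group satisfies or exceeds the threshold.

    Assumption: `threshold` is an absolute distance chosen so that it falls
    beyond the dense group; Euclidean distance is an illustrative choice.
    """
    center = centroid([vectors[i] for i in dense_indices])
    return [i for i, v in enumerate(vectors) if math.dist(v, center) >= threshold]

def curate(samples, vectors, dense_indices, threshold):
    """Split samples into (kept, flagged) using the anomaly flags.

    Assumption: curation here simply sets flagged samples aside; the source
    does not limit curation to removal.
    """
    flagged = set(flag_anomalies(vectors, dense_indices, threshold))
    kept = [s for i, s in enumerate(samples) if i not in flagged]
    removed = [s for i, s in enumerate(samples) if i in flagged]
    return kept, removed
```

With four points near the origin forming the dense group and one far-away point, a threshold larger than the group's radius flags only the far-away point.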
Abstract
A system and method for improving a machine learning-based dialogue system includes: sourcing a corpus of raw machine learning training data from sources of training data based on a plurality of seed training samples, wherein the corpus of raw machine learning training data comprises a plurality of distinct instances of training data; generating a vector representation for each distinct instance of training data; identifying statistical characteristics of the corpus of raw machine learning training data based on a mapping of the vector representation for each distinct instance of training data; identifying anomalous instances of the plurality of distinct instances of training data of the corpus of raw machine learning training data based on the identified statistical characteristics of the corpus; and curating the corpus of raw machine learning training data based on each of the instances of training data identified as anomalous instances.
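To make the vector representation and statistical characteristics steps concrete, the following is a rough sketch under stated assumptions. The source does not specify an embedding method or which statistics are used; a bag-of-words count vector and the mean and standard deviation of centroid distances are stand-ins chosen for illustration. The names `bag_of_words` and `distance_stats` are hypothetical.

```python
import math
import statistics
from collections import Counter

def bag_of_words(texts):
    """Map each text to a word-count vector over a shared, sorted vocabulary.

    Assumption: a real system would likely use a learned embedding; simple
    counts are used here only to keep the sketch self-contained.
    """
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vocab, vectors

def distance_stats(vectors):
    """Summarize how far each vector lies from the corpus centroid."""
    n = len(vectors)
    center = [sum(col) / n for col in zip(*vectors)]
    dists = [math.dist(v, center) for v in vectors]
    return {
        "distances": dists,
        "mean": statistics.fmean(dists),
        "stdev": statistics.pstdev(dists),
    }
```

For a small corpus of two banking utterances and one unrelated utterance, the unrelated utterance ends up with the largest centroid distance, which is the kind of signal an anomaly threshold could act on.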
19 Claims
3. A method for identifying anomalous training data samples and intelligently forming a corpus of training data for improving a machine learning-based dialogue system, the method comprising:
sourcing a corpus of raw machine learning training data from one or more sources of training data based on a plurality of seed training samples, wherein the corpus of raw machine learning training data comprises a plurality of distinct instances of training data;
generating a vector representation for each distinct instance of training data in the corpus of raw machine learning training data;
identifying statistical characteristics of the corpus of raw machine learning training data based on a mapping of the vector representation for each distinct instance of training data within the corpus of raw machine learning training data;
identifying a density of the plurality of distinct instances of training data based on the mapping of the vector representation for each distinct instance of training data within the corpus, wherein the density of the plurality of distinct instances relates to a cluster or a grouping of distinct instances of training data of the corpus of raw machine learning training data in which each distinct instance of training data is within a predetermined distance of another distinct instance of training data within the cluster or the grouping;
setting an anomaly threshold based on identifying an absolute distance value away from a centroid of the density of the plurality of distinct instances, wherein a distal end of the absolute distance falls along an area beyond the density of the plurality of distinct instances;
identifying one or more anomalous instances of the plurality of distinct instances of training data of the corpus of raw machine learning training data based on the identified statistical characteristics of the corpus, wherein identifying the one or more anomalous instances includes:
identifying a given distinct instance as one of the one or more anomalous instances if a distance value for the given distinct instance away from a centroid of the density satisfies or exceeds the anomaly threshold; and
curating the corpus of raw machine learning training data based on each of the one or more instances of training data identified as anomalous instances.
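The density limitation, a grouping in which each instance is within a predetermined distance of another instance in the same grouping, can be read as connected components under a distance cutoff. The sketch below takes that reading as an assumption; the claim does not name a clustering algorithm. Euclidean distance and the names `density_groups` and `max_dist` are hypothetical.

```python
import math
from collections import deque

def density_groups(vectors, max_dist):
    """Group vector indices so that every member of a group is within
    `max_dist` of at least one other member of the same group.

    Assumption: groups are built by breadth-first expansion from an arbitrary
    unvisited point. Groups are returned largest first; a group of size one
    is an isolated point rather than a density.
    """
    unvisited = set(range(len(vectors)))
    groups = []
    while unvisited:
        start = unvisited.pop()
        group, queue = [start], deque([start])
        while queue:
            i = queue.popleft()
            # Collect neighbors first so the set is not mutated while iterating.
            near = [j for j in unvisited
                    if math.dist(vectors[i], vectors[j]) <= max_dist]
            for j in near:
                unvisited.remove(j)
                group.append(j)
                queue.append(j)
        groups.append(sorted(group))
    return sorted(groups, key=len, reverse=True)
```

The largest returned group can serve as the density whose centroid anchors the anomaly threshold, while isolated points are natural candidates for the distance check.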
Specification