Systems and methods for automatically configuring training data for training machine learning models of a machine learning-based dialogue system including seeding training samples or curating a corpus of training data based on instances of training data identified as anomalous

US 10,679,150 B1
Filed: 11/20/2019
Issued: 06/09/2020
Est. Priority Date: 12/13/2018
Status: Active Grant

Abstract

A system and method for improving a machine learning-based dialogue system includes: sourcing a corpus of raw machine learning training data from sources of training data based on a plurality of seed training samples, wherein the corpus of raw machine learning training data comprises a plurality of distinct instances of training data; generating a vector representation for each distinct instance of training data; identifying statistical characteristics of the corpus of raw machine learning training data based on a mapping of the vector representation for each distinct instance of training data; identifying anomalous instances of the plurality of distinct instances of training data of the corpus of raw machine learning training data based on the identified statistical characteristics of the corpus; and curating the corpus of raw machine learning training data based on each of the instances of training data identified as anomalous instances.
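
The flow summarized in the abstract can be pictured with a minimal sketch. It assumes the vector representations already exist as plain lists of floats, uses Euclidean distance, and takes the anomaly threshold as a caller-supplied number. The function names (`centroid`, `flag_anomalies`) and the toy vectors are illustrative only and do not come from the patent.

```python
import math

def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def flag_anomalies(vectors, threshold):
    """Return indices whose distance from the corpus centroid meets or exceeds threshold."""
    c = centroid(vectors)
    return [i for i, v in enumerate(vectors) if math.dist(c, v) >= threshold]

# Toy corpus: four nearby vectors and one far away (assumed values).
vectors = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2], [5.0, 5.0]]
anomalous = flag_anomalies(vectors, threshold=3.0)

# One possible curation step: drop every flagged instance.
curated = [v for i, v in enumerate(vectors) if i not in anomalous]
```

On this toy input only the last vector is flagged. Discarding flagged instances is just one way to curate; the claims also describe reusing anomalous instances as re-seeding samples.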

19 Claims

1. A system for identifying anomalous training data samples and intelligently forming a corpus of training data for improving a machine learning-based dialogue system, the system comprising:
- a machine learning-based automated dialogue service implementing by one or more hardware computing servers that:
  
  sources a corpus of raw machine learning training data from one or more sources of training data based on a seeding sample set that includes a plurality of seed training samples;
  
  generates a vector representation for each instance of training data in the corpus of raw machine learning training data;
  
  identifies statistical characteristics of the corpus of raw machine learning training data based on a mapping of the vector representation for each instance of training data within the corpus of raw machine learning training data;
  
  identifies a density of the plurality of distinct instances of training data based on the mapping of the vector representation for each distinct instance of training data within the corpus, wherein the density of the plurality of distinct instances relates to a cluster or a grouping of distinct instances of training data of the corpus of raw machine learning training data in which each distinct instance of training data is within a predetermined distance of another distinct instance of training data within the cluster or the grouping;
  
  sets an anomaly threshold based on identifying an absolute distance value away from a centroid of the density of the plurality of distinct instances, wherein a distal end of the absolute distance falls along an area beyond the density of the plurality of distinct instances;
  
  identifies, as anomalous instances, each of one or more instances of training data of the corpus of raw machine learning training data based on the identified statistical characteristics, wherein identifying the one or more anomalous instances includes:
  
  identifying a given distinct instance as one of the one or more anomalous instances if a distance value for the given distinct instance away from a centroid of the density satisfies or exceeds the anomaly threshold; and
  
  curates the corpus of raw machine learning training data based on each of the one or more instances of training data identified as anomalous instances.
- - 2. The system according to claim 1, wherein the machine learning-based automated dialogue service further:
    - defines a re-seeding sample set that includes a plurality of re-seeding training samples, wherein each of the plurality of re-seeding training samples comprises a distinct one of the one or more anomalous instances;
      
      sources a re-seeding corpus of raw machine learning training data from one or more sources of training data based on the re-seeding sample set comprising the plurality of re-seeding training samples based on the one or more anomalous instances; and
      
      constructs a joint corpus of training data that includes a synthesis of [i] the corpus of raw machine learning training data and [ii] the re-seeding corpus of raw machine learning training data.
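
The density and threshold limitations of claim 1 can be sketched as follows. This is one possible reading, not the patented implementation: the "density" is taken to be the largest group in which each instance lies within a predetermined distance (`max_gap`) of another member, and the anomaly threshold is the farthest reach of that group from its centroid plus a `margin`, so that the threshold ends beyond the density. The names `find_density`, `anomaly_threshold`, `max_gap` and `margin` are assumptions made for illustration.

```python
import math

def find_density(vectors, max_gap):
    """Indices of the largest group whose members are chained together by
    pairwise distances of at most max_gap (single-linkage grouping)."""
    unvisited = set(range(len(vectors)))
    best = []
    while unvisited:
        start = unvisited.pop()
        group, frontier = [start], [start]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(vectors[i], vectors[j]) <= max_gap]
            for j in near:
                unvisited.remove(j)
            group.extend(near)
            frontier.extend(near)
        if len(group) > len(best):
            best = group
    return sorted(best)

def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def anomaly_threshold(vectors, density, margin):
    """Centroid of the density and an absolute distance that ends beyond it."""
    c = centroid([vectors[i] for i in density])
    reach = max(math.dist(c, vectors[i]) for i in density)
    return c, reach + margin

def anomalous_instances(vectors, max_gap, margin):
    """Indices whose distance from the density centroid meets or exceeds the threshold."""
    density = find_density(vectors, max_gap)
    c, threshold = anomaly_threshold(vectors, density, margin)
    return [i for i, v in enumerate(vectors) if math.dist(c, v) >= threshold]

# Toy vectors (assumed values): five close together, one far away.
vectors = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [4.0, 4.0], [0.9, 0.5]]
density = find_density(vectors, max_gap=0.6)
flagged = anomalous_instances(vectors, max_gap=0.6, margin=0.5)
```

Here the instance at index 5 joins the density because it is within `max_gap` of index 3, while index 4 lies well past the threshold and is flagged. A production system might use a dedicated clustering library instead; the grouping loop above is only meant to make the "within a predetermined distance of another distinct instance" wording concrete.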

3. A method for identifying anomalous training data samples and intelligently forming a corpus of training data for improving a machine learning-based dialogue system, the method comprising:
- sourcing a corpus of raw machine learning training data from one or more sources of training data based on a plurality of seed training samples, wherein the corpus of raw machine learning training data comprises a plurality of distinct instances of training data;
  
  generating a vector representation for each distinct instance of training data in the corpus of raw machine learning training data;
  
  identifying statistical characteristics of the corpus of raw machine learning training data based on a mapping of the vector representation for each distinct instance of training data within the corpus of raw machine learning training data;
  
  identifying a density of the plurality of distinct instances of training data based on the mapping of the vector representation for each distinct instance of training data within the corpus, wherein the density of the plurality of distinct instances relates to a cluster or a grouping of distinct instances of training data of the corpus of raw machine learning training data in which each distinct instance of training data is within a predetermined distance of another distinct instance of training data within the cluster or the grouping;
  
  setting an anomaly threshold based on identifying an absolute distance value away from a centroid of the density of the plurality of distinct instances, wherein a distal end of the absolute distance falls along an area beyond the density of the plurality of distinct instances;
  
  identifying one or more anomalous instances of the plurality of distinct instances of training data of the corpus of raw machine learning training data based on the identified statistical characteristics of the corpus, wherein identifying the one or more anomalous instances includes:
  
  identifying a given distinct instance as one of the one or more anomalous instances if a distance value for the given distinct instance away from a centroid of the density satisfies or exceeds the anomaly threshold; and
  
  curating the corpus of raw machine learning training data based on each of the one or more instances of training data identified as anomalous instances.
- - 4. The method according to claim 3, wherein the one or more anomalous instances relate to one or more instances of training data identified within the corpus of raw machine learning training data having vector representations that satisfies or exceeds a target threshold based on a mean vector representation of the corpus of raw machine learning training data.
  - 5. The method according to claim 3, wherein the plurality of seed training samples comprise a plurality of example utterances and/or prompts for a specific dialogue intent of a machine learning-based automated dialogue system.
  - 6. The method according to claim 3, further comprising:
    - defining a re-seeding sample set that includes a plurality of re-seeding training samples, wherein each of the plurality of re-seeding training samples comprises a distinct one of the one or more anomalous instances.
  - 7. The method according to claim 6, further comprising:
    - sourcing a re-seeding corpus of raw machine learning training data from one or more sources of training data based on the re-seeding sample set comprising the plurality of re-seeding training samples based on the one or more anomalous instances.
  - 8. The method according to claim 7, further comprising:
    - constructing a joint corpus of training data that includes a synthesis of [i] the corpus of raw machine learning training data and [ii] the re-seeding corpus of raw machine learning training data.
  - 9. The method according to claim 7, further comprising:
    - calculating one or more efficacy metrics of the joint corpus of training data, wherein calculating the one or more efficacy metrics includes calculating one or more of a coverage metric value and a diversity metric value of the joint corpus of training data;
      
      sourcing additional re-seeding corpora of raw machine learning training data until a coverage metric threshold and/or a diversity metric threshold of a resulting joint corpus of raw machine learning training data is satisfied by one or more of the coverage metric value and the diversity metric value, wherein the resulting joint corpus of raw machine learning training data combines each of the corpus of raw machine learning training data, the re-seeded corpus of raw machine learning training data, and all subsequent re-seeded corpus of raw machine learning training data.
  - 10. The method according to claim 3, wherein:
    - each instance of training data of the corpus of raw machine learning training data comprises a word or a sentence for performing a training of a machine learning model of a machine learning-based automated dialogue system,
      
      generating the vector representation for each instance of training data includes:
      
      computing a vector value for each instance of training data using one or more sentence embedding techniques or one or more word embedding techniques.
  - 11. The method according to claim 3, wherein identifying the statistical characteristics of the corpus of raw machine learning training data includes:
    - computing a centroid of the corpus of raw machine learning training data based on the vector representation for each instance of training data within the corpus of raw machine learning training data.
  - 12. The method according to claim 3, wherein identifying the statistical characteristics includes:
    - computing a distance value from a centroid of the corpus of raw machine learning training data for each instance of training data within the corpus of raw machine learning training data.
  - 13. The method according to claim 12, further comprising:
    - enumerating each instance of training data of the corpus of raw machine learning training data in an ascending order or a descending order based on the computed distance value for each instance of training data.
  - 14. The method according to claim 13, wherein identifying one or more anomalous instances of the plurality of distinct instances of training data of the corpus of raw machine learning training data includes:
    - evaluating the computed distance value for each distinct instance of training data against an anomaly threshold.
  - 15. The method according to claim 3, further comprising:
    - generating a graphical representation of the plurality of distinct instances of training data of the corpus of raw machine learning training data based on the vector representation for each distinct instance of training data, wherein identifying the one or more anomalous instances includes:
      
      identifying a given instance as one of the one or more anomalous instances if the given instance is visually distant from a density or a cluster of distinct instances of training data within the corpus.
  - 16. The method according to claim 3, further comprising:
    - evaluating each of the one or more anomalous instances of the corpus of raw machine learning training data, wherein the evaluating includes determining for each of the one or more anomalous instances whether a dialogue intent classification label associated with a respective one of the one or more anomalous instances matches an identified dialogue intent of the respective one of the one or more anomalous instances.
  - 17. The method according to claim 3, further comprising:
    - identifying whether each respective one of the one or more anomalous instances comprises a valid anomalous instance or an invalid anomalous instance, wherein:
      
      (i) a valid anomalous instance relates to an instance of training data that (a) overlaps or shares in a same or a similar semantic meaning as an average training data sample instance from the corpus of raw machine learning training data or (b) overlaps or shares in a same or a similar semantic meaning as a seed training sample of the seeding sample set, and
      
      (ii) an invalid anomalous instance relates to an instance of training data that (a) fails to overlap or does not share in a same or a similar semantic meaning as an average training data sample instance from the corpus of raw machine learning training data or (b) fails to overlap or does not share in a same or a similar semantic meaning as a seed training sample of the seeding sample set.
  - 18. The method according to claim 17, wherein curating the corpus of raw machine learning training data includes:
    - if a distinct one of the one or more anomalous instances is identified as the invalid anomalous instance, reducing the corpus of raw machine learning training data by discarding the invalid anomalous instance.
  - 19. The method according to claim 17, further comprising:
    - defining a re-seeding sample set that includes a plurality of valid anomalous instances; and
      
      sourcing a re-seeding corpus of raw machine learning training data from one or more sources of training data based on the plurality of valid anomalous instances.
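
Dependent claims 12 to 14 and 17 to 19 can be read together as a rank, validate, curate and re-seed loop. The sketch below is a hedged illustration under several assumptions that are not in the claims: validity is approximated by cosine similarity to the seed vectors (`min_similarity`), the two-dimensional vectors are hand-made stand-ins for real sentence embeddings, the threshold of 1.5 is arbitrary, and `fake_source` is a hypothetical placeholder for whatever data source (for example, crowdsourcing) would return new samples for a seed.

```python
import math

def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def rank_by_distance(samples, center):
    """Claims 12-13: (distance, text) pairs in descending order of distance."""
    return sorted(((math.dist(center, vec), text) for text, vec in samples),
                  reverse=True)

def split_anomalies(anomalies, seed_vectors, min_similarity):
    """Claim 17, assumed test: valid if similar enough to any seed sample."""
    valid, invalid = [], []
    for text, vec in anomalies:
        best = max(cosine(vec, s) for s in seed_vectors)
        (valid if best >= min_similarity else invalid).append(text)
    return valid, invalid

def curate(samples, invalid):
    """Claim 18: discard invalid anomalous instances from the corpus."""
    dropped = set(invalid)
    return [(t, v) for t, v in samples if t not in dropped]

def reseed(valid, source):
    """Claim 19: source new raw data using valid anomalies as seeds."""
    return [new for seed in valid for new in source(seed)]

# Hand-made (text, vector) pairs; the vectors are illustrative, not real embeddings.
seed_vectors = [[1.0, 0.0]]
samples = [
    ("what is my balance", [1.0, 0.1]),
    ("show my balance", [0.9, 0.0]),
    ("how much cash do I have left", [3.0, 0.6]),
    ("tell me a joke", [0.1, 2.5]),
]

center = centroid([v for _, v in samples])
ranked = rank_by_distance(samples, center)
flagged = {text for dist, text in ranked if dist >= 1.5}   # claim 14
anomalies = [(t, v) for t, v in samples if t in flagged]

valid, invalid = split_anomalies(anomalies, seed_vectors, min_similarity=0.9)
curated = curate(samples, invalid)

def fake_source(seed):
    """Hypothetical stand-in for a real training data source."""
    return [seed + " please"]

reseeding_corpus = reseed(valid, fake_source)
```

In this toy run both distant instances are flagged, but only the off-topic one is discarded; the on-topic outlier survives and becomes a re-seeding sample, which is the behavior claims 17 to 19 describe for valid anomalous instances.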

Current Assignee: Clinc, Inc.
Original Assignee: Clinc, Inc.
Inventors: Larson, Stefan; Mahendran, Anish; Lee, Andrew; Kummerfeld, Jonathan K.; Hill, Parker; Laurenzano, Michael A.; Hauswald, Johann; Tang, Lingjia; Mars, Jason
Primary Examiner(s): Afshar, Kamran
Assistant Examiner(s): Buss, Benjamin J
Application Number: US16/689,287
Publication Number: US 20200193331A1
Time in Patent Office: 202 Days
Field of Search: None
US Class Current
CPC Class Codes

G06F 17/10   Complex mathematical operat...

G06F 17/18   for evaluating statistical ...

G06F 18/214   Generating training pattern...

G06F 18/217   Validation; Performance eva...

G06F 18/24137   Distances to cluster centroïds

G06F 40/216   using statistical methods

G06F 40/279   Recognition of textual enti...

G06F 40/35   Discourse or dialogue repre...

G06N 20/10   using kernel methods, e.g. ...

G06N 20/20   Ensemble learning

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 5/027   Frames

G06N 5/04   Inference or reasoning models
