Disambiguation in speech recognition

US 9,484,021 B1
Filed: 03/30/2015
Issued: 11/01/2016
Est. Priority Date: 03/30/2015
Status: Active Grant

Abstract

Automatic speech recognition (ASR) processing including a two-stage configuration. After ASR processing of an incoming utterance where the ASR outputs an N-best list including multiple hypotheses, a first stage determines whether to execute a command associated with one of the hypotheses or whether to output some of the hypotheses of the N-best list for disambiguation. A second stage determines what hypotheses should be included in the disambiguation choices. A first machine learning model is used at the first stage and a second machine learning model is used at the second stage. The multi-stage configuration allows for reduced speech processing errors as well as a reduced number of utterances sent for disambiguation, which thus improves the user experience.
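
The two-stage flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Hypothesis` type, the callable signatures, the fallback to the top hypothesis, and the toy threshold rules are all hypothetical stand-ins for the trained models, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hypothesis:
    text: str
    confidence: float


# Stage 1 (hypothetical signature): True = offer choices, False = execute the top hypothesis.
FirstStage = Callable[[Sequence[float]], bool]
# Stage 2 (hypothetical signature): True = include the hypothesis at this index in the choices.
SecondStage = Callable[[Sequence[float], int], bool]


def two_stage_decision(
    n_best: Sequence[Hypothesis],
    first_stage: FirstStage,
    second_stage: SecondStage,
) -> List[Hypothesis]:
    """Return one hypothesis to execute, or several to offer for disambiguation."""
    if not n_best:
        raise ValueError("n_best must contain at least one hypothesis")
    ranked = sorted(n_best, key=lambda h: h.confidence, reverse=True)
    scores = [h.confidence for h in ranked]
    if not first_stage(scores):
        return ranked[:1]
    chosen = [h for i, h in enumerate(ranked) if second_stage(scores, i)]
    # Assumption: if stage 2 rejects everything, fall back to the top hypothesis.
    return chosen or ranked[:1]


# Toy stand-ins for the two trained models (simple score-margin rules, assumed).
def toy_first_stage(scores: Sequence[float]) -> bool:
    return len(scores) > 1 and scores[0] - scores[1] < 0.15


def toy_second_stage(scores: Sequence[float], index: int) -> bool:
    return scores[0] - scores[index] < 0.15


n_best = [
    Hypothesis("play adele", 0.48),
    Hypothesis("play a dell", 0.42),
    Hypothesis("played hell", 0.10),
]
choices = two_stage_decision(n_best, toy_first_stage, toy_second_stage)
print([h.text for h in choices])
```

With these toy rules the two closely scored hypotheses are returned for disambiguation and the low-scoring third one is dropped; a clearly dominant top score would return a single hypothesis instead.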

191 Citations

20 Claims

1. A method for performing automatic speech recognition (ASR) processing on an utterance using trained classifiers, the method comprising:
- determining a first classifier, wherein the first classifier is trained to determine, using confidence scores corresponding to a plurality of ASR hypotheses, whether to select a single ASR hypothesis or whether to perform further selection from among the plurality of ASR hypotheses;
  
  determining a second classifier, wherein the second classifier is trained to determine, using the confidence scores, which of the plurality of ASR hypotheses to select for disambiguation;
  
  receiving, from a mobile device, audio data corresponding to an utterance;
  
  performing ASR processing on the audio data to generate a first ASR hypothesis corresponding to a first confidence score, a second ASR hypothesis corresponding to a second confidence score and a third ASR hypothesis corresponding to a third confidence score;
  
  processing, using a processor, the first confidence score, the second confidence score and the third confidence score using the first classifier to output an indication indicating the first ASR hypothesis, the second ASR hypothesis and the third ASR hypothesis for further processing;
  
  processing, using a processor, the first confidence score, the second confidence score and the third confidence score using the second classifier to output an indication that the first ASR hypothesis and second ASR hypothesis are selected for disambiguation;
  
  sending the first ASR hypothesis and the second ASR hypothesis to the mobile device for disambiguation; and
  
  receiving a selection from the mobile device, the selection indicating the first ASR hypothesis.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1, further comprising:
    - training the first classifier using a first plurality of first confidence score groups, wherein:
      - each first confidence score group is associated with a training utterance,
      - each first confidence score group includes a plurality of confidence scores,
      - each confidence score of the first confidence score group is associated with a training ASR hypothesis corresponding to the respective training utterance, and
      - each first confidence score group is labeled for either (a) selecting a highest scoring training ASR hypothesis associated with the associated training utterance or (b) selecting a plurality of training ASR hypotheses associated with the training utterance for disambiguation; and
      
      training a second classifier using the second plurality of second confidence score groups, wherein:
      - each second confidence score group is associated with a candidate training ASR hypothesis,
      - each second confidence score group includes a plurality of confidence scores corresponding to different training ASR hypotheses of a same training utterance, and
      - each second confidence score group is labeled for either (a) selecting the respective training ASR hypothesis for disambiguation or (b) not selecting the respective training ASR hypothesis for disambiguation.
  - 3. The method of claim 2, further comprising determining the second plurality of second confidence score groups from among the first confidence score groups labeled for (b) selecting a plurality of training ASR hypotheses associated with the training utterance for disambiguation.
  - 4. The method of claim 2, wherein each second confidence score group further comprises an indication of a command result corresponding to the candidate training ASR hypothesis, and wherein training the second classifier comprises considering the command results during training.
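
The training-data arrangement in claims 2 and 3 can be illustrated with a small sketch. It assumes the labels already exist (for example, from annotation); the `FirstGroup` and `SecondGroup` types and the `keep` field are hypothetical names introduced here, not taken from the patent. The sketch only shows how second-stage groups could be derived from the first-stage groups labeled (b), one group per candidate hypothesis.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class FirstGroup:
    """One training utterance: its hypotheses' confidence scores plus a stage-1 label."""
    scores: Tuple[float, ...]
    disambiguate: bool                  # False = label (a), True = label (b)
    keep: FrozenSet[int] = frozenset()  # assumed annotation: indices worth offering


@dataclass(frozen=True)
class SecondGroup:
    """One candidate hypothesis: all scores of its utterance plus a stage-2 label."""
    scores: Tuple[float, ...]
    candidate: int
    selected: bool                      # True = label (a), False = label (b)


def derive_second_groups(first_groups: Sequence[FirstGroup]) -> List[SecondGroup]:
    """Build stage-2 examples only from utterances labeled for disambiguation."""
    second: List[SecondGroup] = []
    for group in first_groups:
        if not group.disambiguate:
            continue  # label (a): the top hypothesis is taken, nothing for stage 2
        for index in range(len(group.scores)):
            second.append(SecondGroup(group.scores, index, index in group.keep))
    return second


first_groups = [
    FirstGroup(scores=(0.90, 0.06, 0.04), disambiguate=False),
    FirstGroup(scores=(0.48, 0.42, 0.10), disambiguate=True, keep=frozenset({0, 1})),
]
second_groups = derive_second_groups(first_groups)
print([(g.candidate, g.selected) for g in second_groups])
```

Each derived group carries the full score tuple of its utterance, which matches the claim language that a second confidence score group includes scores for different hypotheses of the same training utterance.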

5. A method comprising:
- receiving audio data corresponding to an utterance;
  
  performing speech processing on the audio data to generate:
  - a first hypothesis corresponding to a first feature,
  - a second hypothesis corresponding to a second feature, and
  - a third hypothesis corresponding to a third feature;
  
  processing, using a processor, the first feature, the second feature and the third feature using a first model to identify the first hypothesis, the second hypothesis, and the third hypothesis for further selection; and
  
  processing, using a processor, the first feature, the second feature and the third feature using a second model to select the first hypothesis and second hypothesis for disambiguation.
- Dependent claims (6, 7, 8, 9, 10, 11, 12):
  - 6. The method of claim 5, further comprising:
    - sending, to a device, an indication of the first hypothesis and second hypothesis; and
      
      receiving, from the device, an indication of selection of the first hypothesis.
  - 7. The method of claim 6, further comprising:
    - processing text of the first hypothesis to identify a search query;
      
      executing the search query to determine search results; and
      
      sending the search results to the device.
  - 8. The method of claim 5, wherein each of the first model and the second model is one of a trained classifier, neural network or support vector machine.
  - 9. The method of claim 5, wherein each of the first model and the second model is trained using at least one of adaptive boosting, decision trees and random forest.
  - 10. The method of claim 5, wherein the first feature is a first confidence score, the second feature is a second confidence score and the third feature is a third confidence score.
  - 11. The method of claim 5, further comprising:
    - training the first model using a first plurality of first confidence score groups, wherein:
      - each first confidence score group is associated with a training utterance,
      - each first confidence score group includes a plurality of confidence scores,
      - each confidence score of the first confidence score group is associated with a training ASR hypothesis corresponding to the respective training utterance, and
      - each first confidence score group is labeled for either (a) selecting a highest scoring training ASR hypothesis associated with the associated training utterance or (b) selecting a plurality of training ASR hypotheses associated with the training utterance for disambiguation; and
      
      training the second model using the second plurality of second confidence score groups, wherein:
      - each second confidence score group is associated with a candidate training ASR hypothesis,
      - each second confidence score group includes a plurality of confidence scores corresponding to different training ASR hypotheses of a same training utterance, and
      - each second confidence score group is labeled for either (a) selecting the respective training ASR hypothesis for disambiguation or (b) not selecting the respective training ASR hypothesis for disambiguation.
  - 12. The method of claim 5, wherein:
    - the first hypothesis is a first search request;
      
      the second hypothesis is a second search request;
      
      the third hypothesis is a third search request, and
      
      the method further comprises:
      
      determining first search results corresponding to the first search request;
      
      determining second search results corresponding to the second search request;
      
      determining third search results corresponding to the third search request; and
      
      processing the first search results, the second search results and the third search results using the second model to select the first hypothesis and second hypothesis.
      
      the method further comprises processing an indicator of the first command result, an indicator of the second command result, and an indicator of the third command result using the second model to select the first hypothesis and second hypothesis.
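
Claim 12 has the second model consume search results alongside the hypotheses. One way this might look in practice is a per-candidate feature vector that combines the confidence scores with a simple indicator of each search outcome. The function name, the feature layout, and the choice of a "has any results" flag as the indicator are all assumptions made for illustration; the claims do not say how search results are encoded.

```python
from typing import List, Sequence


def second_stage_features(
    scores: Sequence[float],
    result_counts: Sequence[int],
    candidate: int,
) -> List[float]:
    """Build a feature vector for one candidate hypothesis.

    Assumed layout: every confidence score, then one has-results flag per
    hypothesis, then the candidate's own score and flag.
    """
    if len(scores) != len(result_counts):
        raise ValueError("scores and result_counts must have the same length")
    if not 0 <= candidate < len(scores):
        raise IndexError("candidate index out of range")
    flags = [1.0 if count > 0 else 0.0 for count in result_counts]
    return [*scores, *flags, scores[candidate], flags[candidate]]


# Three search-request hypotheses; the second one returned no search results.
features = second_stage_features([0.5, 0.3, 0.2], [12, 0, 4], candidate=1)
print(features)
```

A trained second model could then learn, for example, that a hypothesis whose search returns nothing is rarely worth offering to the user, although that behavior would depend entirely on the training data.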

13. A computing system comprising:
- at least one processor;
  
  a memory including instructions operable to be executed by the at least one processor to cause the system to perform a set of actions comprising:
  
  receiving audio data corresponding to an utterance;
  
  performing speech processing on the audio data to generate:
  - a first hypothesis corresponding to a first feature,
  - a second hypothesis corresponding to a second feature, and
  - a third hypothesis corresponding to a third feature;
  
  processing the first feature, the second feature and the third feature using a first model to identify the first hypothesis, the second hypothesis, and the third hypothesis for further selection; and
  
  processing the first feature, the second feature and the third feature using a second model to select the first hypothesis and second hypothesis for disambiguation.
- Dependent claims (14, 15, 16, 17, 18, 19, 20):
  - 14. The computing system of claim 13, wherein the set of actions further comprises:
    - sending, to a device, an indication of the first hypothesis and second hypothesis; and
      
      receiving, from the device, an indication of selection of the first hypothesis.
  - 15. The computing system of claim 14, wherein the set of actions further comprises:
    - processing text of the first hypothesis to identify a search query;
      
      executing the search query to determine search results; and
      
      sending the search results to the device.
  - 16. The computing system of claim 13, wherein each of the first model and the second model is one of a trained classifier, neural network or support vector machine.
  - 17. The computing system of claim 13, wherein each of the first model and the second model is trained using at least one of adaptive boosting, decision trees and random forest.
  - 18. The computing system of claim 13, wherein the first feature is a first confidence score, the second feature is a second confidence score and the third feature is a third confidence score.
  - 19. The computing system of claim 13, wherein the set of actions further comprises:
    - training the first model using a first plurality of first confidence score groups, wherein:
      - each first confidence score group is associated with a training utterance,
      - each first confidence score group includes a plurality of confidence scores,
      - each confidence score of the first confidence score group is associated with a training ASR hypothesis corresponding to the respective training utterance, and
      - each first confidence score group is labeled for either (a) selecting a highest scoring training ASR hypothesis associated with the associated training utterance or (b) selecting a plurality of training ASR hypotheses associated with the training utterance for disambiguation; and
      
      training the second model using the second plurality of second confidence score groups, wherein:
      - each second confidence score group is associated with a candidate training ASR hypothesis,
      - each second confidence score group includes a plurality of confidence scores corresponding to different training ASR hypotheses of a same training utterance, and
      - each second confidence score group is labeled for either (a) selecting the respective training ASR hypothesis for disambiguation or (b) not selecting the respective training ASR hypothesis for disambiguation.
  - 20. The computing system of claim 16, wherein:
    - the first hypothesis is a first search request;
      
      the second hypothesis is a second search request;
      
      the third hypothesis is a third search request, and
      
      the set of actions further comprises:
      
      determining first search results corresponding to the first search request;
      
      determining second search results corresponding to the second search request;
      
      determining third search results corresponding to the third search request; and
      
      processing the first search results, the second search results and the third search results using the second model to select the first hypothesis and second hypothesis.
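
Claims 9 and 17 name decision trees among the possible training techniques. As a rough illustration, the sketch below fits a decision stump (a one-level decision tree) as a stand-in first model. The single feature used here, the margin between the two highest confidence scores, is an assumption chosen for simplicity, and the function names are hypothetical; the claims do not prescribe either.

```python
from typing import Sequence, Tuple

# A labeled first confidence score group: (scores, True if labeled for disambiguation).
LabeledGroup = Tuple[Sequence[float], bool]


def margin(scores: Sequence[float]) -> float:
    """Gap between the two highest confidence scores (needs at least two scores)."""
    top = sorted(scores, reverse=True)
    return top[0] - top[1]


def fit_stump(groups: Sequence[LabeledGroup]) -> float:
    """Pick the threshold t minimizing training errors for 'disambiguate if margin < t'."""
    if not groups:
        raise ValueError("need at least one labeled group")
    margins = sorted({margin(scores) for scores, _ in groups})
    # Candidate thresholds: below all margins, between neighbours, above all margins.
    candidates = [margins[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(margins, margins[1:])]
    candidates.append(margins[-1] + 1.0)

    def errors(threshold: float) -> int:
        return sum((margin(scores) < threshold) != label for scores, label in groups)

    return min(candidates, key=errors)


def predict_disambiguate(threshold: float, scores: Sequence[float]) -> bool:
    return margin(scores) < threshold


training = [
    ([0.90, 0.05, 0.05], False),
    ([0.50, 0.45, 0.05], True),
    ([0.70, 0.20, 0.10], False),
    ([0.40, 0.35, 0.25], True),
]
threshold = fit_stump(training)
print(predict_disambiguate(threshold, [0.46, 0.44, 0.10]))
```

A stump is the simplest tree-based learner; adaptive boosting and random forests, also named in the claims, combine many such trees, and a production system would more plausibly rely on an established library than on hand-written fitting code.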

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Mairesse, Francois; Raccuglia, Paul Frederick; Vitaladevuni, Shiv Naga Prasad
Primary Examiner(s)
Hang, Vu B

Application Number

US14/673,089
Time in Patent Office

582 Days
Field of Search

704/231, 704/232, 704/236, 704/237, 704/238, 704/239, 704/240, 704/246, 704/249, 704/250, 704/251, 704/254, 704/255, 704/257, 704/270, 704/275
US Class Current

1/1
CPC Class Codes

G10L 15/00   Speech recognition G10L17/0...

G10L 15/02   Feature extraction for spee...

G10L 15/08   Speech classification or se...

G10L 15/22   Procedures used during a sp...
