Automatic Grammar Augmentation For Robust Voice Command Recognition

US 20200135179A1
Filed: 10/28/2019
Published: 04/30/2020
Est. Priority Date: 10/27/2018
Status: Active Grant

Abstract

Various embodiments include methods and devices for implementing automatic grammar augmentation to improve voice command recognition accuracy in systems with a small-footprint acoustic model. Alternative expressions that may capture acoustic model decoding variations may be added to a grammar set. An acoustic model-specific statistical pronunciation dictionary may be derived by running the acoustic model through a large general speech dataset, and a command-specific candidate set containing potential grammar expressions may then be constructed. Greedy-based and cross-entropy-method (CEM) based algorithms may be utilized to search the candidate set for augmentations with improved recognition accuracy.
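
The candidate-set construction summarized in the abstract can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function name `generate_candidates`, the word-level layout of `pron_dict`, the example entries and probabilities, and the `max_per_command` cutoff are all hypothetical. Each command word is replaced by the decoded forms listed for it, and variants are ranked by the product of the per-word probabilities.

```python
from itertools import product


def generate_candidates(commands, pron_dict, max_per_command=5):
    """Return {command: [(variant, score), ...]} sorted by descending score.

    pron_dict maps a word to a list of (decoded_form, probability) pairs.
    A word without an entry is kept unchanged with probability 1.0.
    """
    candidates = {}
    for command in commands:
        options = [pron_dict.get(word, [(word, 1.0)]) for word in command.split()]
        scored = []
        for combo in product(*options):
            variant = " ".join(form for form, _ in combo)
            score = 1.0
            for _, prob in combo:
                score *= prob
            if variant != command:  # the original expression is already in the grammar
                scored.append((variant, score))
        scored.sort(key=lambda item: item[1], reverse=True)
        candidates[command] = scored[:max_per_command]
    return candidates


# Hypothetical dictionary entries, invented for illustration only.
pron_dict = {
    "play": [("play", 0.8), ("pay", 0.15), ("plea", 0.05)],
    "music": [("music", 0.9), ("musik", 0.1)],
}
cands = generate_candidates(["play music"], pron_dict, max_per_command=3)
for variant, score in cands["play music"]:
    print(f"{variant}: {score:.3f}")
```

The cutoff keeps the candidate set small enough that a later search over subsets stays tractable; a real system might instead threshold on probability.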

22 Claims

1. A method of improving voice command recognition, comprising:
- applying an acoustic model to a general speech dataset to generate a statistical pronunciation dictionary;
- generating an augmented grammar candidate set based on the statistical pronunciation dictionary and an original grammar set, wherein:
  - the original grammar set comprises voice commands to be recognized; and
  - each element of the augmented grammar candidate set comprises a variation of one of the voice commands to be recognized; and
- generating an augmented grammar set by adding one or more elements of the augmented grammar candidate set to the original grammar set.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein applying an acoustic model to a general speech dataset to generate a statistical pronunciation dictionary comprises:
    - obtaining a greedy decoding sequence for each utterance in the general speech dataset;
    - calculating a minimum-edit-path of a corresponding ground-truth to the greedy decoding sequence; and
    - obtaining a mapping of each utterance to a corresponding maximum probability decoding.
  - 3. The method of claim 1, wherein applying an acoustic model to a general speech dataset to generate a statistical pronunciation dictionary comprises:
    - obtaining one or more sequences from beam-searching algorithms;
    - calculating a minimum-edit-path of a corresponding ground-truth to the obtained one or more sequences; and
    - obtaining a mapping of each utterance to a corresponding maximum probability decoding.
  - 4. The method of claim 1, further comprising determining a command recognition accuracy, a false-alarm rate, and a mis-detection rate for a given augmented grammar candidate set.
  - 5. The method of claim 4, wherein determining a command recognition accuracy, a false-alarm rate, and a mis-detection rate for a given augmented grammar candidate set comprises:
    - determining the command recognition accuracy for each element of the augmented grammar candidate set based on a command-specific data set, the command-specific data set comprising, for each element of the augmented grammar candidate set, an audio waveform and a corresponding target command; and
    - determining the false-alarm rate for each element of the augmented grammar candidate set based on an out-of-domain data set comprising a set of utterances that do not correspond to any one of the voice commands to be recognized.
  - 6. The method of claim 1, wherein generating an augmented grammar set by adding one or more elements of the augmented grammar candidate set to the original grammar set comprises selecting the one or more elements of the augmented grammar candidate set utilizing a method selected from one of:
    - a naïve greedy search;
    - a greedy search with refinement; or
    - a beam-search.
  - 7. The method of claim 1, wherein generating an augmented grammar set by adding one or more elements of the augmented grammar candidate set to the original grammar set comprises selecting the one or more elements of the augmented grammar candidate set utilizing a cross entropy method.
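
The dictionary-building steps recited in claim 2 (greedy decoding, minimum-edit-path alignment against the ground truth, and a mapping to decoded forms) might be sketched as follows. This is a simplified illustration, not the claimed method itself: it assumes the greedy decodings are already available as text, aligns at the word level rather than at the level of the acoustic model's output units, and uses hypothetical names (`align`, `build_dictionary`) and invented example utterances.

```python
from collections import Counter, defaultdict


def align(truth, decoded):
    """Return (truth_word, decoded_word) pairs along one minimum-edit path.

    A deleted truth word is paired with None; an inserted decoded word
    appears as (None, word).
    """
    n, m = len(truth), len(decoded)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (truth[i - 1] != decoded[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Walk back from the bottom-right corner to recover one optimal path.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        diag = (
            i > 0 and j > 0
            and dist[i][j] == dist[i - 1][j - 1] + (truth[i - 1] != decoded[j - 1])
        )
        if diag:
            pairs.append((truth[i - 1], decoded[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((truth[i - 1], None))
            i -= 1
        else:
            pairs.append((None, decoded[j - 1]))
            j -= 1
    return pairs[::-1]


def build_dictionary(utterances):
    """Map each ground-truth word to relative frequencies of its decoded forms."""
    counts = defaultdict(Counter)
    for truth, decoded in utterances:
        for t, d in align(truth.split(), decoded.split()):
            if t is not None and d is not None:
                counts[t][d] += 1
    return {
        t: {d: c / sum(ctr.values()) for d, c in ctr.items()}
        for t, ctr in counts.items()
    }


# Invented (ground truth, greedy decoding) pairs for illustration.
utterances = [
    ("play music", "pay music"),
    ("play music", "play music"),
    ("stop music now", "stop music"),
]
stat_dict = build_dictionary(utterances)
print(stat_dict)
```

Pure insertions and deletions are ignored here for brevity; a fuller treatment would decide how to attribute them to neighboring words.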

8. A computing device, comprising:
- a memory; and
- a processor coupled to the memory and configured with processor-executable instructions to perform operations comprising:
  - applying an acoustic model to a general speech dataset to generate a statistical pronunciation dictionary;
  - generating an augmented grammar candidate set based on the statistical pronunciation dictionary and an original grammar set, wherein:
    - the original grammar set comprises voice commands to be recognized; and
    - each element of the augmented grammar candidate set comprises a variation of one of the voice commands to be recognized; and
  - generating an augmented grammar set by adding one or more elements of the augmented grammar candidate set to the original grammar set.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The computing device of claim 8, wherein the processor is configured with processor-executable instructions to perform operations such that applying an acoustic model to a general speech dataset to generate a statistical pronunciation dictionary comprises:
    - obtaining a greedy decoding sequence for each utterance in the general speech dataset;
    - calculating a minimum-edit-path of a corresponding ground-truth to the greedy decoding sequence; and
    - obtaining a mapping of each utterance to a corresponding maximum probability decoding.
  - 10. The computing device of claim 8, wherein the processor is configured with processor-executable instructions to perform operations such that applying an acoustic model to a general speech dataset to generate a statistical pronunciation dictionary comprises:
    - obtaining one or more sequences from beam-searching algorithms;
    - calculating a minimum-edit-path of a corresponding ground-truth to the obtained one or more sequences; and
    - obtaining a mapping of each utterance to a corresponding maximum probability decoding.
  - 11. The computing device of claim 8, wherein the processor is configured with processor-executable instructions to perform operations further comprising determining a command recognition accuracy, a false-alarm rate, and a mis-detection rate for a given augmented grammar candidate set.
  - 12. The computing device of claim 11, wherein the processor is configured with processor-executable instructions to perform operations such that determining a command recognition accuracy, a false-alarm rate, and a mis-detection rate for a given augmented grammar candidate set comprises:
    - determining the command recognition accuracy for each element of the augmented grammar candidate set based on a command-specific data set, the command-specific data set comprising, for each element of the augmented grammar candidate set, an audio waveform and a corresponding target command; and
    - determining the false-alarm rate for each element of the augmented grammar candidate set based on an out-of-domain data set comprising a set of utterances that do not correspond to any one of the voice commands to be recognized.
  - 13. The computing device of claim 8, wherein the processor is configured with processor-executable instructions to perform operations such that generating an augmented grammar set by adding one or more elements of the augmented grammar candidate set to the original grammar set comprises selecting the one or more elements of the augmented grammar candidate set utilizing a method selected from one of:
    - a naïve greedy search;
    - a greedy search with refinement; or
    - a beam-search.
  - 14. The computing device of claim 8, wherein the processor is configured with processor-executable instructions to perform operations such that:
    - generating an augmented grammar set by adding one or more elements of the augmented grammar candidate set to the original grammar set comprises selecting the one or more elements of the augmented grammar candidate set utilizing a cross entropy method.
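
The evaluation and selection operations recited in claims 11 through 13 can be illustrated with a small naïve greedy search. The sketch below is a toy under stated assumptions rather than the claimed procedure: recognition is reduced to an exact match between a precomputed decoding and a grammar expression, mis-detection is taken to mean a command utterance that matches no expression, and the names `evaluate`, `naive_greedy`, the `max_far` threshold, and all example data are hypothetical.

```python
def evaluate(grammar, command_data, ood_data):
    """Return (accuracy, false_alarm_rate, misdetection_rate).

    grammar maps an expression to the command it triggers.
    command_data holds (decoded_text, target_command) pairs.
    ood_data holds decoded texts that correspond to no command.
    """
    correct = sum(grammar.get(dec) == target for dec, target in command_data)
    missed = sum(dec not in grammar for dec, _ in command_data)
    false_alarms = sum(dec in grammar for dec in ood_data)
    return (
        correct / len(command_data),
        false_alarms / len(ood_data),
        missed / len(command_data),
    )


def naive_greedy(original, candidates, command_data, ood_data, max_far=0.1):
    """Repeatedly add the candidate with the best accuracy gain.

    A candidate is only eligible if the false-alarm rate stays <= max_far.
    The search stops when no eligible candidate improves accuracy.
    """
    grammar = dict(original)
    remaining = dict(candidates)
    best_acc, _, _ = evaluate(grammar, command_data, ood_data)
    while remaining:
        scored = []
        for expr, cmd in remaining.items():
            acc, far, _ = evaluate({**grammar, expr: cmd}, command_data, ood_data)
            if far <= max_far:
                scored.append((acc, expr))
        if not scored:
            break
        acc, expr = max(scored)
        if acc <= best_acc:
            break
        grammar[expr] = remaining.pop(expr)
        best_acc = acc
    return grammar


# Invented data for illustration.
original = {"play music": "PLAY", "stop music": "STOP"}
candidates = {"pay music": "PLAY", "stop musik": "STOP", "play": "PLAY"}
command_data = [
    ("play music", "PLAY"),
    ("pay music", "PLAY"),
    ("pay music", "PLAY"),
    ("stop musik", "STOP"),
    ("stop music", "STOP"),
]
ood_data = ["play", "hello there", "good morning", "what time is it"]

augmented = naive_greedy(original, candidates, command_data, ood_data)
print(sorted(augmented))
print(evaluate(augmented, command_data, ood_data))
```

In this toy data the short expression "play" is rejected because it also matches an out-of-domain utterance, which is the kind of trade-off the false-alarm determination is meant to expose.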

15. A computing device, comprising:
- a memory;
- means for applying an acoustic model to a general speech dataset to generate a statistical pronunciation dictionary;
- means for generating an augmented grammar candidate set based on the statistical pronunciation dictionary and an original grammar set, wherein:
  - the original grammar set comprises voice commands to be recognized; and
  - each element of the augmented grammar candidate set comprises a variation of one of the voice commands to be recognized; and
- means for generating an augmented grammar set by adding one or more elements of the augmented grammar candidate set to the original grammar set.

16. A non-transitory processor-readable medium having stored thereon processor-executable instructions configured to cause a processor to perform operations comprising:
- applying an acoustic model to a general speech dataset to generate a statistical pronunciation dictionary;
- generating an augmented grammar candidate set based on the statistical pronunciation dictionary and an original grammar set, wherein:
  - the original grammar set comprises voice commands to be recognized; and
  - each element of the augmented grammar candidate set comprises a variation of one of the voice commands to be recognized; and
- generating an augmented grammar set by adding one or more elements of the augmented grammar candidate set to the original grammar set.
- Dependent claims (17, 18, 19, 20, 21, 22):
  - 17. The non-transitory processor-readable medium of claim 16, wherein the stored processor-executable instructions are configured to cause a processor to perform operations such that applying an acoustic model to a general speech dataset to generate a statistical pronunciation dictionary comprises:
    - obtaining a greedy decoding sequence for each utterance in the general speech dataset;
    - calculating a minimum-edit-path of a corresponding ground-truth to the greedy decoding sequence; and
    - obtaining a mapping of each utterance to a corresponding maximum probability decoding.
  - 18. The non-transitory processor-readable medium of claim 16, wherein the stored processor-executable instructions are configured to cause a processor to perform operations such that applying an acoustic model to a general speech dataset to generate a statistical pronunciation dictionary comprises:
    - obtaining one or more sequences from beam-searching algorithms;
    - calculating a minimum-edit-path of a corresponding ground-truth to the obtained one or more sequences; and
    - obtaining a mapping of each utterance to a corresponding maximum probability decoding.
  - 19. The non-transitory processor-readable medium of claim 16, wherein the stored processor-executable instructions are configured to cause a processor to perform operations determining a command recognition accuracy, a false-alarm rate, and a mis-detection rate for a given augmented grammar candidate set.
  - 20. The non-transitory processor-readable medium of claim 19, wherein the stored processor-executable instructions are configured to cause a processor to perform operations such that determining a command recognition accuracy, a false-alarm rate, and a mis-detection rate for a given augmented grammar candidate set comprises:
    - determining the command recognition accuracy for each element of the augmented grammar candidate set based on a command-specific data set, the command-specific data set comprising, for each element of the augmented grammar candidate set, an audio waveform and a corresponding target command; and
    - determining the false-alarm rate for each element of the augmented grammar candidate set based on an out-of-domain data set comprising a set of utterances that do not correspond to any one of the voice commands to be recognized.
  - 21. The non-transitory processor-readable medium of claim 16, wherein the stored processor-executable instructions are configured to cause a processor to perform operations such that generating an augmented grammar set by adding one or more elements of the augmented grammar candidate set to the original grammar set comprises selecting the one or more elements of the augmented grammar candidate set utilizing a method selected from one of:
    - a naïve greedy search;
    - a greedy search with refinement; or
    - a beam-search.
  - 22. The non-transitory processor-readable medium of claim 16, wherein the stored processor-executable instructions are configured to cause a processor to perform operations such that generating an augmented grammar set by adding one or more elements of the augmented grammar candidate set to the original grammar set comprises selecting the one or more elements of the augmented grammar candidate set utilizing a cross entropy method.
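
Claim 22 recites selection by a cross entropy method without fixing its details. One common way to apply the cross-entropy method to subset selection is to keep an inclusion probability per candidate, sample subsets, and move the probabilities toward the best-scoring samples. The sketch below follows that generic recipe and is an assumption, not the patent's algorithm: the Bernoulli parameterization, the `cem_select` name, the hyperparameters, and the toy `gains` and `false_alarm_cost` tables are all hypothetical.

```python
import random


def cem_select(candidates, score, iterations=20, samples=50,
               elite_frac=0.2, smoothing=0.7, seed=0):
    """Search subsets of `candidates` with the cross-entropy method.

    `score(subset)` returns a number to maximize.
    Returns (best_subset, best_score), never worse than the empty subset.
    """
    rng = random.Random(seed)
    probs = [0.5] * len(candidates)          # Bernoulli inclusion probabilities
    best_subset, best_score = [], score([])  # baseline: no augmentation
    n_elite = max(1, int(samples * elite_frac))
    for _ in range(iterations):
        population = []
        for _ in range(samples):
            mask = [rng.random() < p for p in probs]
            subset = [c for c, keep in zip(candidates, mask) if keep]
            population.append((score(subset), mask, subset))
        population.sort(key=lambda item: item[0], reverse=True)
        elites = population[:n_elite]
        if elites[0][0] > best_score:
            best_score, best_subset = elites[0][0], elites[0][2]
        # Shift each inclusion probability toward its frequency among the elites.
        for i in range(len(probs)):
            elite_mean = sum(mask[i] for _, mask, _ in elites) / n_elite
            probs[i] = smoothing * elite_mean + (1 - smoothing) * probs[i]
    return best_subset, best_score


# Toy objective: per-candidate accuracy gain minus a false-alarm penalty.
gains = {"pay music": 0.4, "stop musik": 0.2, "play": 0.1}
false_alarm_cost = {"pay music": 0.0, "stop musik": 0.0, "play": 0.5}


def toy_score(subset):
    return sum(gains[c] - false_alarm_cost[c] for c in subset)


candidates = list(gains)
best_subset, best_score = cem_select(candidates, toy_score)
print(best_subset, best_score)
```

In practice the objective would come from measured recognition accuracy and false-alarm rate on held-out data rather than from a fixed additive table.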

Current Assignee: Qualcomm, Inc.
Original Assignee: Qualcomm, Inc.
Inventors: Yang, Yang; Lalitha, Anusha; Lee, Jin Won; Lott, Christopher
Granted Patent: US 11,282,512 B2

CPC Class Codes

G06F 3/167   Audio in a user interface, ...

G10L 15/02   Feature extraction for spee...

G10L 15/063   Training

G10L 15/07   to the speaker

G10L 15/16   using artificial neural net...

G10L 15/187   Phonemic context, e.g. pron...

G10L 15/19   Grammatical context, e.g. d...

G10L 15/22   Procedures used during a sp...
