Method and apparatus for evaluating trigger phrase enrollment

US 10,192,548 B2
Filed: 06/02/2017
Issued: 01/29/2019
Est. Priority Date: 07/31/2013
Status: Active Grant

Abstract

An electronic device includes a microphone that receives an audio signal that includes a spoken trigger phrase, and a processor that is electrically coupled to the microphone. The processor measures characteristics of the audio signal, and determines, based on the measured characteristics, whether the spoken trigger phrase is acceptable for trigger phrase model training. If the spoken trigger phrase is determined not to be acceptable for trigger phrase model training, the processor rejects the trigger phrase for trigger phrase model training.

20 Claims

1. A computer-implemented method comprising:
- during a trigger phrase enrollment process:
  
  receiving, at a speech recognition-enabled electronic device, a first audio signal corresponding to a user of the speech recognition-enabled electronic device speaking a trigger phrase, the first audio signal comprising a first number of frames having a measure of noise variability of background noise exceeding a noise variability threshold;
  
  when a count of the first number of frames in the first audio signal satisfies a frame number threshold, prompting, by the speech recognition-enabled electronic device, the user to speak the trigger phrase again;
  
  receiving, by the speech recognition-enabled electronic device, a second audio signal corresponding to the user speaking the trigger phrase again, the second audio signal comprising a second number of frames having the measure of noise variability of background noise exceeding the noise variability threshold; and
  
  when a count of the second number of frames in the second audio signal dissatisfies the frame number threshold, training, by the speech recognition-enabled electronic device, a trigger phrase model with the second audio signal corresponding to the user speaking the trigger phrase again; and
  
  after the trigger phrase enrollment process:
  
  receiving, at the speech recognition-enabled electronic device and while the speech recognition-enabled electronic device is in a sleep mode, a third audio signal including an utterance of the trigger phrase spoken by the user; and
  
  detecting, by the speech recognition-enabled electronic device and using the trigger phrase model trained during the trigger phrase enrollment process, the utterance of the trigger phrase in the third audio signal, the trigger phrase when detected in the third audio signal causing the speech recognition-enabled electronic device to wake from the sleep mode, the sleep mode comprising a power-saving mode of operation in which one or more parts of the speech recognition-enabled electronic device are in a low-power state or powered off.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The computer-implemented method of claim 1, wherein the count of the first number of frames satisfies the frame number threshold when the first number of frames is greater than or equal to the frame number threshold, and wherein the count of the second number of frames dissatisfies the frame number threshold when the second number of frames is less than the frame number threshold.
  - 3. The computer-implemented method of claim 1, further comprising:
    - determining, by the speech recognition-enabled electronic device, the measure of noise variability of the background noise for each frame in the received first audio signal;
      
      comparing, by the speech recognition-enabled electronic device, the determined measure of noise variability of the background noise to the noise variability threshold; and
      
      incrementing, by the speech recognition-enabled electronic device, a counter in response to determining the determined measure of noise variability of the background noise is greater than the noise variability threshold.
  - 4. The computer-implemented method of claim 1, further comprising:
    - determining, by the speech recognition-enabled electronic device, the measure of noise variability of the background noise for each frame in the received second audio signal;
      
      comparing, by the speech recognition-enabled electronic device, the determined measure of noise variability of the background noise to the noise variability threshold; and
      
      incrementing, by the speech recognition-enabled electronic device, a counter in response to determining the determined measure of noise variability of the background noise is greater than the noise variability threshold.
  - 5. The computer-implemented method of claim 4, wherein determining the measure of the noise variability of the background noise for each frame in the received second audio signal comprises:
    - obtaining, by the speech recognition-enabled electronic device, a number of channels in the received second audio signal;
      
      obtaining, by the speech recognition-enabled electronic device, a number of contiguous noise frames in the received second audio signal;
      
      determining, by the speech recognition-enabled electronic device, a current channel index associated with each of the number of channels in the received second audio signal;
      
      obtaining, by the speech recognition-enabled electronic device, a look-back index;
      
      obtaining, by the speech recognition-enabled electronic device, a smoothed maximum difference of smoothed channel noise;
      
      obtaining, by the speech recognition-enabled electronic device, a high boundary point representing noise exhibiting high noise variability; and
      
      obtaining, by the speech recognition-enabled electronic device, a low boundary point representing noise exhibiting low noise variability.
  - 6. The computer-implemented method of claim 5, further comprising determining, by the speech recognition-enabled electronic device, the measure of the noise variability of the background noise based on the number of channels, the number of contiguous noise frames, the current channel index, the look-back index, the smoothed maximum difference of the smoothed channel noise, the high boundary point, and the low boundary point, wherein the measure of the noise variability of the background noise is a value greater than 0 and less than 1.
  - 7. The computer-implemented method of claim 6, wherein the noise variability threshold is greater than 0.7 and the frame number threshold is greater than 20.
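
The enrollment check in claims 1 to 4 and 7 can be illustrated with a minimal Python sketch. It assumes the per-frame noise variability measure (a value between 0 and 1) has already been computed elsewhere; the claims do not give its formula, so it is not reproduced here. The function names, the constants 0.75 and 25 (chosen only because claim 7 requires thresholds above 0.7 and 20), and the `max_attempts` limit are all hypothetical and not taken from the patent.

```python
from typing import Optional, Sequence

# Illustrative values only: claim 7 requires a noise variability threshold
# greater than 0.7 and a frame number threshold greater than 20.
NOISE_VARIABILITY_THRESHOLD = 0.75
FRAME_NUMBER_THRESHOLD = 25


def count_noisy_frames(variability_per_frame: Sequence[float],
                       threshold: float = NOISE_VARIABILITY_THRESHOLD) -> int:
    """Count frames whose noise variability measure is greater than the threshold."""
    counter = 0
    for measure in variability_per_frame:
        if measure > threshold:  # compare, then increment (claims 3 and 4)
            counter += 1
    return counter


def needs_retry(noisy_frame_count: int,
                frame_threshold: int = FRAME_NUMBER_THRESHOLD) -> bool:
    """Per claim 2, the threshold is 'satisfied' when count >= threshold."""
    return noisy_frame_count >= frame_threshold


def enroll(recordings: Sequence[Sequence[float]],
           max_attempts: int = 3) -> Optional[int]:
    """Return the index of the first recording clean enough for training.

    Each recording is a sequence of per-frame noise variability measures.
    Returns None if every attempt (up to max_attempts, a hypothetical
    limit) would lead to a re-prompt.
    """
    for attempt, frames in enumerate(recordings[:max_attempts]):
        if not needs_retry(count_noisy_frames(frames)):
            return attempt  # this recording would be used to train the model
        # otherwise the device would prompt the user to speak again
    return None


if __name__ == "__main__":
    first = [0.9] * 30 + [0.1] * 10   # 30 highly variable frames: re-prompt
    second = [0.9] * 5 + [0.1] * 40   # only 5 highly variable frames: accept
    print(enroll([first, second]))
```

In this sketch the first recording is rejected because 30 of its frames exceed the variability threshold, and the second is accepted because only 5 do.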

8. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  during a trigger phrase enrollment process:
  
  receiving a first audio signal corresponding to a user of a speech recognition-enabled electronic device speaking a trigger phrase into the speech recognition-enabled electronic device, the first audio signal comprising a first number of frames having a measure of noise variability of background noise exceeding a noise variability threshold;
  
  when a count of the first number of frames in the first audio signal satisfies a frame number threshold, prompting the user to speak the trigger phrase into the speech recognition-enabled electronic device again;
  
  receiving a second audio signal corresponding to the user speaking the trigger phrase again, the second audio signal comprising a second number of frames having the measure of noise variability of background noise exceeding the noise variability threshold; and
  
  when a count of the second number of frames in the second audio signal dissatisfies the frame number threshold, training a trigger phrase model with the second audio signal corresponding to the user speaking the trigger phrase again; and
  
  after the trigger phrase enrollment process:
  
  receiving, while the speech recognition-enabled electronic device is in a sleep mode, a third audio signal including an utterance of the trigger phrase spoken by the user; and
  
  detecting, using the trigger phrase model trained during the trigger phrase enrollment process, the utterance of the trigger phrase in the third audio signal, the trigger phrase when detected in the third audio signal causing the speech recognition-enabled electronic device to wake from the sleep mode, the sleep mode comprising a power-saving mode of operation in which one or more parts of the speech recognition-enabled electronic device are in a low-power state or powered off.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The system of claim 8, wherein the count of the first number of frames satisfies the frame number threshold when the first number of frames is greater than or equal to the frame number threshold, and wherein the count of the second number of frames dissatisfies the frame number threshold when the second number of frames is less than the frame number threshold.
  - 10. The system of claim 8, wherein the operations further comprise:
    - determining the measure of noise variability of the background noise for each frame in the received first audio signal;
      
      comparing the determined measure of noise variability of the background noise to the noise variability threshold; and
      
      incrementing a counter in response to determining the determined measure of noise variability of the background noise is greater than the noise variability threshold.
  - 11. The system of claim 8, wherein the operations further comprise:
    - determining the measure of noise variability of the background noise for each frame in the received second audio signal;
      
      comparing the determined measure of noise variability of the background noise to the noise variability threshold; and
      
      incrementing a counter in response to determining the determined measure of noise variability of the background noise is greater than the noise variability threshold.
  - 12. The system of claim 11, wherein determining the measure of the noise variability of the background noise for each frame in the received second audio signal comprises:
    - obtaining a number of channels in the received second audio signal;
      
      obtaining a number of contiguous noise frames in the received second audio signal;
      
      determining a current channel index associated with each of the number of channels in the received second audio signal;
      
      obtaining a look-back index;
      
      obtaining a smoothed maximum difference of smoothed channel noise;
      
      obtaining a high boundary point representing noise exhibiting high noise variability; and
      
      obtaining a low boundary point representing noise exhibiting low noise variability.
  - 13. The system of claim 12, wherein the operations further comprise determining the measure of the noise variability of the background noise based on the number of channels, the number of contiguous noise frames, the current channel index, the look-back index, the smoothed maximum difference of the smoothed channel noise, the high boundary point, and the low boundary point, wherein the measure of the noise variability of the background noise is a value greater than 0 and less than 1.
  - 14. The system of claim 13, wherein the noise variability threshold is greater than 0.7 and the frame number threshold is greater than 20.
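
The post-enrollment behavior recited in the independent claims (audio received while the device is in a sleep mode, trigger phrase detected with the trained model, device woken) can be sketched as a small state holder. This is an illustration under stated assumptions, not the patented implementation: the `Device` class, its `on_audio` method, and the `detect_trigger` callable standing in for the trained trigger phrase model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Device:
    # Hypothetical stand-in for the trained trigger phrase model: returns
    # True when the trigger phrase is found in the given audio samples.
    detect_trigger: Callable[[Sequence[float]], bool]
    asleep: bool = True  # sleep mode: a power-saving state

    def on_audio(self, audio: Sequence[float]) -> bool:
        """Handle incoming audio; return True if this call woke the device."""
        if self.asleep and self.detect_trigger(audio):
            self.asleep = False  # trigger phrase detected: leave sleep mode
            return True
        return False


if __name__ == "__main__":
    # Toy detector (an assumption for demonstration): treats any audio with
    # enough total energy as containing the trigger phrase.
    device = Device(detect_trigger=lambda a: sum(abs(x) for x in a) > 1.0)
    print(device.on_audio([0.0, 0.1]), device.asleep)
    print(device.on_audio([0.9, 0.8]), device.asleep)
```

Passing the detector in as a callable keeps the sketch independent of how the trigger phrase model is actually trained or evaluated, which the claims leave to the enrollment process.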

15. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- during a trigger phrase enrollment process:
  
  receiving a first audio signal corresponding to a user of a speech recognition-enabled electronic device speaking a trigger phrase into the speech recognition-enabled electronic device, the first audio signal comprising a first number of frames having a measure of noise variability of background noise exceeding a noise variability threshold;
  
  when a count of the first number of frames in the first audio signal satisfies a frame number threshold, prompting the user to speak the trigger phrase into the speech recognition-enabled electronic device again;
  
  receiving a second audio signal corresponding to the user speaking the trigger phrase again, the second audio signal comprising a second number of frames having the measure of noise variability of background noise exceeding the noise variability threshold; and
  
  when a count of the second number of frames in the second audio signal dissatisfies the frame number threshold, training a trigger phrase model with the second audio signal corresponding to the user speaking the trigger phrase again; and
  
  after the trigger phrase enrollment process:
  
  receiving, while the speech recognition-enabled electronic device is in a sleep mode, a third audio signal including an utterance of the trigger phrase spoken by the user; and
  
  detecting, using the trigger phrase model trained during the trigger phrase enrollment process, the utterance of the trigger phrase in the third audio signal, the trigger phrase when detected in the third audio signal causing the speech recognition-enabled electronic device to wake from the sleep mode, the sleep mode comprising a power-saving mode of operation in which one or more parts of the speech recognition-enabled electronic device are in a low-power state or powered off.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The computer-readable medium of claim 15, wherein the count of the first number of frames satisfies the frame number threshold when the first number of frames is greater than or equal to the frame number threshold, and wherein the count of the second number of frames dissatisfies the frame number threshold when the second number of frames is less than the frame number threshold.
  - 17. The computer-readable medium of claim 15, wherein the operations further comprise:
    - determining the measure of noise variability of the background noise for each frame in the received first audio signal;
      
      comparing the determined measure of noise variability of the background noise to the noise variability threshold; and
      
      incrementing a counter in response to determining the determined measure of noise variability of the background noise is greater than the noise variability threshold.
  - 18. The computer-readable medium of claim 15, wherein the operations further comprise:
    - determining the measure of noise variability of the background noise for each frame in the received second audio signal;
      
      comparing the determined measure of noise variability of the background noise to the noise variability threshold; and
      
      incrementing a counter in response to determining the determined measure of noise variability of the background noise is greater than the noise variability threshold.
  - 19. The computer-readable medium of claim 18, wherein determining the measure of the noise variability of the background noise for each frame in the received second audio signal comprises:
    - obtaining a number of channels in the received second audio signal;
      
      obtaining a number of contiguous noise frames in the received second audio signal;
      
      determining a current channel index associated with each of the number of channels in the received second audio signal;
      
      obtaining a look-back index;
      
      obtaining a smoothed maximum difference of smoothed channel noise;
      
      obtaining a high boundary point representing noise exhibiting high noise variability; and
      
      obtaining a low boundary point representing noise exhibiting low noise variability.
  - 20. The computer-readable medium of claim 19, wherein the operations further comprise determining the measure of the noise variability of the background noise based on the number of channels, the number of contiguous noise frames, the current channel index, the look-back index, the smoothed maximum difference of the smoothed channel noise, the high boundary point, and the low boundary point, wherein the measure of the noise variability of the background noise is a value greater than 0 and less than 1.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Technology Holdings LLC (Alphabet Inc.)
Inventors: Clark, Joel A.; Ramabadran, Tenkasi V.; Jasiuk, Mark A.
Primary Examiner(s): Han, Qi
Application Number: US15/612,693
Publication Number: US 20170270913A1
Time in Patent Office: 606 Days
Field of Search: 704244, 704231, 704233, 704235, 704246, 704249, 704250, 704251, 704255

CPC Class Codes

G10L 15/063   Training
G10L 15/1807   using prosody or stress
G10L 15/20   Speech recognition techniqu...
G10L 2015/088   Word spotting
G10L 21/0264   characterised by the type o...
G10L 25/84   for discriminating voice fr...
