Method and apparatus for evaluating trigger phrase enrollment

US 10,163,439 B2
Filed: 05/31/2017
Issued: 12/25/2018
Est. Priority Date: 07/31/2013
Status: Expired due to Fees

Abstract

An electronic device includes a microphone that receives an audio signal that includes a spoken trigger phrase, and a processor that is electrically coupled to the microphone. The processor measures characteristics of the audio signal, and determines, based on the measured characteristics, whether the spoken trigger phrase is acceptable for trigger phrase model training. If the spoken trigger phrase is determined not to be acceptable for trigger phrase model training, the processor rejects the trigger phrase for trigger phrase model training.

20 Claims

1. A computer-implemented method comprising:
- during a trigger phrase enrollment process:
  
  prompting, by a speech recognition-enabled electronic device, a user of the speech recognition-enabled electronic device to speak a trigger phrase;
  
  receiving, at the speech recognition-enabled device, a first audio signal corresponding to the user speaking the trigger phrase;
  
  selecting, by the speech recognition-enabled device, a shortest segment among a plurality of segments in the first audio signal that have voice activity, each segment comprising a corresponding sequence of contiguous frames in the first audio signal that have voice activity;
  
  determining, by the data processing hardware, a length of the shortest segment by counting a number of frames in the sequence of contiguous frames corresponding to the shortest segment; and
  
  when the length of the shortest segment in the first audio signal satisfies a threshold value, training, by the speech recognition-enabled electronic device, a trigger phrase model with the first audio signal corresponding to the user speaking the trigger phrase, the trigger phrase model configured to detect the trigger phrase in a spoken utterance; and
  
  after the trigger phrase enrollment process:
  
  receiving, at the speech recognition-enabled device and while the speech recognition-enabled electronic device is in a sleep mode, a second audio signal including an utterance of the trigger phrase spoken by the user; and
  
  detecting, by the speech recognition-enabled electronic device and using the trigger phrase model trained during the trigger phrase enrollment process, the utterance of the trigger phrase in the second audio signal, the trigger phrase when detected in the second audio signal causing the speech recognition-enabled electronic device to wake from the sleep mode, the sleep mode comprising a power-saving mode of operation in which one or more parts of the speech recognition-enabled electronic device are in a low-power state or powered off.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The computer-implemented method of claim 1, further comprising, when the length of the shortest segment in the first audio signal that has voice activity dissatisfies the threshold value:
    - prompting, by the speech recognition-enabled electronic device, the user to speak the trigger phrase again; and
      
      rejecting, by the speech recognition-enabled electronic device, the first audio signal corresponding to the user speaking the trigger phrase for use in training the trigger phrase model.
  - 3. The computer-implemented method of claim 1, further comprising, for each frame in the received first audio signal:
    - identifying, by speech recognition-enabled electronic device, audio characteristics of the user speaking the trigger phrase and audio characteristics of background noise for the corresponding frame in the received first audio signal;
      
      comparing, by the speech recognition-enabled electronic device, the identified audio characteristics of the user speaking the trigger phrase to predetermined threshold values associated with one or more values for trigger phrase model training; and
      
      determining, by the speech recognition-enabled electronic device, a voice activity detection flag for the corresponding frame in the received first audio signal in response to comparing the identified audio characteristics of the user speaking the trigger phrase to the predetermined threshold values.
  - 4. The computer-implemented method of claim 3, wherein determining the voice activity detection flag for the corresponding frame in the received first audio signal comprises:
    - generating an accept enrollment flag in response to the identified audio characteristics of the user speaking the trigger phrase being less than the predetermined threshold values; and
      
      generating a reject enrollment flag in response to the identified audio characteristics of the user speaking the trigger phrase being greater than the predetermined threshold values.
  - 5. The computer-implemented method of claim 1, wherein selecting the shortest segment among the plurality of segments in the first audio signal that have voice activity comprises determining a lowest number of contiguous frames in the received audio signal that comprise the accept enrollment flag.
  - 6. The computer-implemented method of claim 5, further comprising:
    - comparing, by the speech recognition-enabled electronic device, the number of frames in the sequence of contiguous frames corresponding to the shortest segment to the threshold value, the threshold value comprising a threshold frame value; and
      
      when the number of frames in the sequence of contiguous frames corresponding to the shortest segment is less than the threshold frame value, prompting, by the speech recognition-enabled electronic device, the user to speak the trigger phrase in a second attempt.
  - 7. The computer-implemented method of claim 6, wherein the threshold frame count is 27 frames.
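
The gating step recited in claims 1 and 5 to 7 can be illustrated with a minimal sketch. It assumes the per-frame voice activity flags have already been computed and are supplied as a list of booleans. The function names are hypothetical, and treating a recording with no voiced segment as rejected is an assumption the claims do not state. The 27-frame default comes from claim 7.

```python
from itertools import groupby
from typing import List, Optional

THRESHOLD_FRAMES = 27  # threshold frame count recited in claim 7


def voiced_segment_lengths(flags: List[bool]) -> List[int]:
    """Return the frame count of each run of contiguous voiced frames."""
    return [sum(1 for _ in run) for voiced, run in groupby(flags) if voiced]


def shortest_segment_length(flags: List[bool]) -> Optional[int]:
    """Frame count of the shortest voiced segment, or None if none exists."""
    lengths = voiced_segment_lengths(flags)
    return min(lengths) if lengths else None


def accept_for_training(flags: List[bool],
                        threshold: int = THRESHOLD_FRAMES) -> bool:
    """True when the shortest voiced segment satisfies the threshold.

    Claim 6 re-prompts when the count is less than the threshold, so
    "satisfies" is read here as "greater than or equal to".
    Assumption: a recording with no voiced segment is rejected.
    """
    shortest = shortest_segment_length(flags)
    return shortest is not None and shortest >= threshold
```

A recording whose voiced runs are 30 and 27 frames long would pass under this reading, while one containing a 26-frame run would be rejected and the user prompted again.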

8. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  during a trigger phrase enrollment process for a speech recognition-enabled electronic device:
  
  prompting a user of the speech recognition-enabled electronic device to speak a trigger phrase;
  
  receiving a first audio signal corresponding to the user speaking the trigger phrase;
  
  selecting a shortest segment among a plurality of segments in the first audio signal that have voice activity, each segment comprising a corresponding sequence of contiguous frames in the first audio signal that have voice activity;
  
  determining a length of the shortest segment by counting a number of frames in the sequence of contiguous frames corresponding to the shortest segment; and
  
  when the length of the shortest segment in the first audio signal satisfies a threshold value, training a trigger phrase model with the first audio signal corresponding to the user speaking the trigger phrase, the trigger phrase model configured to detect the trigger phrase in a spoken utterance; and
  
  after the trigger phrase enrollment process:
  
  receiving, while the speech recognition-enabled electronic device is in a sleep mode, a second audio signal including an utterance of the trigger phrase spoken by the user; and
  
  detecting, using the trigger phrase model trained during the trigger phrase enrollment process, the utterance of the trigger phrase in the second audio signal, the trigger phrase when detected in the second audio signal causing the speech recognition-enabled electronic device to wake from the sleep mode, the sleep mode comprising a power-saving mode of operation in which one or more parts of the speech recognition-enabled electronic device are in a low-power state or powered off.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The system of claim 8, wherein the operations further comprise, when the length of the shortest segment in the first audio signal that has voice activity dissatisfies the threshold value:
    - prompting the user to speak the trigger phrase again; and
      
      rejecting the first audio signal corresponding to the user speaking the trigger phrase for use in training the trigger phrase model.
  - 10. The system of claim 8, wherein the operations further comprise, for each frame in the received first audio signal:
    - identifying audio characteristics of the user speaking the trigger phrase and audio characteristics of background noise for the corresponding frame in the received first audio signal;
      
      comparing the identified audio characteristics of the user speaking the trigger phrase to predetermined threshold values associated with one or more values for trigger phrase model training; and
      
      determining a voice activity detection flag for the corresponding frame in the received first audio signal in response to comparing the identified audio characteristics of the user speaking the trigger phrase to the predetermined threshold values.
  - 11. The system of claim 10, wherein determining the voice activity detection flag for the corresponding frame in the received first audio signal comprises:
    - generating an accept enrollment flag in response to the identified audio characteristics of the user speaking the trigger phrase being less than the predetermined threshold values; and
      
      generating a reject enrollment flag in response to the identified audio characteristics of the user speaking the trigger phrase being greater than the predetermined threshold values.
  - 12. The system of claim 8, wherein selecting the shortest segment among the plurality of segments in the first audio signal that have voice activity comprises determining a lowest number of contiguous frames in the received audio signal that comprise the accept enrollment flag.
  - 13. The system of claim 12, wherein the operations further comprise:
    - comparing the number of frames in the sequence of contiguous frames corresponding to the shortest segment to the threshold value, the threshold value comprising a threshold frame value; and
      
      when the number of frames in the sequence of contiguous frames corresponding to the shortest segment is less than the threshold frame value, prompting the user to speak the trigger phrase in a second attempt.
  - 14. The system of claim 13, wherein the threshold frame count is 27 frames.
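
The per-frame flagging of claims 10 and 11 can be sketched as below. The claims do not name the audio characteristics or their threshold values, so `measure_a` and `measure_b` are placeholders, and the function names are hypothetical. Following the literal wording of claim 11, a frame gets the accept enrollment flag when its characteristics are less than the thresholds. Treating a value equal to its threshold as a reject is an assumption.

```python
from typing import Dict, List


def frame_flag(characteristics: Dict[str, float],
               thresholds: Dict[str, float]) -> bool:
    """Return True (accept enrollment flag) or False (reject enrollment flag).

    Per claim 11's wording, accept only when every measured characteristic
    is less than its predetermined threshold.
    Assumption: equality counts as a reject.
    """
    return all(characteristics[name] < limit
               for name, limit in thresholds.items())


def flag_frames(frames: List[Dict[str, float]],
                thresholds: Dict[str, float]) -> List[bool]:
    """Compute one flag per frame of the received audio signal."""
    return [frame_flag(frame, thresholds) for frame in frames]
```

The resulting list of flags is the kind of input from which runs of contiguous accepted frames, and hence the shortest segment of claim 12, could be derived.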

15. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- during a trigger phrase enrollment process for a speech recognition-enabled electronic device:
  
  prompting a user of the speech recognition-enabled electronic device to speak a trigger phrase;
  
  receiving a first audio signal corresponding to the user speaking the trigger phrase;
  
  selecting a shortest segment among a plurality of segments in the first audio signal that have voice activity, each segment comprising a corresponding sequence of contiguous frames in the first audio signal that have voice activity;
  
  determining a length of the shortest segment by counting a number of frames in the sequence of contiguous frames corresponding to the shortest segment; and
  
  when the length of the shortest segment in the first audio signal satisfies a threshold value, training a trigger phrase model with the first audio signal corresponding to the user speaking the trigger phrase, the trigger phrase model configured to detect the trigger phrase in a spoken utterance; and
  
  after the trigger phrase enrollment process:
  
  receiving, while the speech recognition-enabled electronic device is in a sleep mode, a second audio signal including an utterance of the trigger phrase spoken by the user; and
  
  detecting, using the trigger phrase model trained during the trigger phrase enrollment process, the utterance of the trigger phrase in the second audio signal, the trigger phrase when detected in the second audio signal causing the speech recognition-enabled electronic device to wake from the sleep mode, the sleep mode comprising a power-saving mode of operation in which one or more parts of the speech recognition-enabled electronic device are in a low-power state or powered off.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise, when the length of the shortest segment in the first audio signal that has voice activity dissatisfies the threshold value:
    - prompting the user to speak the trigger phrase again; and
      
      rejecting the first audio signal corresponding to the user speaking the trigger phrase for use in training the trigger phrase model.
  - 17. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise, for each frame in the received first audio signal:
    - identifying audio characteristics of the user speaking the trigger phrase and audio characteristics of background noise for the corresponding frame in the received first audio signal;
      
      comparing the identified audio characteristics of the user speaking the trigger phrase to predetermined threshold values associated with one or more values for trigger phrase model training; and
      
      determining a voice activity detection flag for the corresponding frame in the received first audio signal in response to comparing the identified audio characteristics of the user speaking the trigger phrase to the predetermined threshold values.
  - 18. The non-transitory computer-readable medium of claim 17, wherein determining the voice activity detection flag for the corresponding frame in the received first audio signal comprises:
    - generating an accept enrollment flag in response to the identified audio characteristics of the user speaking the trigger phrase being less than the predetermined threshold values; and
      
      generating a reject enrollment flag in response to the identified audio characteristics of the user speaking the trigger phrase being greater than the predetermined threshold values.
  - 19. The non-transitory computer-readable medium of claim 15, wherein selecting the shortest segment among the plurality of segments in the first audio signal that have voice activity comprises determining a lowest number of contiguous frames in the received audio signal that comprise the accept enrollment flag.
  - 20. The non-transitory computer-readable medium of claim 19, wherein the operations further comprise:
    - comparing the number of frames in the sequence of contiguous frames corresponding to the shortest segment to the threshold value, the threshold value comprising a threshold frame value; and
      
      when the number of frames in the sequence of contiguous frames corresponding to the shortest segment is less than the threshold frame value, prompting the user to speak the trigger phrase in a second attempt.
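
The two phases recited in the independent claims, enrollment with rejection and re-prompting followed by wake-on-trigger from a sleep mode, can be outlined as a small control-flow sketch. Everything here is illustrative: `enroll`, `Device`, `PowerState`, and the `Audio` type are hypothetical names, and the acceptance check and the training function are passed in as stand-ins because the claims do not specify how the trigger phrase model is built.

```python
from enum import Enum
from typing import Callable, Iterable, Optional, Sequence

Audio = Sequence[int]               # placeholder audio representation
Detector = Callable[[Audio], bool]  # stand-in for a trained trigger phrase model


class PowerState(Enum):
    SLEEP = "sleep"
    AWAKE = "awake"


def enroll(attempts: Iterable[Audio],
           is_acceptable: Callable[[Audio], bool],
           train: Callable[[Audio], Detector]) -> Optional[Detector]:
    """Train on the first acceptable recording; rejected ones are skipped.

    Each further item in `attempts` stands for the user being prompted
    to speak the trigger phrase again.
    """
    for audio in attempts:
        if is_acceptable(audio):
            return train(audio)
    return None


class Device:
    """Starts in sleep mode and wakes when the detector fires."""

    def __init__(self, detector: Detector) -> None:
        self.detector = detector
        self.state = PowerState.SLEEP

    def on_audio(self, audio: Audio) -> PowerState:
        if self.state is PowerState.SLEEP and self.detector(audio):
            self.state = PowerState.AWAKE
        return self.state
```

In a fuller system, `is_acceptable` would wrap the shortest-segment check described in the claims and `train` would return a real detector; here they are left as parameters so the sketch stays independent of any particular model.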

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Technology Holdings LLC (Alphabet Inc.)
Inventors: Clark, Joel A.; Ramabadran, Tenkasi V.; Jasiuk, Mark A.
Primary Examiner(s): Han, Qi
Application Number: US15/609,342
Publication Number: US 20170263244A1
Time in Patent Office: 573 Days
Field of Search: 704/244, 704/231, 704/233, 704/235, 704/246, 704/249, 704/250, 704/251, 704/255, 704/236, 704/256.2
US Class Current
CPC Class Codes

G10L 15/063   Training

G10L 15/1807   using prosody or stress

G10L 15/20   Speech recognition techniqu...

G10L 2015/088   Word spotting

G10L 21/0264   characterised by the type o...

G10L 25/84   for discriminating voice fr...
