User defined key phrase detection by user dependent sequence modeling

US 10,043,521 B2
Filed: 07/01/2016
Issued: 08/07/2018
Est. Priority Date: 07/01/2016
Status: Active Grant

3 Assignments

0 Petitions

Abstract

Techniques related to key phrase detection for applications such as wake on voice are discussed. Such techniques may include determining a sequence of audio units for received audio input representing a user defined key phrase, eliminating audio units from the sequence to generate a final sequence of audio units, and generating a key phrase recognition model representing the user defined key phrase based on the final sequence.
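
The elimination step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the decoded sequence is a list of string labels with one label per frame, that silence is a single hypothetical label `"sil"`, and that the threshold `max_run` is a free parameter. A run of silence units that lies between non-silence units is dropped when its length does not exceed the threshold; leading and trailing silence is left alone.

```python
from itertools import groupby

SIL = "sil"  # hypothetical silence label; the source does not name one


def drop_short_inner_silences(units, max_run=3, silence=SIL):
    """Remove silence runs of length <= max_run that sit between non-silence units."""
    # Split the frame-level sequence into alternating silence / non-silence runs.
    runs = [(is_sil, list(group))
            for is_sil, group in groupby(units, key=lambda u: u == silence)]
    final = []
    for i, (is_sil, run) in enumerate(runs):
        # Runs alternate, so any run that is neither first nor last
        # has non-silence units on both sides when it is a silence run.
        is_inner = 0 < i < len(runs) - 1
        if is_sil and is_inner and len(run) <= max_run:
            continue  # short pause inside the phrase: eliminate it
        final.extend(run)
    return final


if __name__ == "__main__":
    decoded = ["sil", "sil", "a", "a", "sil", "b",
               "sil", "sil", "sil", "sil", "c", "sil"]
    print(drop_short_inner_silences(decoded, max_run=2))
```

With `max_run=2`, the single silence frame between `a` and `b` is removed, while the four-frame pause before `c` and the silence at both ends are kept.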

71 Citations

20 Claims

1. A computer-implemented method for user dependent key phrase enrollment comprising:
- receiving, via a microphone, an audio input representing a user defined key phrase and converting the audio input to received audio data representative of the audio input;
  
  determining a sequence of most probable audio units corresponding to the received audio data, wherein each audio unit of most probable audio units corresponds to a frame of a plurality of frames of the audio data;
  
  processing the sequence of most probable audio units to eliminate at least one audio unit from the sequence of most probable audio units to generate a final sequence of audio units by determining a first silence audio unit of the sequence and a number of silence audio units immediately temporally following the first silence audio unit, wherein the first silence audio unit and the number of silence audio units are between non-silence audio units of the sequence, and eliminating the first silence audio unit and the immediately temporally following silence audio units in response to the total number of consecutive silence audio units not exceeding a threshold;
  
  generating a key phrase recognition model representing the user defined key phrase based on the final sequence of audio units, the key phrase recognition model comprising a single start state based rejection model, a key phrase model, and a transition from the single start state based rejection model to the key phrase model, wherein the single start state based rejection model includes a single rejection state having a plurality of rejection model self loops, wherein the key phrase model comprises a plurality of states having transitions therebetween, the plurality of states including a final state of the key phrase model, and wherein the plurality of states of the key phrase model correspond to the final sequence of audio units;
  
  receiving a further audio input for evaluation by the key phrase recognition model;
  
  generating a time series of scores of audio units based on a time series of feature vectors representative of the further audio input;
  
  scoring the key phrase recognition model based on the time series of scores of audio units to generate a rejection likelihood score and a key phrase likelihood score; and
  
  recognizing that the further audio input corresponds to the user defined key phrase based on the rejection likelihood score and the key phrase likelihood score.
  - 2. The method of claim 1, wherein processing the sequence of most probable audio units to eliminate at least one audio unit comprises determining a first sub-phonetic audio unit of the sequence and a second sub-phonetic audio unit of the sequence immediately temporally following the first sub-phonetic audio unit match and eliminating the first or second sub-phonetic audio unit from the sequence of most probable audio units responsive to the first and second sub-phonetic audio unit matching.
  - 3. The method of claim 1, wherein processing the sequence of most probable audio units to eliminate at least one audio unit comprises determining a number of non-silence sub-phonetic audio units are temporally between a first block of silence audio units and a second block of silence audio units of the sequence and eliminating the non-silence sub-phonetic audio units in response to the number of non-silence sub-phonetic audio units, temporally between the first and second blocks of silence audio units, not exceeding a threshold.
  - 4. The method of claim 1, further comprising generating a second final sequence of audio units corresponding to a second received audio input, wherein the key phrase recognition model further comprises a second transition from the single start state based rejection model to a second key phrase model, wherein the second key phrase model comprises a plurality of second states having second transitions therebetween, the plurality of second states including a second final state of the second key phrase model, wherein the plurality of second states of the second key phrase model correspond to the second final sequence of audio units.
  - 5. The method of claim 4, wherein the second received audio input represents a second user defined key phrase different than the user defined key phrase.
  - 6. The method of claim 1, further comprising generating a second final sequence of audio units corresponding to a second received audio input, wherein the key phrase recognition model further comprises a second transition from the single start state based rejection model to a second key phrase model, wherein the second key phrase model comprises a plurality of second states having second transitions therebetween, the plurality of second states including the final state of the key phrase model shared with the second key phrase model, wherein the plurality of second states of the second key phrase model correspond to the second final sequence of audio units.
  - 7. The method of claim 1, wherein determining the sequence of most probable audio units corresponding to the received audio data comprises:
    - extracting a feature vector for each frame of the received audio data to generate a time sequence of feature vectors; and
      
      decoding the time sequence of feature vectors based on an acoustic model to determine the sequence of most probable audio units.
  - 8. The method of claim 7, wherein decoding the time sequence of feature vectors comprises implementing a deep neural network, wherein the sequence of most probable audio units corresponds to a sequence of highest probability output nodes of the deep neural network determined based on the time sequence of feature vectors.
  - 9. The method of claim 1, wherein the rejection likelihood score corresponds to the single start state based rejection model, the key phrase likelihood score corresponds to the final state of the key phrase model, and determining whether the further audio input corresponds to the user defined key phrase comprises determining a log likelihood score based on the rejection likelihood score and the key phrase likelihood score and comparing the log likelihood score to a threshold.
  - 10. The method of claim 1, further comprising:
    - pruning an acoustic model by removing outputs not corresponding to the key phrase recognition model to generate a pruned acoustic model, wherein generating the time series of scores of audio units comprises implementing the pruned acoustic model.
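
The scoring and decision steps of claims 1 and 9 can be sketched in Python as follows. This is a hedged illustration under several assumptions that the claims do not state: per-frame audio unit scores are log-domain values held in a dict keyed by a unit label, each key phrase state may either stay in place or advance from its predecessor (a Viterbi-style maximum), the rejection state adds its best self-loop score each frame, and the log likelihood score is the difference between the final key phrase state score and the rejection score. The function names and the default threshold are hypothetical.

```python
import math


def score_model(frame_scores, rejection_units, phrase_units):
    """Return (rejection_score, key_phrase_score) after consuming all frames.

    frame_scores    -- list of dicts mapping audio unit label -> log score
    rejection_units -- labels attached to the rejection state's self loops
    phrase_units    -- the final sequence of audio units, one per key phrase state
    """
    rejection = 0.0                            # single start/rejection state
    states = [-math.inf] * len(phrase_units)   # key phrase states, not yet reached
    for scores in frame_scores:
        # Rejection state: take the best-scoring self loop for this frame.
        new_rejection = rejection + max(scores[u] for u in rejection_units)
        new_states = []
        for i, unit in enumerate(phrase_units):
            # Enter from the previous state (or from the rejection state for
            # the first key phrase state), or remain in the current state.
            incoming = rejection if i == 0 else states[i - 1]
            new_states.append(max(states[i], incoming) + scores[unit])
        rejection, states = new_rejection, new_states
    return rejection, states[-1]


def is_key_phrase(frame_scores, rejection_units, phrase_units, threshold=0.0):
    """Compare the log likelihood score (key phrase minus rejection) to a threshold."""
    rejection, key_phrase = score_model(frame_scores, rejection_units, phrase_units)
    return key_phrase - rejection > threshold


if __name__ == "__main__":
    match = [{"a": -0.1, "b": -3.0, "n": -2.0},
             {"a": -3.0, "b": -0.1, "n": -2.0}]
    print(is_key_phrase(match, ["n"], ["a", "b"]))
```

Only the units named in `rejection_units` and `phrase_units` are ever read from each frame, which is the set of outputs an acoustic model pruned as in claim 10 would still need to produce.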

11. A system for providing user dependent key phrase enrollment comprising:
- a microphone to receive an audio input representing a user defined key phrase and to convert the audio input to received audio data representative of the audio input;
  
  a memory to store the received audio input and a key phrase recognition model; and
  
  a processor coupled to the memory, the processor:
  
  to determine a sequence of most probable audio units corresponding to the received audio data, wherein each audio unit of the most probable audio units corresponds to a frame of a plurality of frames of the audio data,
  
  to process the sequence of most probable audio units to eliminate at least one audio unit from the sequence of most probable audio units to generate a final sequence of audio units by the processor to determine a first silence audio unit of the sequence and a number of silence audio units immediately temporally following the first silence audio unit, wherein the first silence audio unit and the number of silence audio units are between non-silence audio units of the sequence, and to eliminate the first silence audio unit and the immediately temporally following silence audio units in response to the total number of consecutive silence audio units not exceeding a threshold, and
  
  to generate the key phrase recognition model representing the user defined key phrase based on the final sequence of audio units, the key phrase recognition model comprising a single start state based rejection model, a key phrase model, and a transition from the single start state based rejection model to the key phrase model, wherein the single start state based rejection model includes a single rejection state having a plurality of rejection model self loops, wherein the key phrase model comprises a plurality of states having transitions therebetween, the plurality of states including a final state of the key phrase model, and wherein the plurality of states of the key phrase model correspond to the final sequence of audio units,
  
  wherein the microphone receives a further audio input for evaluation by the key phrase recognition model, the processor further:
  
  to generate a time series of scores of audio units based on a time series of feature vectors representative of the further audio input,
  
  to score the key phrase recognition model based on the time series of scores of audio units to generate a rejection likelihood score and a key phrase likelihood score, and
  
  to recognize that the further audio input corresponds to the user defined key phrase based on the rejection likelihood score and the key phrase likelihood score.
  - 12. The system of claim 11, wherein to process the sequence of most probable audio units to eliminate at least one audio unit comprises the processor to determine a first sub-phonetic audio unit of the sequence and a second sub-phonetic audio unit of the sequence immediately temporally following the first sub-phonetic audio unit match and to eliminate the first or second sub-phonetic audio unit from the sequence of most probable audio units responsive to the first and second sub-phonetic audio unit matching.
  - 13. The system of claim 11, wherein to process the sequence of most probable audio units to eliminate at least one audio unit comprises the processor to determine a number of non-silence sub-phonetic audio units are temporally between a first block of silence audio units and a second block of silence audio units of the sequence and to eliminate the non-silence sub-phonetic audio units in response to the number of non-silence sub-phonetic audio units, temporally between the first and second blocks of silence audio units, not exceeding a threshold.
  - 14. The system of claim 11, the processor further to generate a second final sequence of audio units corresponding to a second received audio input, wherein the key phrase recognition model further comprises a second transition from the single start state based rejection model to a second key phrase model, wherein the second key phrase model comprises a plurality of second states having second transitions therebetween, the plurality of second states including a second final state of the second key phrase model, wherein the plurality of second states of the second key phrase model correspond to the second final sequence of audio units.
  - 15. The system of claim 11, the processor further to generate a second final sequence of audio units corresponding to a second received audio input, wherein the key phrase recognition model further comprises a second transition from the single start state based rejection model to a second key phrase model, wherein the second key phrase model comprises a plurality of second states having second transitions therebetween, the plurality of second states including the final state of the key phrase model shared with the second key phrase model, wherein the plurality of second states of the second key phrase model correspond to the second final sequence of audio units.
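
The two additional elimination rules of claims 12 and 13 can be sketched in Python as shown here. The sketch is illustrative only and rests on assumptions not found in the claims: audio units are string labels, silence is a single hypothetical label `"sil"`, the burst threshold `max_burst` is a free parameter, and short bursts are removed at frame level before repeats are collapsed.

```python
from itertools import groupby


def drop_short_bursts(units, max_burst=2, silence="sil"):
    """Remove short runs of non-silence units that lie between two silence blocks."""
    runs = [(is_sil, list(group))
            for is_sil, group in groupby(units, key=lambda u: u == silence)]
    kept = []
    for i, (is_sil, run) in enumerate(runs):
        # Runs alternate, so an inner non-silence run has silence on both sides.
        between_silence = 0 < i < len(runs) - 1
        if not is_sil and between_silence and len(run) <= max_burst:
            continue  # likely a spurious noise burst, not part of the phrase
        kept.extend(run)
    return kept


def collapse_repeats(units):
    """Keep one unit from each run of identical, temporally adjacent units."""
    return [label for label, _ in groupby(units)]


if __name__ == "__main__":
    decoded = ["sil", "sil", "x", "sil", "sil",
               "a", "a", "a", "b", "b", "sil"]
    cleaned = collapse_repeats(drop_short_bursts(decoded, max_burst=1))
    print(cleaned)
```

The order of the two passes is a design choice of this sketch: counting burst length in frames requires the repeats to still be present, so collapsing runs last keeps the threshold meaningful.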

16. At least one non-transitory machine readable medium comprising a plurality of instructions that, in response to being executed on a computing device, cause the computing device to provide user dependent key phrase enrollment by:
- receiving, via a microphone, an audio input representing a user defined key phrase and converting the audio input to received audio data representative of the audio input;
  
  determining a sequence of most probable audio units corresponding to the received audio data, wherein each audio unit of the most probable audio units corresponds to a frame of a plurality of frames of the audio data;
  
  processing the sequence of most probable audio units to eliminate at least one audio unit from the sequence of most probable audio units to generate a final sequence of audio units by determining a first silence audio unit of the sequence and a number of silence audio units immediately temporally following the first silence audio unit, wherein the first silence audio unit and the number of silence audio units are between non-silence audio units of the sequence, and eliminating the first silence audio unit and the immediately temporally following silence audio units in response to the total number of consecutive silence audio units not exceeding a threshold;
  
  generating a key phrase recognition model representing the user defined key phrase based on the final sequence of audio units, the key phrase recognition model comprising a single start state based rejection model, a key phrase model, and a transition from the single start state based rejection model to the key phrase model, wherein the single start state based rejection model includes a single rejection state having a plurality of rejection model self loops, wherein the key phrase model comprises a plurality of states having transitions therebetween, the plurality of states including a final state of the key phrase model, and wherein the plurality of states of the key phrase model correspond to the final sequence of audio units;
  
  receiving a further audio input for evaluation by the key phrase recognition model;
  
  generating a time series of scores of audio units based on a time series of feature vectors representative of the further audio input;
  
  scoring the key phrase recognition model based on the time series of scores of audio units to generate a rejection likelihood score and a key phrase likelihood score; and
  
  recognizing that the further audio input corresponds to the user defined key phrase based on the rejection likelihood score and the key phrase likelihood score.
  - 17. The machine readable medium of claim 16, wherein processing the sequence of most probable audio units to eliminate at least one audio unit comprises determining a first sub-phonetic audio unit of the sequence and a second sub-phonetic audio unit of the sequence immediately temporally following the first sub-phonetic audio unit match and eliminating the first or second sub-phonetic audio unit from the sequence of most probable audio units responsive to the first and second sub-phonetic audio unit matching.
  - 18. The machine readable medium of claim 16, wherein processing the sequence of most probable audio units to eliminate at least one audio unit comprises determining a number of non-silence sub-phonetic audio units are temporally between a first block of silence audio units and a second block of silence audio units of the sequence and eliminating the non-silence sub-phonetic audio units in response to the number of non-silence sub-phonetic audio units, temporally between the first and second blocks of silence audio units, not exceeding a threshold.
  - 19. The machine readable medium of claim 16, further comprising instructions that, in response to being executed on the computing device, cause the computing device to provide user dependent key phrase enrollment by:
    - generating a second final sequence of audio units corresponding to a second received audio input, wherein the key phrase recognition model further comprises a second transition from the single start state based rejection model to a second key phrase model, wherein the second key phrase model comprises a plurality of second states having second transitions therebetween, the plurality of second states including a second final state of the second key phrase model, wherein the plurality of second states of the second key phrase model correspond to the second final sequence of audio units.
  - 20. The machine readable medium of claim 16, further comprising instructions that, in response to being executed on the computing device, cause the computing device to provide user dependent key phrase enrollment by:
    - generating a second final sequence of audio units corresponding to a second received audio input, wherein the key phrase recognition model further comprises a second transition from the single start based rejection state model to a second key phrase model, wherein the second key phrase model comprises a plurality of second states having second transitions therebetween, the plurality of second states including the final state of the key phrase model shared with the second key phrase model, wherein the plurality of second states of the second key phrase model correspond to the second final sequence of audio units.
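
The model topology described in claims 16, 19, and 20 can be pictured with a small Python data structure. This is a sketch under stated assumptions, not the claimed implementation: the class `KeyPhraseGraph`, its fields, and the integer state ids are hypothetical, self loops on key phrase states are an added assumption, and sharing a final state is modeled here by requiring the second sequence to end in the same audio unit as the shared state.

```python
REJECT = 0  # id of the single start/rejection state (hypothetical numbering)


class KeyPhraseGraph:
    def __init__(self, rejection_units):
        self.rejection_units = list(rejection_units)  # one self loop per unit
        self.unit_of = {}                  # key phrase state id -> audio unit
        self.edges = {(REJECT, REJECT)}    # (source, destination) transitions
        self.finals = []                   # final state of each key phrase model
        self._next_id = 1

    def add_phrase(self, units, share_final_with=None):
        """Add a chain of states for one final sequence; return its final state."""
        if not units:
            raise ValueError("a key phrase needs at least one audio unit")
        if share_final_with is not None:
            if self.unit_of[share_final_with] != units[-1]:
                raise ValueError("shared final state must match the last unit")
            body = units[:-1]  # the last unit is covered by the shared state
        else:
            body = units
        prev = REJECT  # every chain starts with a transition out of REJECT
        for unit in body:
            state = self._next_id
            self._next_id += 1
            self.unit_of[state] = unit
            self.edges.add((prev, state))
            self.edges.add((state, state))  # assumption: states may repeat
            prev = state
        if share_final_with is None:
            self.finals.append(prev)
            return prev
        self.edges.add((prev, share_final_with))
        return share_final_with


if __name__ == "__main__":
    graph = KeyPhraseGraph(["n1", "n2"])
    first = graph.add_phrase(["a", "b", "c"])                     # own final state
    second = graph.add_phrase(["x", "y"])                         # claim 19 style
    third = graph.add_phrase(["d", "c"], share_final_with=first)  # claim 20 style
    print(first, second, third, sorted(graph.edges))
```

Each call to `add_phrase` adds one more transition out of the rejection state, so several enrolled utterances or phrases can be evaluated against the same rejection model.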

Current Assignee: Intel Corporation
Original Assignee: Intel IP Corporation (Intel Corporation)
Inventors: Bocklet, Tobias; Bauer, Josef G.
Primary Examiner(s): Sirjani, Fariba
Application Number: US15/201,016
Publication Number: US 20180005633A1
Time in Patent Office: 767 Days
Field of Search: None

CPC Class Codes
- G10L 15/08   Speech classification or se...
- G10L 17/02   Preprocessing operations, e...
- G10L 17/04   Training, enrolment or mode...
- G10L 17/14   Use of phonemic categorisat...
- G10L 17/18   Artificial neural networks;...
- G10L 17/24   the user being prompted to ...
- G10L 2015/088   Word spotting
