User defined key phrase detection by user dependent sequence modeling
First Claim
1. A computer-implemented method for user dependent key phrase enrollment comprising:
receiving, via a microphone, an audio input representing a user defined key phrase and converting the audio input to received audio data representative of the audio input;
determining a sequence of most probable audio units corresponding to the received audio data, wherein each audio unit of most probable audio units corresponds to a frame of a plurality of frames of the audio data;
processing the sequence of most probable audio units to eliminate at least one audio unit from the sequence of most probable audio units to generate a final sequence of audio units by determining a first silence audio unit of the sequence and a number of silence audio units immediately temporally following the first silence audio unit, wherein the first silence audio unit and the number of silence audio units are between non-silence audio units of the sequence, and eliminating the first silence audio unit and the immediately temporally following silence audio units in response to the total number of consecutive silence audio units not exceeding a threshold;
generating a key phrase recognition model representing the user defined key phrase based on the final sequence of audio units, the key phrase recognition model comprising a single start state based rejection model, a key phrase model, and a transition from the single start state based rejection model to the key phrase model, wherein the single start state based rejection model includes a single rejection state having a plurality of rejection model self loops, wherein the key phrase model comprises a plurality of states having transitions therebetween, the plurality of states including a final state of the key phrase model, and wherein the plurality of states of the key phrase model correspond to the final sequence of audio units;
receiving a further audio input for evaluation by the key phrase recognition model;
generating a time series of scores of audio units based on a time series of feature vectors representative of the further audio input;
scoring the key phrase recognition model based on the time series of scores of audio units to generate a rejection likelihood score and a key phrase likelihood score; and
recognizing that the further audio input corresponds to the user defined key phrase based on the rejection likelihood score and the key phrase likelihood score.
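The silence-elimination step recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the `"sil"` label, the function name `drop_short_silences`, and the default threshold of 3 are hypothetical choices made only for illustration. The sketch removes a run of consecutive silence units only when the run lies between non-silence units and its length does not exceed the threshold.

```python
from itertools import groupby

SILENCE = "sil"  # hypothetical label for the silence audio unit


def drop_short_silences(units, max_run=3):
    """Return a copy of `units` without short interior silence runs.

    A run is dropped only if it is silence, has non-silence units on both
    sides, and contains at most `max_run` units (an assumed threshold).
    """
    # Collapse the per-frame sequence into (label, run_length) pairs.
    runs = [(label, len(list(group))) for label, group in groupby(units)]
    result = []
    for i, (label, count) in enumerate(runs):
        # Adjacent runs always differ in label, so an interior silence run
        # is necessarily bounded by non-silence units.
        interior = 0 < i < len(runs) - 1
        if label == SILENCE and interior and count <= max_run:
            continue  # eliminate the whole short silence run
        result.extend([label] * count)
    return result
```

Leading and trailing silence is left untouched here because the claim language only addresses silence between non-silence units; how such edge runs are handled is outside this sketch.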
3 Assignments
0 Petitions
Abstract
Techniques related to key phrase detection for applications such as wake on voice are discussed. Such techniques may include determining a sequence of audio units for received audio input representing a user defined key phrase, eliminating audio units from the sequence to generate a final sequence of audio units, and generating a key phrase recognition model representing the user defined key phrase based on the final sequence.
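As a rough illustration of how a model built from the final sequence might be scored, the following Python sketch tracks one rejection state and a left-to-right chain of key phrase states. Several details are assumptions not stated in the text above: scores are treated as per-frame log-likelihoods, the rejection state is updated with the best of its self-loop units, each key phrase state may stay in place or advance by one state per frame, and the final decision compares the difference of the two scores to a threshold. The names `score_key_phrase`, `is_key_phrase`, and `rejection_units` are hypothetical.

```python
NEG_INF = float("-inf")


def score_key_phrase(frame_scores, key_units, rejection_units):
    """Return (rejection_score, key_phrase_score) after all frames.

    frame_scores: iterable of dicts mapping audio unit -> log-likelihood.
    key_units: final sequence of audio units, one key phrase state each.
    rejection_units: units attached to the rejection state's self loops.
    """
    if not key_units:
        raise ValueError("key_units must contain at least one audio unit")
    rejection = 0.0                      # single start/rejection state
    states = [NEG_INF] * len(key_units)  # key phrase chain, not yet entered
    for scores in frame_scores:
        prev_rejection, prev_states = rejection, states
        # Rejection state: best self loop for this frame (assumed max rule).
        rejection = prev_rejection + max(scores[u] for u in rejection_units)
        states = []
        for i, unit in enumerate(key_units):
            # State 0 is entered from the rejection state, others from
            # the preceding key phrase state.
            entering = prev_rejection if i == 0 else prev_states[i - 1]
            states.append(max(prev_states[i], entering) + scores[unit])
    # The final state of the chain carries the key phrase likelihood score.
    return rejection, states[-1]


def is_key_phrase(rejection_score, key_phrase_score, threshold=0.0):
    """Assumed decision rule: log-likelihood difference above a threshold."""
    return key_phrase_score - rejection_score > threshold
```

For example, with `key_units = ["a", "b"]` and frames that favor `"a"` then `"b"`, the key phrase score ends up higher than the rejection score and `is_key_phrase` returns `True` under the default threshold.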
71 Citations
20 Claims
1. A computer-implemented method for user dependent key phrase enrollment comprising:
receiving, via a microphone, an audio input representing a user defined key phrase and converting the audio input to received audio data representative of the audio input;
determining a sequence of most probable audio units corresponding to the received audio data, wherein each audio unit of most probable audio units corresponds to a frame of a plurality of frames of the audio data;
processing the sequence of most probable audio units to eliminate at least one audio unit from the sequence of most probable audio units to generate a final sequence of audio units by determining a first silence audio unit of the sequence and a number of silence audio units immediately temporally following the first silence audio unit, wherein the first silence audio unit and the number of silence audio units are between non-silence audio units of the sequence, and eliminating the first silence audio unit and the immediately temporally following silence audio units in response to the total number of consecutive silence audio units not exceeding a threshold;
generating a key phrase recognition model representing the user defined key phrase based on the final sequence of audio units, the key phrase recognition model comprising a single start state based rejection model, a key phrase model, and a transition from the single start state based rejection model to the key phrase model, wherein the single start state based rejection model includes a single rejection state having a plurality of rejection model self loops, wherein the key phrase model comprises a plurality of states having transitions therebetween, the plurality of states including a final state of the key phrase model, and wherein the plurality of states of the key phrase model correspond to the final sequence of audio units;
receiving a further audio input for evaluation by the key phrase recognition model;
generating a time series of scores of audio units based on a time series of feature vectors representative of the further audio input;
scoring the key phrase recognition model based on the time series of scores of audio units to generate a rejection likelihood score and a key phrase likelihood score; and
recognizing that the further audio input corresponds to the user defined key phrase based on the rejection likelihood score and the key phrase likelihood score.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for providing user dependent key phrase enrollment comprising:
a microphone to receive an audio input representing a user defined key phrase and to convert the audio input to received audio data representative of the audio input;
a memory to store the received audio input and a key phrase recognition model; and
a processor coupled to the memory, the processor:
to determine a sequence of most probable audio units corresponding to the received audio data, wherein each audio unit of the most probable audio units corresponds to a frame of a plurality of frames of the audio data,
to process the sequence of most probable audio units to eliminate at least one audio unit from the sequence of most probable audio units to generate a final sequence of audio units by the processor to determine a first silence audio unit of the sequence and a number of silence audio units immediately temporally following the first silence audio unit, wherein the first silence audio unit and the number of silence audio units are between non-silence audio units of the sequence, and to eliminate the first silence audio unit and the immediately temporally following silence audio units in response to the total number of consecutive silence audio units not exceeding a threshold, and
to generate the key phrase recognition model representing the user defined key phrase based on the final sequence of audio units, the key phrase recognition model comprising a single start state based rejection model, a key phrase model, and a transition from the single start state based rejection model to the key phrase model, wherein the single start state based rejection model includes a single rejection state having a plurality of rejection model self loops, wherein the key phrase model comprises a plurality of states having transitions therebetween, the plurality of states including a final state of the key phrase model, and wherein the plurality of states of the key phrase model correspond to the final sequence of audio units,
wherein the microphone receives a further audio input for evaluation by the key phrase recognition model, the processor further:
to generate a time series of scores of audio units based on a time series of feature vectors representative of the further audio input,
to score the key phrase recognition model based on the time series of scores of audio units to generate a rejection likelihood score and a key phrase likelihood score, and
to recognize that the further audio input corresponds to the user defined key phrase based on the rejection likelihood score and the key phrase likelihood score.
Dependent claims: 12, 13, 14, 15
16. At least one non-transitory machine readable medium comprising a plurality of instructions that, in response to being executed on a computing device, cause the computing device to provide user dependent key phrase enrollment by:
receiving, via a microphone, an audio input representing a user defined key phrase and converting the audio input to received audio data representative of the audio input;
determining a sequence of most probable audio units corresponding to the received audio data, wherein each audio unit of the most probable audio units corresponds to a frame of a plurality of frames of the audio data;
processing the sequence of most probable audio units to eliminate at least one audio unit from the sequence of most probable audio units to generate a final sequence of audio units by determining a first silence audio unit of the sequence and a number of silence audio units immediately temporally following the first silence audio unit, wherein the first silence audio unit and the number of silence audio units are between non-silence audio units of the sequence, and eliminating the first silence audio unit and the immediately temporally following silence audio units in response to the total number of consecutive silence audio units not exceeding a threshold;
generating a key phrase recognition model representing the user defined key phrase based on the final sequence of audio units, the key phrase recognition model comprising a single start state based rejection model, a key phrase model, and a transition from the single start state based rejection model to the key phrase model, wherein the single start state based rejection model includes a single rejection state having a plurality of rejection model self loops, wherein the key phrase model comprises a plurality of states having transitions therebetween, the plurality of states including a final state of the key phrase model, and wherein the plurality of states of the key phrase model correspond to the final sequence of audio units;
receiving a further audio input for evaluation by the key phrase recognition model;
generating a time series of scores of audio units based on a time series of feature vectors representative of the further audio input;
scoring the key phrase recognition model based on the time series of scores of audio units to generate a rejection likelihood score and a key phrase likelihood score; and
recognizing that the further audio input corresponds to the user defined key phrase based on the rejection likelihood score and the key phrase likelihood score.
Dependent claims: 17, 18, 19, 20
Specification