Automatic speech recognizer for real time operation
Abstract
A speech recognizer identifies an unknown utterance as a variable length string of stored reference patterns in a single pass through the time frame sequence of utterance feature signals. A plurality of reference pattern levels are used to permit strings of varying lengths. As each utterance time frame portion is received, its acoustic feature signals are time registered with the reference pattern feature signals at each reference pattern level to form reference pattern end frame registration path and registration path correspondence signals. Responsive to the plurality of level reference pattern end frame registration path signals, reference pattern strings are selected for the current utterance frame. The utterance is identified as the selected reference pattern string with the best correspondence to the utterance from the registration path signals of the reference levels of the last utterance time frame.
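The abstract outlines a frame-synchronous, level-building form of connected-word recognition. As a concrete anchor for the notation used in the claims below, here is a minimal sketch of the lowest-level operation: the per-frame distance d(t,i,j) between the acoustic features of utterance frame j and every frame i of a reference template t. Everything in it is an assumption for illustration, not the patented apparatus: the feature-vector representation, the Euclidean metric (the patent's hardware would compute its own correspondence measure, e.g. an LPC-based distance), and the name local_distances.

```python
import numpy as np

# Minimal sketch, not the patent's circuitry: the "acoustic feature
# signals" of one utterance time frame are modeled as a feature vector,
# and a reference template as an (I_t x d) array of such vectors.

def local_distances(utt_frame, template):
    """d(t, i, j) for a fixed template t and utterance frame j: the
    distance from the frame's features to every template frame i."""
    # Euclidean distance stands in for whatever correspondence measure
    # the recognizer's hardware actually computes (an assumption).
    return np.linalg.norm(template - utt_frame, axis=1)

# Example: one 8-dimensional feature frame against a 30-frame template.
utt_frame = np.random.rand(8)
template = np.random.rand(30, 8)
print(local_distances(utt_frame, template).shape)  # (30,)
```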
Claims (6)
1. In a speech analyzer having a set of stored reference pattern templates t=1,2, . . . ,V each comprising a time frame sequence i=1,2, . . . ,It of acoustic feature signals including an end frame i=It representative of an identified reference pattern, a method for recognizing an unknown utterance as a string of reference patterns e.g. t1,t2, . . . ,t3 comprising the steps of:
producing signals representative of the time frame sequence j=1,2, . . . ,J of acoustic features of the utterance responsive to the acoustic pattern of the utterance;
generating at least one reference pattern string e.g. t1,t2, . . . ,t3 responsive to the acoustic feature signals of the time frame sequence j=1,2, . . . ,J of the utterance and the acoustic feature signals of the time frame sequence i=1,2, . . . ,It of the reference patterns t=1,2, . . . ,V; and
identifying the utterance as one of said reference pattern strings e.g. t1,t2, . . . ,t3;
wherein the step of producing signals representative of the time frame sequence j=1,2, . . . ,J of acoustic features of the utterance comprises:
receiving the currently occurring time frame portion of the utterance;
generating a signal j identifying the time frame in which the current portion of the utterance occurs in the succession of utterance time frames j=1,2, . . . ,J responsive to the currently occurring portion of the utterance; and
producing a signal representative of the acoustic features of the jth frame portion of the utterance responsive to the received currently occurring time frame portion of the utterance;
said step of generating at least one reference pattern string e.g. t1,t2, . . . ,t3 responsive to the acoustic feature signals of the time frame sequence of the utterance and the acoustic feature signals of the time frame sequence i=1,2, . . . ,It of the reference patterns comprises, responsive to the producing of the acoustic feature signals of the currently occurring portion of the utterance in the current time frame j, performing the following steps:
(a) producing a set of signals identifying levels L=1,2, . . . ,LMAX, each level corresponding to the position of a reference pattern in the at least one reference pattern string;
(b) time registering the acoustic feature signals of the current time frame j portion of the utterance with the acoustic feature signals of the time frames i=1,2, . . . ,It of each reference pattern for each level L=1,2, . . . ,LMAX responsive to the acoustic feature signals of the current time frame portion of the utterance and the acoustic feature signals of the time frame portions of the reference patterns; and
(c) producing a set of cumulative correspondence signals for the time registration path ending time frames It of the reference patterns at levels L=1,2, . . . ,LMAX for the currently occurring time frame j portion of the utterance; and
the step of identifying the utterance as one of said reference pattern strings e.g. t1,t2, . . . ,t3 comprises generating signals representative of reference pattern strings after the formation of the time registration path and time registration path correspondence signals of the levels for the last utterance time frame J responsive to the time registration path and time registration path cumulative correspondence signals for the reference pattern ending time frames It of levels L=1,2, . . . ,LMAX of the utterance portion time frames j=1,2, . . . ,J.
(Dependent claims 2, 3, 4 and 5 not shown.)
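Steps (b) and (c) of claim 1 amount to one dynamic-time-warping update per utterance frame: each partial registration path through a reference pattern is extended by the current frame, and the value reached at the pattern's end frame It is that level's cumulative correspondence signal. The sketch below shows that single update for one reference pattern at one level. The local path constraints (stay on a template frame or advance by one) and the names register_frame, prev_col, and entry_cost are assumptions made for illustration; this is a reading of the claim, not the claimed circuitry.

```python
import numpy as np

INF = float("inf")

def register_frame(prev_col, local_d, entry_cost):
    """One time-registration update for a single reference pattern at a
    single level.  prev_col[i] is the cumulative distance of the best
    path sitting at template frame i after utterance frame j-1;
    local_d[i] is d(t, i, j); entry_cost is the cost of starting this
    pattern at frame j (drawn from the previous level's end-frame
    signals)."""
    n = len(prev_col)
    new_col = np.full(n, INF)
    for i in range(n):
        best = prev_col[i]                     # stay on template frame i
        if i > 0:
            best = min(best, prev_col[i - 1])  # advance one template frame
        if i == 0:
            best = min(best, entry_cost)       # enter the pattern here
        if best < INF:
            new_col[i] = best + local_d[i]
    return new_col  # new_col[-1] is the end-frame cumulative signal

# Example: a 5-frame template, no prior paths, entry cost 0.
print(register_frame(np.full(5, INF), np.ones(5), 0.0))
```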
6. In a speech analyzer having a set of stored reference word templates t=1,2, . . . ,V each comprising a time frame sequence i=1,2, . . . ,It extending to a word ending boundary frame i=It of acoustic feature signals representative of an identified reference word, a method for recognizing an input speech pattern as a string of predetermined reference words e.g. t1,t2, . . . ,t3 comprising the steps of:
producing signals representative of the acoustic features of the successive time frames j=1,2, . . . ,J of the input speech pattern;
generating at least one string e.g. t1,t2, . . . ,t3 of the identified reference words responsive to the acoustic feature signals of the time frame sequence j=1,2, . . . ,J of the input speech pattern and the time frame sequence i=1,2, . . . ,It of acoustic feature signals of the reference words t=1,2, . . . ,V; and
identifying the input speech pattern as one of said reference word strings e.g. t1,t2, . . . ,t3;
wherein the step of producing the signals representative of the acoustic features of successive time frames of the input speech pattern comprises:
receiving the currently occurring time frame portion of the input speech pattern;
generating a signal identifying the time frame j corresponding to the currently occurring portion of the input speech pattern; and
forming a signal representative of the acoustic features of the currently occurring jth time frame portion of the speech pattern responsive to the currently occurring portion of the input speech pattern; and
the step of generating at least one reference word string e.g. t1,t2, . . . ,t3 responsive to the acoustic feature signals of the time frames j=1,2, . . . ,J of the speech pattern and the acoustic feature signals of time frames i=1,2, . . . ,It of the reference words t=1,2, . . . ,V comprises:
for the currently occurring speech pattern time frame j in the succession of speech pattern time frame portions j=1,2, . . . ,J, producing a set of signals identifying a plurality of reference word levels L=1,2, . . . ,LMAX for the currently occurring speech pattern time frame portion j; and
for each identified level signal L=1,2, . . . ,LMAX in the currently occurring speech pattern time frame portion j, performing steps (a), (b), (c) and (d):
(a) forming a signal d(t,i,j) representative of the distance between the acoustic features of the currently occurring speech pattern time frame portion j and the acoustic features of each reference word time frame i=1,2, . . . ,It responsive to the acoustic feature signals of the currently occurring speech pattern time frame portion j and the acoustic feature signals of the reference word time frames i=1,2, . . . ,It;
(b) forming a signal Lp(t,i,j,L) representative of the time registration path of the speech pattern and each reference word responsive to the distance signals d(t,i,j) formed for the currently occurring speech pattern time frame portion j and the distance signals for the preceding time frame portions j-1,j-2, . . . ,1 of the speech pattern;
(c) forming a signal s(t,i,j,L) representative of the cumulative distance between the speech pattern acoustic features and the reference word features along the time registration paths Lp(t,i,j,L) up to the currently occurring speech time frame j responsive to the distance signals d(t,i,j) of the currently occurring speech pattern time frame j and the j-1,j-2, . . . ,1 preceding time frames of the speech pattern; and
(d) for the word ending boundary frame (It) of each reference word, generating signals T(j,L), F(j,L) identifying reference word strings and signals S(j,L) representative of the cumulative distance between the identified reference word strings and the speech pattern responsive to the time registration path and cumulative distance signals of the word level; and
the step of identifying the speech pattern as one of said reference word strings comprises, after the last speech pattern time frame J, selecting the best matching reference word string responsive to the cumulative distance signals S(j,L) of the identified reference word strings.
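Claim 6 names all the signals needed to assemble a complete single-pass recognizer: the local distances d(t,i,j), the cumulative distances s(t,i,j,L) along the registration paths Lp, and the per-level word-boundary records T(j,L), F(j,L), S(j,L) from which the best string is read off after the last frame J. The sketch below ties them together in software. It is one plausible reading under stated assumptions, all of them mine rather than the patent's: Euclidean local distance, stay-or-advance path moves, a level-L+1 word starting the frame after a level-L word ends, the first word starting at frame 0, and every identifier name.

```python
import numpy as np

INF = float("inf")

def recognize(utterance, templates, lmax):
    """Illustrative single-pass recognizer in the spirit of claim 6 (not
    the patented hardware).  utterance is a (J x d) array of feature
    frames; templates is a list of (I_t x d) arrays; lmax bounds the
    string length.  Returns (template-index string, total distance)."""
    J = len(utterance)
    # D[L][t][i] plays the role of s(t,i,j,L): cumulative distance of
    # the best path now at frame i of word t on level L.  SF propagates
    # the utterance frame at which that path entered the word -- the one
    # datum of the path signals Lp that backtracking needs here.
    D  = [[np.full(len(tp), INF) for tp in templates] for _ in range(lmax)]
    SF = [[np.zeros(len(tp), dtype=int) for tp in templates] for _ in range(lmax)]
    # Per (level, frame) word-boundary records: S(j,L), T(j,L), F(j,L).
    S = [[INF] * J for _ in range(lmax)]
    T = [[-1] * J for _ in range(lmax)]
    F = [[-1] * J for _ in range(lmax)]

    for j in range(J):                        # one pass over the utterance
        for L in range(lmax):
            # Cost of starting a word at this level in frame j: zero for
            # the first word at frame 0 (an assumed convention), else
            # the previous level's best word ending at frame j-1.
            if L == 0:
                entry = 0.0 if j == 0 else INF
            else:
                entry = S[L - 1][j - 1] if j > 0 else INF
            for t, tp in enumerate(templates):
                d = np.linalg.norm(tp - utterance[j], axis=1)  # d(t,i,j)
                prevD, prevSF = D[L][t], SF[L][t]
                newD = np.full(len(tp), INF)
                newSF = np.zeros(len(tp), dtype=int)
                for i in range(len(tp)):
                    best, src = prevD[i], prevSF[i]          # stay at i
                    if i > 0 and prevD[i - 1] < best:        # advance
                        best, src = prevD[i - 1], prevSF[i - 1]
                    if i == 0 and entry < best:              # enter word
                        best, src = entry, j
                    if best < INF:
                        newD[i], newSF[i] = best + d[i], src
                D[L][t], SF[L][t] = newD, newSF
                if newD[-1] < S[L][j]:        # word-ending boundary frame
                    S[L][j], T[L][j], F[L][j] = newD[-1], t, int(newSF[-1])

    # "Identify" step: best end-frame record at the last frame J over
    # all levels, then backtrack through the F(j,L) start frames.
    bestL = min(range(lmax), key=lambda L: S[L][J - 1])
    if S[bestL][J - 1] == INF:
        return [], INF
    string, L, j = [], bestL, J - 1
    while L >= 0:
        string.append(T[L][j])
        j = F[L][j] - 1                       # previous word ends here
        L -= 1
    return string[::-1], S[bestL][J - 1]
```

A call such as recognize(utterance, templates, lmax=5) returns an index string like [2, 0, 3] together with its total distance; strings of varying lengths fall out of taking the minimum over levels at the last frame, which mirrors the abstract's selection of the best-corresponding string.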
Specification