Automatic speech recognizer for real time operation
Abstract
A speech recognizer identifies an unknown utterance as a variable length string of stored reference patterns in a single pass through the time frame sequence of utterance feature signals. A plurality of reference pattern levels are used to permit strings of varying lengths. As each utterance time frame portion is received, its acoustic feature signals are time registered with the reference pattern feature signals at each reference pattern level to form reference pattern end frame registration path and registration path correspondence signals. Responsive to the plurality of level reference pattern end frame registration path signals, reference pattern strings are selected for the current utterance frame. The utterance is identified as the selected reference pattern string with the best correspondence to the utterance from the registration path signals of the reference levels of the last utterance time frame.
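The abstract outlines a frame-synchronous, level-building form of connected-word recognition. As a concrete anchor for the notation used in the claims below, here is a minimal sketch of the lowest-level operation: the per-frame distance d(t,i,j) between the acoustic features of utterance frame j and every frame i of a reference template t. Everything in it is an assumption for illustration, not the patented apparatus: the feature-vector representation, the Euclidean metric (the patent's hardware would compute its own correspondence measure, e.g. an LPC-based distance), and the name local_distances.

```python
import numpy as np

# Minimal sketch, not the patent's circuitry: the "acoustic feature
# signals" of one utterance time frame are modeled as a feature vector,
# and a reference template as an (I_t x d) array of such vectors.

def local_distances(utt_frame, template):
    """d(t, i, j) for a fixed template t and utterance frame j: the
    distance from the frame's features to every template frame i."""
    # Euclidean distance stands in for whatever correspondence measure
    # the recognizer's hardware actually computes (an assumption).
    return np.linalg.norm(template - utt_frame, axis=1)

# Example: one 8-dimensional feature frame against a 30-frame template.
utt_frame = np.random.rand(8)
template = np.random.rand(30, 8)
print(local_distances(utt_frame, template).shape)  # (30,)
```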
Claims (6)
1. In a speech analyzer having a set of stored reference pattern templates t=1,2, . . . ,V each comprising a time frame sequence i=1,2, . . . ,It of acoustic feature signals including an end frame i=It representative of an identified reference pattern, a method for recognizing an unknown utterance as a string of reference patterns e.g. t1,t2, . . . ,t3 comprising the steps of:
producing signals representative of the time frame sequence j=1,2, . . . ,J of acoustic features of the utterance responsive to the acoustic pattern of the utterance;
generating at least one reference pattern string e.g. t1,t2, . . . ,t3 responsive to the acoustic feature signals of the time frame sequence j=1,2, . . . ,J of the utterance and the acoustic feature signals of the time frame sequence i=1,2, . . . ,It of the reference patterns t=1,2, . . . ,V; and
identifying the utterance as one of said reference pattern strings e.g. t1,t2, . . . ,t3;
wherein the step of producing signals representative of the time frame sequence j=1,2, . . . ,J of acoustic features of the utterance comprises:
receiving the currently occurring time frame portion of the utterance;
generating a signal j identifying the time frame in which the current portion of the utterance occurs in the succession of utterance time frames j=1,2, . . . ,J responsive to the currently occurring portion of the utterance; and
producing a signal representative of the acoustic features of the jth frame portion of the utterance responsive to the received currently occurring time frame portion of the utterance;
said step of generating at least one reference pattern string e.g. t1,t2, . . . ,t3 responsive to the acoustic feature signals of the time frame sequence of the utterance and the acoustic feature signals of the time frame sequence i=1,2, . . . ,It of the reference patterns comprises, responsive to the producing of the acoustic feature signals of the currently occurring portion of the utterance in the current time frame j, performing the following steps:
(a) producing a set of signals identifying levels L=1,2, . . . ,LMAX, each level corresponding to the position of a reference pattern in the at least one reference pattern string;
(b) time registering the acoustic feature signals of the current time frame j portion of the utterance with the acoustic feature signals of the time frames i=1,2, . . . ,It of each reference pattern for each level L=1,2, . . . ,LMAX responsive to the acoustic feature signals of the current time frame portion of the utterance and the acoustic feature signals of the time frame portions of the reference patterns; and
(c) producing a set of cumulative correspondence signals for the time registration path ending time frames It of the reference patterns at levels L=1,2, . . . ,LMAX for the currently occurring time frame j portion of the utterance; and
the step of identifying the utterance as one of said reference pattern strings e.g. t1,t2, . . . ,t3 comprises generating signals representative of reference pattern strings after the formation of the time registration path and time registration path correspondence signals of the levels for the last utterance time frame J responsive to the time registration path and time registration path cumulative correspondence signals for the reference pattern ending time frames It of levels L=1,2, . . . ,LMAX of the utterance portion time frames j=1,2, . . . ,J.
(Dependent claims 2, 3, 4 and 5 not shown.)
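Steps (b) and (c) of claim 1 amount to one dynamic-time-warping update per utterance frame: each partial registration path through a reference pattern is extended by the current frame, and the value reached at the pattern's end frame It is that level's cumulative correspondence signal. The sketch below shows that single update for one reference pattern at one level. The local path constraints (stay on a template frame or advance by one) and the names register_frame, prev_col, and entry_cost are assumptions made for illustration; this is a reading of the claim, not the claimed circuitry.

```python
import numpy as np

INF = float("inf")

def register_frame(prev_col, local_d, entry_cost):
    """One time-registration update for a single reference pattern at a
    single level.  prev_col[i] is the cumulative distance of the best
    path sitting at template frame i after utterance frame j-1;
    local_d[i] is d(t, i, j); entry_cost is the cost of starting this
    pattern at frame j (drawn from the previous level's end-frame
    signals)."""
    n = len(prev_col)
    new_col = np.full(n, INF)
    for i in range(n):
        best = prev_col[i]                     # stay on template frame i
        if i > 0:
            best = min(best, prev_col[i - 1])  # advance one template frame
        if i == 0:
            best = min(best, entry_cost)       # enter the pattern here
        if best < INF:
            new_col[i] = best + local_d[i]
    return new_col  # new_col[-1] is the end-frame cumulative signal

# Example: a 5-frame template, no prior paths, entry cost 0.
print(register_frame(np.full(5, INF), np.ones(5), 0.0))
```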
6. In a speech analyzer having a set of stored reference word templates t=1,2, . . . ,V each comprising a time frame sequence i=1,2, . . . ,It extending to a word ending boundary frame i=It of acoustic feature signals representative of an identified reference word, a method for recognizing an input speech pattern as a string of predetermined reference words e.g. t1,t2, . . . ,t3 comprising the steps of:
producing signals representative of the acoustic features of the successive time frames j=1,2, . . . ,J of the input speech pattern;
generating at least one string e.g. t1,t2, . . . ,t3 of the identified reference words responsive to the acoustic feature signals of the time frame sequence j=1,2, . . . ,J of the input speech pattern and the time frame sequence i=1,2, . . . ,It of acoustic feature signals of the reference words t=1,2, . . . ,V; and
identifying the input speech pattern as one of said reference word strings e.g. t1,t2, . . . ,t3;
wherein the step of producing the signals representative of the acoustic features of successive time frames of the input speech pattern comprises:
receiving the currently occurring time frame portion of the input speech pattern;
generating a signal identifying the time frame j corresponding to the currently occurring portion of the input speech pattern; and
forming a signal representative of the acoustic features of the currently occurring jth time frame portion of the speech pattern responsive to the currently occurring portion of the input speech pattern; and
the step of generating at least one reference word string e.g. t1,t2, . . . ,t3 responsive to the acoustic feature signals of the time frames j=1,2, . . . ,J of the speech pattern and the acoustic feature signals of time frames i=1,2, . . . ,It of the reference words t=1,2, . . . ,V comprises:
for the currently occurring speech pattern time frame j in the succession of speech pattern time frame portions j=1,2, . . . ,J, producing a set of signals identifying a plurality of reference word levels L=1,2, . . . ,LMAX for the currently occurring speech pattern time frame portion j; and
for each identified level signal L=1,2, . . . ,LMAX in the currently occurring speech pattern time frame portion j, performing steps (a), (b), (c) and (d):
(a) forming a signal d(t,i,j) representative of the distance between the acoustic features of the currently occurring speech pattern time frame portion j and the acoustic features of each reference word time frame i=1,2, . . . ,It responsive to the acoustic feature signals of the currently occurring speech pattern time frame portion j and the acoustic feature signals of the reference word time frames i=1,2, . . . ,It;
(b) forming a signal Lp(t,i,j,L) representative of the time registration path of the speech pattern and each reference word responsive to the distance signals d(t,i,j) formed for the currently occurring speech pattern time frame portion j and the distance signals for the preceding time frame portions j-1,j-2, . . . ,1 of the speech pattern;
(c) forming a signal s(t,i,j,L) representative of the cumulative distance between the speech pattern acoustic features and the reference word features along the time registration paths Lp(t,i,j,L) up to the currently occurring speech time frame j responsive to the distance signals d(t,i,j) of the currently occurring speech pattern time frame j and the j-1,j-2, . . . ,1 preceding time frames of the speech pattern; and
(d) for the word ending boundary frame (It) of each reference word, generating signals T(j,L), F(j,L) identifying reference word strings and signals S(j,L) representative of the cumulative distance between the identified reference word strings and the speech pattern responsive to the time registration path and cumulative distance signals of the word level; and
the step of identifying the speech pattern as one of said reference word strings comprises, after the last speech pattern time frame J, selecting the best matching reference word string responsive to the cumulative distance signals S(j,L) of the identified reference word strings.
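Claim 6 names all the signals needed to assemble a complete single-pass recognizer: the local distances d(t,i,j), the cumulative distances s(t,i,j,L) along the registration paths Lp, and the per-level word-boundary records T(j,L), F(j,L), S(j,L) from which the best string is read off after the last frame J. The sketch below ties them together in software. It is one plausible reading under stated assumptions, all of them mine rather than the patent's: Euclidean local distance, stay-or-advance path moves, a level-L+1 word starting the frame after a level-L word ends, the first word starting at frame 0, and every identifier name.

```python
import numpy as np

INF = float("inf")

def recognize(utterance, templates, lmax):
    """Illustrative single-pass recognizer in the spirit of claim 6 (not
    the patented hardware).  utterance is a (J x d) array of feature
    frames; templates is a list of (I_t x d) arrays; lmax bounds the
    string length.  Returns (template-index string, total distance)."""
    J = len(utterance)
    # D[L][t][i] plays the role of s(t,i,j,L): cumulative distance of
    # the best path now at frame i of word t on level L.  SF propagates
    # the utterance frame at which that path entered the word -- the one
    # datum of the path signals Lp that backtracking needs here.
    D  = [[np.full(len(tp), INF) for tp in templates] for _ in range(lmax)]
    SF = [[np.zeros(len(tp), dtype=int) for tp in templates] for _ in range(lmax)]
    # Per (level, frame) word-boundary records: S(j,L), T(j,L), F(j,L).
    S = [[INF] * J for _ in range(lmax)]
    T = [[-1] * J for _ in range(lmax)]
    F = [[-1] * J for _ in range(lmax)]

    for j in range(J):                        # one pass over the utterance
        for L in range(lmax):
            # Cost of starting a word at this level in frame j: zero for
            # the first word at frame 0 (an assumed convention), else
            # the previous level's best word ending at frame j-1.
            if L == 0:
                entry = 0.0 if j == 0 else INF
            else:
                entry = S[L - 1][j - 1] if j > 0 else INF
            for t, tp in enumerate(templates):
                d = np.linalg.norm(tp - utterance[j], axis=1)  # d(t,i,j)
                prevD, prevSF = D[L][t], SF[L][t]
                newD = np.full(len(tp), INF)
                newSF = np.zeros(len(tp), dtype=int)
                for i in range(len(tp)):
                    best, src = prevD[i], prevSF[i]          # stay at i
                    if i > 0 and prevD[i - 1] < best:        # advance
                        best, src = prevD[i - 1], prevSF[i - 1]
                    if i == 0 and entry < best:              # enter word
                        best, src = entry, j
                    if best < INF:
                        newD[i], newSF[i] = best + d[i], src
                D[L][t], SF[L][t] = newD, newSF
                if newD[-1] < S[L][j]:        # word-ending boundary frame
                    S[L][j], T[L][j], F[L][j] = newD[-1], t, int(newSF[-1])

    # "Identify" step: best end-frame record at the last frame J over
    # all levels, then backtrack through the F(j,L) start frames.
    bestL = min(range(lmax), key=lambda L: S[L][J - 1])
    if S[bestL][J - 1] == INF:
        return [], INF
    string, L, j = [], bestL, J - 1
    while L >= 0:
        string.append(T[L][j])
        j = F[L][j] - 1                       # previous word ends here
        L -= 1
    return string[::-1], S[bestL][J - 1]
```

A call such as recognize(utterance, templates, lmax=5) returns an index string like [2, 0, 3] together with its total distance; strings of varying lengths fall out of taking the minimum over levels at the last frame, which mirrors the abstract's selection of the best-corresponding string.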
Specification