Speaker independent speech recognition system and method using neural network and DTW matching technique

US 5,528,728 A
Filed: 07/12/1993
Issued: 06/18/1996
Est. Priority Date: 07/12/1993
Status: Expired due to Term

Abstract

An improved speaker independent speech recognition system and method are disclosed. An utterance by an unspecified person is input through a device such as a telephone and converted into an electrical signal, and the electrical signal is converted into a time series of characteristic multidimensional vectors. Each of the vectors is converted into a plurality of candidates of phonemes, so that the candidates constitute a plurality of strings of phonemes in time series. The candidates of phonemes are compared, one at a time, with a reference string of phonemes for each word previously stored in a dictionary, using a predetermined word matching technique, to determine which string of phonemes derived from the phoneme recognition means has the highest similarity to one of the reference strings of phonemes for the respective words. At least one word candidate, based on the string of phonemes having the highest similarity to the corresponding reference string, is output as the result of speech recognition.
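
The front end described in the abstract (feature vectors in, ranked phoneme candidates out) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `center_windows`, `top_n_candidates`, and `recognize_phonemes` are hypothetical, the five-frame window shifted by one frame follows claim 5, and the scoring function is a caller-supplied stand-in for the neural networks, which the claims do not specify in enough detail to reproduce.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Frame = Sequence[float]  # one characteristic multidimensional vector


def center_windows(frames: Sequence[Frame], width: int = 5) -> Iterator[Tuple[int, Sequence[Frame]]]:
    """Yield (center_index, window) for every run of `width` consecutive
    frames, shifting the window by one frame each time."""
    half = width // 2
    for start in range(len(frames) - width + 1):
        yield start + half, frames[start:start + width]


def top_n_candidates(scores: Dict[str, float], n: int = 2) -> List[str]:
    """Return the n best-scoring phoneme labels, first order first.
    Ties are broken alphabetically (an assumption made for determinism)."""
    return sorted(scores, key=lambda p: (-scores[p], p))[:n]


def recognize_phonemes(
    frames: Sequence[Frame],
    score_window: Callable[[Sequence[Frame]], Dict[str, float]],
    n: int = 2,
    width: int = 5,
) -> List[List[str]]:
    """Produce one ranked candidate list per window center.
    `score_window` stands in for the phoneme recognition means."""
    return [top_n_candidates(score_window(window), n)
            for _, window in center_windows(frames, width)]
```

With seven input frames and a five-frame window, the sketch produces candidates for the frames at indices 2, 3, and 4 only. How the frames at the edges of the utterance are handled is not stated in the source, so the sketch simply skips them.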

11 Claims

1. A speaker independent apparatus for word recognition comprising:
- a) input means for inputting an utterance by an unspecified person into an electrical signal;
  
  b) characteristic extracting means for receiving the electrical signal from the input means and converting the electrical signal into a time series of discrete characteristic multidimensional vectors;
  
  c) phoneme recognition means for selectively receiving the time series of discrete characteristic multidimensional vectors and converting each of said selectively received vectors into a plurality of candidates of phonemes from a first order to an n-th order (n denotes an arbitrary number);
  
  d) word recognition means for receiving a time series string of phonemes from the phoneme recognition means and comparing the plurality of candidates of phonemes, one at a time, with each phoneme of a reference string of phonemes for words previously stored in a dictionary until a final phoneme of the reference string of phonemes for a last word of the words stored in the dictionary and determining a time series of phonemes derived from said phoneme recognition means having a highest similarity to one of the reference strings of the phonemes for the words stored in the dictionary using a predetermined word matching technique;
  
  e) output means for outputting at least one of said candidates of phonemes as a result of word recognition carried out by the word recognition means on the basis of a similarity determination on the plurality of candidates of phonemes with respect to the reference strings of the words stored in said dictionary; and
  
  f) selecting means, interposed between said characteristic extracting means and phoneme recognition means, for selecting a center of a given number of frames of a continued time-series of discrete characteristic multidimensional vectors derived from said characteristic extracting means so that said phoneme recognition means receives the center thereof.
  - 2. An apparatus as set forth in claim 1, wherein said word recognition means for determining the time series of strings of phonemes derived from said phoneme recognition means has a total similarity to the time series reference string of phonemes for each word depending upon a distance from one of the plurality of candidates of the phonemes to one of the phonemes of the reference string of the phonemes in a two-dimensional coordinate matrix and wherein said output means outputs a plurality of word candidates as the result of word recognition according to a derived distance, a first order of the plurality of word candidates being the word having the reference string of phonemes with a smallest distance to the candidates of the time series of phonemes derived from said phoneme recognition means.
  - 3. An apparatus as set forth in claim 1, wherein said predetermined word matching technique, of said word recognition means, is a Dynamic Time Warping (DTW) technique, having an accumulated distance g(i, j) from a time i of the input string of phonemes in the time series to a time j of the reference string of phonemes stored in one of memory areas of the dictionary and calculated using an intervector distance d(i, j) from one of the phonemes of the input string of the phonemes at the time i to one of the phonemes of the reference string of phonemes at the time j:
    - g(i, j) = min{g(i-1, j), g(i-1, j-1), g(i-1, j-2)} + d(i, j), such that
      
      1) if one of the phonemes of the reference string of the phonemes coincides with a corresponding order one of the phonemes of the input string of the phonemes of the first order of candidates, the intervector distance is 0;
      
      2) if one of the phonemes of the reference string of phonemes does not coincide with the corresponding order of the phonemes of the input string of the phonemes of the first order of the candidates but coincides with the corresponding order phoneme of the phonemes of a second order of candidates, the intervector distance is 1; and
      
      3) in other cases, the intervector distance is 2,
      
      at least one path minimizing the accumulated distance is prepared, and a minimum path is derived from the paths prepared for each reference string of phonemes constituting the respective words, the word having the reference string of phonemes with the minimum path to the corresponding input string of the phonemes being output as the result of word recognition.
  - 4. An apparatus as set forth in claim 1, wherein said phoneme recognition means includes back-propagation type parallel run Neural Networks.
  - 5. An apparatus as set forth in claim 4, wherein said Neural Networks retrieve five frames of the time series of the characteristic vectors from said characteristic extraction means to input layers thereof, shift the five frames by one frame, and output at least one phoneme corresponding to a center frame of the input five frames of the time series characteristic vectors from output layers thereof.
  - 6. An apparatus as set forth in claim 5, wherein said Neural Networks comprise 135 input layers, 21 output layers, and 120 hidden layers.
  - 7. An apparatus as set forth in claim 6, wherein said output means outputs the result of word recognition in the form of encoded data of the words.
  - 8. An apparatus as set forth in claim 7, wherein the result of word recognition in the form of the encoded data from said output means is used to control a power plant.
  - 9. An apparatus as set forth in claim 1, wherein in said predetermined word matching technique, of said word recognition means, a number of branches S{I}{J} is determined under such a condition that (1) two branches are not intersected across each other and (2) only one branch can be drawn from each of phonemes when one of the plurality of the phonemes derived from the phoneme recognition means is compared with one of the reference strings of phonemes for the respective words in such a way as:
    - if A{i} = B{j}, S{i}{j} = S{i-1}{j-1} + 1, else S{i}{j} = max(S{i-1}{j}, S{i}{j-1}),
      
      wherein A{i} denotes either of the strings of phonemes, A{i} (i = 1 to I), derived from the phoneme recognition means or retrieved from the dictionary, and B{j} denotes the other string of phonemes to be compared, B{j} (j = 1 to J), and a maximum number of branches N is derived from among S{I}{J} and a magnitude of similarity is derived as N/LA + N/LB, wherein LA denotes a length of the string of phonemes A{i} (i = 1 to I) and LB denotes a length of the string of phonemes B{j} (j = 1 to J), the magnitude of similarity being derived until a final order of candidates from the phoneme recognition means is compared with a final reference string of the word stored in a memory area of the dictionary.
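
The DTW recurrence and the 0/1/2 intervector distance recited in claim 3 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation. The function names are hypothetical, and the boundary conditions (the path starts at the first input frame and first reference phoneme and ends at the last of each) are an assumption, since the claim gives only the recurrence.

```python
import math
from typing import Dict, List, Sequence


def phoneme_distance(ref: str, candidates: Sequence[str]) -> int:
    """Intervector distance d(i, j) per claim 3: 0 if the reference phoneme
    matches the first-order candidate, 1 if it matches the second-order
    candidate, 2 otherwise."""
    if len(candidates) > 0 and ref == candidates[0]:
        return 0
    if len(candidates) > 1 and ref == candidates[1]:
        return 1
    return 2


def dtw_distance(input_candidates: Sequence[Sequence[str]], reference: Sequence[str]) -> float:
    """Accumulated distance g(I, J) using
    g(i, j) = min{g(i-1, j), g(i-1, j-1), g(i-1, j-2)} + d(i, j).
    Assumed boundary: g at the first input frame is defined only for the
    first reference phoneme; unreachable cells stay at infinity."""
    n_in, n_ref = len(input_candidates), len(reference)
    if n_in == 0 or n_ref == 0:
        return math.inf
    g = [[math.inf] * n_ref for _ in range(n_in)]
    g[0][0] = phoneme_distance(reference[0], input_candidates[0])
    for i in range(1, n_in):
        for j in range(n_ref):
            best = g[i - 1][j]
            if j >= 1:
                best = min(best, g[i - 1][j - 1])
            if j >= 2:
                best = min(best, g[i - 1][j - 2])
            if best < math.inf:
                g[i][j] = best + phoneme_distance(reference[j], input_candidates[i])
    return g[n_in - 1][n_ref - 1]


def rank_words(input_candidates: Sequence[Sequence[str]],
               dictionary: Dict[str, Sequence[str]]) -> List[str]:
    """Order dictionary words by accumulated distance, smallest first
    (ties broken alphabetically, an assumption for determinism)."""
    return sorted(dictionary,
                  key=lambda w: (dtw_distance(input_candidates, dictionary[w]), w))
```

Each element of `input_candidates` is the ranked candidate list for one frame, for example `["k", "t"]` for a frame whose first-order candidate is `k` and second-order candidate is `t`. The first entry returned by `rank_words` corresponds to the first-order word candidate described in claim 2.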

10. A method of speaker independent speech recognition comprising the steps of:
- a) inputting an utterance by an unspecified person into an electrical signal;
  
  b) receiving the electrical signal from the input means and converting the electrical signal into a time series of discrete characteristic multidimensional vectors;
  
  c) receiving the time series of discrete characteristic multidimensional vectors and converting each of said vectors into a plurality of candidates of phonemes from a first order to an n-th order (n denotes an arbitrary number);
  
  d) receiving a time series string of phonemes derived at said step c) and comparing the plurality of candidates of phonemes, one at a time, with each phoneme of a reference string of phonemes for each word previously stored in a dictionary until a final phoneme of the reference string of phonemes for a last word of the words stored in the dictionary and determining which time series of phonemes derived at said step c) has totally highest similarity to one of the reference strings of the phonemes for the respective words stored in the dictionary using a predetermined word matching technique;
  
  e) outputting at least one word candidate as a result of word recognition carried out at said step d) on the basis of the similarity determination on the plurality of candidates of phonemes with respect to the reference strings of the words stored in said dictionary; and
  
  f) selecting a center of a given number of frames of a continued time-series of discrete characteristic multidimensional vectors derived from said step b) so that the center of the frames is received in said step c).

11. A speaker independent apparatus for word recognition comprising:
- a) input means for inputting an utterance by an unspecified person into an electrical signal;
  
  b) characteristic extracting means for receiving the electrical signal from said input means and converting the electrical signal into a time series of discrete characteristic multidimensional vectors;
  
  c) phoneme recognition means for selectively receiving the time series of discrete characteristic multidimensional vectors and converting each of said selectively received vectors into a plurality of candidates of phonemes from a first order to an n-th order (n denotes an arbitrary integer exceeding 1), said phoneme recognition means including selecting means for selecting a center of a given number of frames of a continued time-series of discrete characteristic multidimensional vectors derived from said characteristic extracting means so as to receive the center thereof;
  
  d) storing means for storing a plurality of reference data of words as a dictionary;
  
  e) word recognition means for receiving a time series string of phonemes from the phoneme recognition means and comparing the plurality of candidates of phonemes, one at a time, with each phoneme of a reference string of phonemes for a last word of the words stored in the dictionary and determining a time series of phonemes derived from said phoneme recognition means having a highest similarity to one of the reference strings of the phonemes for the words stored in the dictionary using a special word matching technique; and
  
  f) output means for outputting at least one of said candidates of phonemes as the result of word recognition carried out by the word recognition means on the basis of a similarity determination on the plurality of candidates of phonemes with respect to the reference strings of the words stored in the dictionary,
  
  wherein said word recognition means determines a time series of strings of phonemes derived from said phoneme recognition means having a highest similarity to the time series reference string of phonemes for each word depending upon a distance from one of the plurality of candidates of the phonemes in a two-dimensional coordinate matrix,
  
  wherein said output means outputs a plurality of word candidates as the result of word recognition according to a derived distance, a first order of the plurality of word candidates being the word having the reference string of phonemes with a smallest distance to the candidates of the time series of phonemes derived from said phoneme recognition means,
  
  and wherein said special word matching technique being such that a number of branches S{I}{J} is determined under such a condition that (1) two branches are not intersected across each other and (2) only one branch can be drawn from each of phonemes when one of the plurality of the phonemes derived from said phoneme recognition means is compared with one of the reference strings of phonemes for the respective words in such a way as:
  
  if A{i} = B{j}, S{i}{j} = S{i-1}{j-1} + 1, else S{i}{j} = max(S{i-1}{j}, S{i}{j-1}),
  
  wherein max denotes the larger of S{i-1}{j} and S{i}{j-1}, A{i} denotes either of the strings of phonemes, A{i} (i = 1 to I), derived from the phoneme recognition means or retrieved from the dictionary, and B{j} denotes the other string of phonemes to be compared, B{j} (j = 1 to J), and a maximum number of branches N is derived from among S{I}{J} and a magnitude of similarity is derived as N/LA + N/LB, wherein LA denotes a length of the string of phonemes A{i} (i = 1 to I) and LB denotes a length of the string of phonemes B{j} (j = 1 to J), the magnitude of similarity being derived until a final order of candidates from the phoneme recognition means is compared with a final reference string of the word stored in a memory area of the dictionary.
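
The branch-count matching recited in claims 9 and 11 has the same form as a longest-common-subsequence table: non-crossing branches with at most one branch per phoneme. A minimal Python sketch under that reading follows. The function names `branch_count` and `similarity` are hypothetical, and the sketch compares a single phoneme string against a single reference; how the several candidate orders are combined is not modeled here.

```python
from typing import Sequence


def branch_count(a: Sequence[str], b: Sequence[str]) -> int:
    """Maximum number of non-crossing branches N = S{I}{J}, computed with
    S{i}{j} = S{i-1}{j-1} + 1 if A{i} = B{j},
    else max(S{i-1}{j}, S{i}{j-1}).
    Row 0 and column 0 are zero (the empty-prefix case, an assumption)."""
    s = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])
    return s[len(a)][len(b)]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Magnitude of similarity N/LA + N/LB, where LA and LB are the lengths
    of the two phoneme strings. Returns 0.0 if either string is empty
    (an assumption; the claims do not address empty strings)."""
    if not a or not b:
        return 0.0
    n = branch_count(a, b)
    return n / len(a) + n / len(b)
```

Under this formula two identical strings score 2.0, and strings with no phoneme in common score 0.0, so a larger value indicates a closer match.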

Current Assignee: Adaptive Solutions, Kabushiki Kaisha Meidensha (Meidensha Corporation)
Original Assignee: Adaptive Solutions, Kabushiki Kaisha Meidensha (Meidensha Corporation)
Inventors: Matsuura, Yoshihiro; Skinner, Toby
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): Dorvil, Richemond
Application Number: US08/089,825
Time in Patent Office: 1,072 Days
Field of Search: 395/2, 395/2.6, 395/2.4, 395/2.41, 395/2.3, 395/2.61, 395/2.63, 395/2.5, 381/43
US Class Current: 704/232
CPC Class Codes:
G10L 15/12   using dynamic programming t...
G10L 15/16   using artificial neural net...
G10L 2015/025   Phonemes, fenemes or fenone...
