Speaker independent speech recognition system and method using neural network and DTW matching technique
First Claim
1. A speaker independent apparatus for word recognition comprising:
a) input means for inputting an utterance by an unspecified person into an electrical signal;
b) characteristic extracting means for receiving the electrical signal from the input means and converting the electrical signal into a time series of discrete characteristic multidimensional vectors;
c) phoneme recognition means for selectively receiving the time series of discrete characteristic multidimensional vectors and converting each of said selectively received vectors into a plurality of candidates of phonemes from a first order to an n-th order (n denotes an arbitrary number);
d) word recognition means for receiving a time series string of phonemes from the phoneme recognition means and comparing the plurality of candidates of phonemes, one at a time, with each phoneme of a reference string of phonemes for words previously stored in a dictionary until a final phoneme of the reference string of phonemes for a last word of the words stored in the dictionary and determining a time series of phonemes derived from said phoneme recognition means having a highest similarity to one of the reference strings of the phonemes for the words stored in the dictionary using a predetermined word matching technique;
e) output means for outputting at least one of said candidates of phonemes as a result of word recognition carried out by the word recognition means on the basis of a similarity determination on the plurality of candidates of phonemes with respect to the reference strings of the words stored in said dictionary; and
f) selecting means, interposed between said characteristic extracting means and phoneme recognition means, for selecting a center of a given number of frames of a continued time-series of discrete characteristic multidimensional vectors derived from said characteristic extracting means so that said phoneme recognition means receives the center thereof.
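Element f) can be illustrated with a minimal sketch. It assumes one possible reading of the claim: the vector stream is split into consecutive, non-overlapping groups of a given number of frames, and only the middle frame of each group is forwarded to phoneme recognition. The names `select_center_frames` and `group_size`, and the choice to drop a trailing incomplete group, are hypothetical and do not come from the patent.

```python
from typing import List, Sequence

Vector = Sequence[float]


def select_center_frames(frames: List[Vector], group_size: int) -> List[Vector]:
    """Return the middle frame of each consecutive group of `group_size` frames.

    Assumption: groups do not overlap, and a trailing group with fewer
    than `group_size` frames is dropped.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    centers: List[Vector] = []
    # Step through the stream one whole group at a time.
    for start in range(0, len(frames) - group_size + 1, group_size):
        centers.append(frames[start + group_size // 2])
    return centers


if __name__ == "__main__":
    stream = [[float(t)] for t in range(7)]  # seven one-dimensional frames
    print(select_center_frames(stream, 3))
```

A sliding window that advances one frame at a time would be an equally plausible reading; only the `range` step would change.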
Abstract
An improved speaker independent speech recognition system and method are disclosed. An utterance by an unspecified person is input through a device such as a telephone and converted into an electrical signal, and the electrical signal is converted into a time series of characteristic multidimensional vectors. The time series of vectors is received and each vector is converted into a plurality of candidates of phonemes, so that the phonemes constitute a plurality of strings of phonemes in time series as a plurality of candidates. The plurality of candidates of phonemes are compared simultaneously (one at a time) with a reference pattern of a reference string of phonemes for each word previously stored in a dictionary, using a predetermined word matching technique, to determine which string of phonemes derived from the phoneme recognition means has the highest similarity to one of the reference strings of phonemes for the respective words stored in the dictionary. At least one word candidate, based on the string of phonemes having the highest similarity to the corresponding reference string, is output as the result of speech recognition.
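The conversion of each vector into first-order through n-th-order phoneme candidates, and the resulting n candidate strings, might look like the sketch below. It assumes the phoneme recognition stage yields one score per phoneme label for every frame; the score dictionaries and the function names `rank_candidates` and `candidate_strings` are hypothetical and are not taken from the patent.

```python
from typing import Dict, List


def rank_candidates(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n best-scoring phoneme labels, first-order candidate first."""
    if n > len(scores):
        raise ValueError("fewer phoneme labels than requested candidates")
    return sorted(scores, key=scores.get, reverse=True)[:n]


def candidate_strings(frame_scores: List[Dict[str, float]], n: int) -> List[List[str]]:
    """Build n phoneme strings in time series.

    String k holds the (k+1)-th order candidate of every frame.
    """
    ranked = [rank_candidates(scores, n) for scores in frame_scores]
    return [[frame[k] for frame in ranked] for k in range(n)]


if __name__ == "__main__":
    # Made-up scores for two frames, for illustration only.
    frames = [
        {"a": 0.7, "o": 0.2, "k": 0.1},
        {"k": 0.6, "t": 0.3, "a": 0.1},
    ]
    print(candidate_strings(frames, 2))
```

Each returned string can then be compared against the reference strings in the dictionary.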
11 Claims
1. A speaker independent apparatus for word recognition comprising:
a) input means for inputting an utterance by an unspecified person into an electrical signal;
b) characteristic extracting means for receiving the electrical signal from the input means and converting the electrical signal into a time series of discrete characteristic multidimensional vectors;
c) phoneme recognition means for selectively receiving the time series of discrete characteristic multidimensional vectors and converting each of said selectively received vectors into a plurality of candidates of phonemes from a first order to an n-th order (n denotes an arbitrary number);
d) word recognition means for receiving a time series string of phonemes from the phoneme recognition means and comparing the plurality of candidates of phonemes, one at a time, with each phoneme of a reference string of phonemes for words previously stored in a dictionary until a final phoneme of the reference string of phonemes for a last word of the words stored in the dictionary and determining a time series of phonemes derived from said phoneme recognition means having a highest similarity to one of the reference strings of the phonemes for the words stored in the dictionary using a predetermined word matching technique;
e) output means for outputting at least one of said candidates of phonemes as a result of word recognition carried out by the word recognition means on the basis of a similarity determination on the plurality of candidates of phonemes with respect to the reference strings of the words stored in said dictionary; and
f) selecting means, interposed between said characteristic extracting means and phoneme recognition means, for selecting a center of a given number of frames of a continued time-series of discrete characteristic multidimensional vectors derived from said characteristic extracting means so that said phoneme recognition means receives the center thereof.
Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)

10. A method of speaker independent speech recognition comprising the steps of:
a) inputting an utterance by an unspecified person into an electrical signal;
b) receiving the electrical signal from the input means and converting the electrical signal into a time series of discrete characteristic multidimensional vectors;
c) receiving the time series of discrete characteristic multidimensional vectors and converting each of said vectors into a plurality of candidates of phonemes from a first order to an n-th order (n denotes an arbitrary number);
d) receiving a time series strings of phonemes derived at said step c) and comparing the plurality of candidates of phonemes, one at a time, with each phoneme of a reference string of phonemes for each word previously stored in a dictionary until a final phoneme of the reference string of phonemes for a last word of the words stored in the dictionary and determining which time series of phonemes derived at said step c) has totally highest similarity to one of the reference strings of the phonemes for the respective words stored in the dictionary using a predetermined word matching technique;
e) outputting at least one word candidate as a result of word recognition carried out at said step d) on the basis of the similarity determination on the plurality of candidates of phonemes with respect to the reference strings of the words stored in said dictionary; and
f) selecting a center of a given number of frames of a continued time-series of discrete characteristic multidimensional vectors derived from said step b) so that the center of the frames is received in said step c).

11. A speaker independent apparatus for word recognition comprising:
a) input means for inputting an utterance by an unspecified person into an electrical signal;
b) characteristic extracting means for receiving the electrical signal from said input means and converting the electrical signal into a time series of discrete characteristic multidimensional vectors;
c) phoneme recognition means for selectively receiving the time series of discrete characteristic multidimensional vectors and converting each of said selectively received vectors into a plurality of candidates of phonemes from a first order to an n-th order (n denotes an arbitrary integer exceeding 1), said phoneme recognition means including selecting means for selecting a center of a given number of frames of a continued time-series of discrete characteristic multidimensional vectors derived from said characteristic extracting means so as to receive the center thereof;
d) storing means for storing a plurality of reference data of words as a dictionary;
e) word recognition means for receiving a time series string of phonemes from the phoneme recognition means and comparing the plurality of candidates of phonemes, one at a time, with each phoneme of a reference string of phonemes for a last word of the words stored in the dictionary and determining a time series of phonemes derived from said phoneme recognition means having a highest similarity to one of the reference strings of the phonemes for the words stored in the dictionary using a special word matching technique; and
f) output means for outputting at least one of said candidates of phonemes as the result of word recognition carried out by the word recognition means on the basis of a similarity determination on the plurality of candidates of phonemes with respect to the reference strings of the words stored in the dictionary,
wherein said word recognition means determines a time series of strings of phonemes derived from said phoneme recognition means having a highest similarity to the time series reference string of phonemes for each word depending upon a distance from one of the plurality of candidates of the phonemes in a two-dimensional coordinate matrix,
wherein said output means outputs a plurality of word candidates as the result of word recognition according to a derived distance, a first order of the plurality of word candidates being the word having the reference string of phonemes with a smallest distance to the candidates of the time series of phonemes derived from said phoneme recognition means, and
wherein said special word matching technique being such that a number of branches S{I}{J} is determined under such a condition that (1) two branches are not intersected across each other and (2) only one branch can be drawn from each of phonemes when one of the plurality of the phonemes derived from said phoneme recognition means is compared with one of the reference strings of phonemes for the respective words in such a way as:
if A{i}=B{j}, S{i}{j}=S{i-1}{j-1}+1, else S{i}{j}=max(S{i-1}{j}, S{i}{j-1}),
wherein max denotes either larger one of S{i-1}{j} or S{i}{j-1}, A{i} denotes either of the strings of phonemes A{i}; i=1 to I derived from the phoneme recognition means or retrieved from the dictionary and B{j} denotes the other string of phonemes to be compared B{j}; j=1 to J, and a maximum number of branches N is derived from among S{I}{J} and a magnitude of similarity is derived as N/LA+N/LB, wherein LA denotes a length of the string of phonemes A{i}; i=1 to I, and LB denotes a length of the string of phonemes B{j}; j=1 to J, the magnitude of similarity being derived until a final order of candidates from the phoneme recognition means is compared with a final reference string of the word stored in a memory area of the dictionary.
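The special word matching technique recited in claim 11 reads like a longest-common-subsequence recurrence: a branch joins two equal phonemes, branches may not cross, and each phoneme takes at most one branch. The sketch below is written under that reading and is not the patentee's implementation. The function names, the sample dictionary, and ranking by highest similarity (the claim also speaks of a smallest distance) are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def max_branches(a: Sequence[str], b: Sequence[str]) -> int:
    """Fill the table S and return N = S[I][J], the maximum number of branches."""
    rows, cols = len(a), len(b)
    # Row 0 and column 0 are zero: an empty string has no branches.
    s = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])
    return s[rows][cols]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Magnitude of similarity N/LA + N/LB; defined as 0.0 here if either string is empty."""
    if not a or not b:
        return 0.0
    n = max_branches(a, b)
    return n / len(a) + n / len(b)


def rank_words(
    candidates: List[Sequence[str]], dictionary: Dict[str, Sequence[str]]
) -> List[Tuple[str, float]]:
    """Score every dictionary word by its best-matching candidate string."""
    best = {
        word: max((similarity(c, ref) for c in candidates), default=0.0)
        for word, ref in dictionary.items()
    }
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    words = {"sato": list("sato"), "kato": list("kato")}  # hypothetical dictionary
    print(rank_words([list("kato"), list("oato")], words))
```

The similarity ranges from 0 (no shared phonemes) to 2 (identical strings), so dividing by both lengths penalizes a short reference that happens to be contained in a long candidate string.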
Specification