Speech recognition method
Abstract
The present invention relates to an automated dialing method for mobile telephones. According to the method, a user enters a telephone number via the keypad of the mobile phone and then speaks a corresponding codeword into the handset. The voice signal is encoded using the CODEC and vocoder already on board the mobile phone. The speech is divided into frames, and each frame is analyzed to ascertain its primary spectral features. These features are stored in memory in association with the numeric keypad sequence. In recognition mode, the user speaks the codeword into the handset, and the speech is analyzed in the same way as in training mode. The primary spectral features are compared with those stored in memory. When a match is declared according to preset criteria, the mobile phone automatically dials the telephone number. Time warping techniques may be applied in the analysis to reduce timing variations.
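The train, compare and dial flow summarized above can be sketched roughly as follows. This is an illustrative sketch only: the class name `VoiceDialer`, the placeholder distance `mean_abs_distance`, and the threshold value are assumptions, and the lower-is-better score mirrors the claim wording that a match is accepted when the similarity value does not exceed a threshold.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceDialer:
    """Toy registry that maps stored feature sequences to telephone numbers."""
    threshold: float                              # hypothetical acceptance threshold
    entries: list = field(default_factory=list)   # (number, features) pairs

    def train(self, number, features):
        # Training mode: store the features alongside the keyed-in number.
        self.entries.append((number, features))

    def recognize(self, features, distance):
        # Recognition mode: pick the stored entry with the smallest distance.
        if not self.entries:
            return None
        score, number = min(
            (distance(features, ref), num) for num, ref in self.entries
        )
        # Accept (and "dial") only if the best score does not exceed the threshold.
        return number if score <= self.threshold else None


def mean_abs_distance(a, b):
    # Placeholder distance for equal-length sequences of scalar features.
    return sum(abs(x - y) for x, y in zip(a, b)) / max(len(a), 1)
```

In a real handset the features would come from the vocoder output and the distance would be a time-warped comparison; here both are stand-ins so that the control flow is visible.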
31 Citations
27 Claims
-
1. In a telephone modulating an input speech and having a built-in vocoder for encoding a modulated speech signal, a speech recognition method comprising:
-
a training step of, if a user enters a telephone number and a speech corresponding to said telephone number, performing the encoding at said vocoder, detecting only a speech section using information output as a result of the encoding, and extracting and storing a feature of the detected speech section;
a recognition step of, if an input speech is received, performing encoding at said vocoder, detecting only a speech section using information output as a result of the encoding, extracting a feature of the detected speech section, comparing the extracted feature with features of registered words stored during said training step, and selecting a registered word having a feature most similar to that of the input speech; and
a step of determining a result of the recognition to be right if a similarity of the registered word selected at said recognition step does not exceed a predetermined threshold and automatically dialing a telephone number corresponding to the recognized word, wherein said recognition step comprises extracting LSP parameters that have been encoded at said vocoder and transforming the extracted LSP parameters into pseudo-cepstrums.
-
2. The speech recognition method according to claim 1, wherein said training step comprises:
-
a first step of, if the user enters the telephone number and the speech corresponding to said telephone number, modulating the input speech to provide an output to said vocoder, dividing the speech signal into frames, and performing the encoding by the frame;
a second step of detecting only the actually voiced speech section from the input signal, using codebook gain as energy information, said codebook gain being output as the result of the encoding at said first step;
a third step of, if the speech section is detected at said second step, storing spectrum coefficients of the frames corresponding to the speech section as features, said coefficients being output as the result of the encoding; and
a fourth step of, if there is another telephone number to be entered, turning to said first step to repeat said steps.
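The second step above, detecting the voiced section from the per-frame codebook gain, might look like the following sketch. The function name, the fixed threshold, and the minimum-run rule `min_frames` are assumptions introduced for illustration; the claim only states that codebook gain serves as the energy information.

```python
def detect_speech_section(gains, threshold, min_frames=3):
    """Return (start, end) frame indices, end exclusive, of the first run of
    frames whose gain is at or above `threshold` for at least `min_frames`
    consecutive frames, or None if no such run exists."""
    start = None
    for i, g in enumerate(gains):
        if g >= threshold:
            if start is None:
                start = i          # a candidate speech run begins here
        else:
            if start is not None and i - start >= min_frames:
                return (start, i)  # run was long enough: speech section found
            start = None           # run too short: treat it as noise
    # Handle a run that lasts until the final frame.
    if start is not None and len(gains) - start >= min_frames:
        return (start, len(gains))
    return None
```

Only the spectrum coefficients of frames inside the returned range would then be stored as features, as the third step describes.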
-
-
3. The speech recognition method according to claim 2, wherein, in said third step, a line spectrum pair (LSP) coefficient output from said vocoder is used as the feature.
-
4. The speech recognition method according to claim 2, wherein said third step comprises the step of storing all encoded data of the frames corresponding to the speech section so that a result of the recognition can be reported by voice.
-
5. The speech recognition method according to claim 1, wherein said pseudo-cepstrum transforming step is defined as the following formula:
-
(formula not reproduced in this text)
where M is the LSP order.
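Since the claimed formula is not reproduced here, the sketch below uses one commonly cited approximation of cepstral coefficients from LSP frequencies, c_n = (1/n) Σ_{i=1..M} cos(n·ω_i), purely as an assumption. The patent's own formula may differ in scaling or include additional terms; the function name and the radian convention for the LSP values are likewise hypothetical.

```python
import math


def lsp_to_pseudo_cepstrum(lsp, n_coeffs):
    """Approximate cepstral coefficients from M line spectral frequencies.

    lsp      -- list of M LSP frequencies in radians (assumed range 0..pi)
    n_coeffs -- number of pseudo-cepstral coefficients to produce
    """
    cep = []
    for n in range(1, n_coeffs + 1):
        # Assumed form: c_n = (1/n) * sum_i cos(n * w_i)
        total = sum(math.cos(n * w) for w in lsp)
        cep.append(total / n)
    return cep
```

The appeal of such a transform is that it needs only the LSP values the vocoder has already computed, so no separate spectral analysis of the raw speech is required.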
-
-
6. The speech recognition method according to claim 1, wherein said recognition step comprises:
-
a first step of, if the user enters a destination to be called with voice, modulating the input speech to provide an output to said vocoder, dividing the speech signal into frames, and performing the encoding by the frame;
a second step of detecting only the actually voiced speech section from the input signal, using codebook gain as energy information, said codebook gain being output as the result of the encoding at said first step; and
a third step of, if the speech section is detected at said second step, extracting as features spectrum coefficients of frames corresponding to the speech section output as the result of the encoding, comparing the extracted features with the features of the registered words stored during said training step, and selecting the registered word having the feature most similar to that of the input speech.
-
-
7. The speech recognition method according to claim 6, wherein, in said third step, dynamic time warping (DTW) is used in comparing spectrum coefficients extracted from the input speech with spectrum coefficients of each word registered during said training step.
-
8. The speech recognition method according to claim 7, wherein said dynamic time warping comprises the steps of:
forming a two-dimensional quadrature coordinate plane having M×N trellis points (M is the number of frames of the input speech and N is the number of frames of a registered word) in order to match two sequences of feature sets of the input speech and the stored registered word;
respectively drawing slant lines having a slope of 1 starting from a start trellis point (1, 1) and an end trellis point (M, N) on said two-dimensional quadrature coordinate plane and horizontally moving the two slant lines by a predetermined value (N/2^n, wherein N is the number of frames and n is a natural number) to establish a search section for matching;
calculating a distance between two features at each trellis point in a row within said search section and selecting a path through which a minimum distance between the two features is implemented;
repeating said minimum path selection step with respect to all the rows within said search section; and
dividing a minimum cumulative distance at said end trellis point (M, N) by a sum (M+N) of the lengths of the two sequences to calculate a final matching score.
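The steps of claim 8 can be illustrated with a band-limited DTW sketch. Several details are assumptions rather than claim text: the band is taken as the region between the two slope-1 lines widened by `band` cells on each side, the local distance is a sum of absolute differences, and the path weights (1, 2, 1) are the usual symmetric choice, picked because it agrees with normalizing by M+N.

```python
import math


def local_distance(a, b):
    # Sum of absolute differences over the feature orders (assumed form).
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_score(test, ref, band):
    """Band-limited DTW; returns cumulative distance at (M, N) over (M + N)."""
    M, N = len(test), len(ref)
    if M == 0 or N == 0:
        raise ValueError("both sequences must be non-empty")
    # Allowed offsets n - m lie between the two slope-1 lines, widened by `band`.
    lo, hi = min(0, N - M) - band, max(0, N - M) + band
    INF = math.inf
    D = [[INF] * N for _ in range(M)]
    for m in range(M):
        for n in range(max(0, m + lo), min(N, m + hi + 1)):
            d = local_distance(test[m], ref[n])
            if m == 0 and n == 0:
                D[m][n] = 2 * d                    # start point counts twice
                continue
            best = INF
            if m > 0 and n > 0:
                best = min(best, D[m - 1][n - 1] + 2 * d)   # diagonal move
            if m > 0:
                best = min(best, D[m - 1][n] + d)           # move along m
            if n > 0:
                best = min(best, D[m][n - 1] + d)           # move along n
            D[m][n] = best
    return D[M - 1][N - 1] / (M + N)
```

Cells outside the band stay at infinity and are never expanded, which is where the saving over a full M×N search comes from.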
-
9. The speech recognition method according to claim 8, wherein said distance between the two features at each trellis point is calculated such that differences of values corresponding to respective orders of the two features are all summed up and defined as the following equations:
-
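Initial state: $D_{1,1} = 2d_{1,1}$,
where $d_{m,n}$ is the distance between the two features at the trellis point (m, n) and $D_{m,n}$ is the minimum cumulative distance at the trellis point (m, n).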
-
-
10. The speech recognition method according to claim 9, wherein a value of the minimum cumulative distance at each trellis point (m, n) is substituted with a maximum integer value if the minimum cumulative distance value goes beyond a range of the integer.
-
11. The speech recognition method according to claim 10, wherein said trellis point (m, n) in each row within said search section has the minimum cumulative distance value of mth and nth features of the two sequences of a test pattern and reference pattern.
-
12. The speech recognition method according to claim 11, wherein a new path value of said trellis point (m, n) in each row within said search section is repeatedly generated by way of at least one function of a distance value directly shifting from a previous trellis point (m−1, n−1) to the present trellis point (m, n) and distance values indirectly shifting from two neighboring trellis points (m−1, n) and (m, n−1) to the present trellis point (m, n).
-
13. The speech recognition method according to claim 12, wherein a minimum cumulative distance value in the immediately preceding row is stored to obtain a minimum cumulative distance value in the present row.
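Claims 10 and 13 together suggest an integer implementation that saturates on overflow and keeps only the previous row in memory. A minimal sketch under stated assumptions follows: `INT_MAX` is taken to be a 16-bit limit (the source gives no word size), the path weights (1, 2, 1) are assumed, and the search band is omitted for brevity.

```python
INT_MAX = 32767  # assumed 16-bit accumulator limit; not stated in the source


def sat_add(a, b):
    # Saturating addition: clamp to INT_MAX instead of overflowing.
    return min(a + b, INT_MAX)


def dtw_two_rows(test, ref):
    """Cumulative DTW distance using only the previous and current rows."""
    M, N = len(test), len(ref)
    if M == 0 or N == 0:
        raise ValueError("both sequences must be non-empty")
    prev = [INT_MAX] * N
    for m in range(M):
        cur = [INT_MAX] * N
        for n in range(N):
            d = sum(abs(x - y) for x, y in zip(test[m], ref[n]))
            if m == 0 and n == 0:
                cur[n] = sat_add(0, 2 * d)
                continue
            best = INT_MAX
            if m > 0 and n > 0:
                best = min(best, sat_add(prev[n - 1], 2 * d))  # from (m-1, n-1)
            if m > 0:
                best = min(best, sat_add(prev[n], d))          # from (m-1, n)
            if n > 0:
                best = min(best, sat_add(cur[n - 1], d))       # from (m, n-1)
            cur[n] = best
        prev = cur  # the finished row becomes the "previous row" for the next m
    return prev[N - 1]
```

Memory use is two rows of N integers rather than an M×N table, which matters on a handset with little RAM.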
-
14. The speech recognition method according to claim 1, wherein the recognition step further comprises applying a different pre-selection process to reduce a number of candidate codewords in the recognition step.
-
15. The speech recognition method according to claim 14, wherein said pre-selection step comprises the step of performing dynamic time warping (DTW) using only a part of spectrum information extracted from each frame to select a predetermined number of registered words having relatively high similarities and subsequently performing the DTW with respect to the selected registered words to finally select a registered word having the highest similarity to the input speech.
-
16. The speech recognition method according to claim 15, wherein said pre-selection step comprises the step of decreasing orders of the spectrum coefficient extracted from each frame and performing the DTW to select the predetermined number of registered words having relatively high similarities.
-
17. The speech recognition method according to claim 15, wherein said pre-selection step comprises the step of sub-sampling the frames to reduce the number of frames and performing the DTW to select the predetermined number of registered words having relatively high similarities.
-
18. The speech recognition method according to claim 15, wherein said pre-selection step comprises the step of decreasing orders of the spectrum coefficient extracted from each frame, sub-sampling the frames, and performing the DTW to select the predetermined number of registered words having relatively high similarities.
-
19. The speech recognition method according to claim 14, wherein said pre-selection step comprises the step of selecting a predetermined number of registered words having relatively high similarities using a linear matching method and subsequently performing dynamic time warping with respect to the selected registered words to finally select a registered word having the highest similarity to the input speech.
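The two-stage search of claims 14 to 18 can be sketched as a coarse pass followed by a full pass over a shortlist. The names `coarse` and `recognize`, the default values of `order`, `step` and `keep`, and the plain symmetric DTW used for both passes are illustrative assumptions.

```python
def dtw(a, b):
    """Plain symmetric DTW score, normalised by len(a) + len(b)."""
    M, N = len(a), len(b)
    if M == 0 or N == 0:
        raise ValueError("both sequences must be non-empty")
    prev = [float("inf")] * N
    for m in range(M):
        cur = [float("inf")] * N
        for n in range(N):
            d = sum(abs(x - y) for x, y in zip(a[m], b[n]))
            if m == 0 and n == 0:
                cur[n] = 2 * d
                continue
            cands = []
            if m > 0 and n > 0:
                cands.append(prev[n - 1] + 2 * d)
            if m > 0:
                cands.append(prev[n] + d)
            if n > 0:
                cands.append(cur[n - 1] + d)
            cur[n] = min(cands)
        prev = cur
    return prev[-1] / (M + N)


def coarse(frames, order, step):
    # Keep every `step`-th frame and only the first `order` coefficients.
    return [f[:order] for f in frames[::step]]


def recognize(query, registry, order=2, step=2, keep=2):
    """Pre-select `keep` words with cheap DTW, then run full DTW on them."""
    q = coarse(query, order, step)
    ranked = sorted(
        registry, key=lambda w: dtw(q, coarse(registry[w], order, step))
    )
    shortlist = ranked[:keep]
    return min(shortlist, key=lambda w: dtw(query, registry[w]))
```

Setting `step=1` gives the order-reduction variant of claim 16, and setting `order` to the full feature length gives the sub-sampling variant of claim 17.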
-
20. In a telephone modulating an input speech and having a built-in vocoder for encoding a modulated speech signal, a speech recognition method comprising:
-
a training step of, if a user enters a telephone number and a speech corresponding to said telephone number, performing the encoding at said vocoder, detecting only a speech section using information output as a result of the encoding, and extracting and storing a feature of the detected speech section;
a recognition step of, if an input speech is received, performing encoding at said vocoder, detecting only a speech section using information output as a result of the encoding, extracting a feature of the detected speech section, comparing the extracted feature with features of registered words stored during said training step, and selecting a registered word having a feature most similar to that of the input speech; and
a step of determining a result of the recognition to be right if a similarity of the registered word selected at said recognition step does not exceed a predetermined threshold and automatically dialing a telephone number corresponding to the recognized word, wherein said recognition step comprises extracting stored representations of audio signals encoded by the vocoder and transforming said stored representation of audio signals into pseudo-cepstrums.
a first step of, if the user enters a destination to be called with voice, modulating the input speech to provide an output to said vocoder, dividing the speech signal into frames, and performing the encoding by the frame;
a second step of detecting only the actually voiced speech section from the input signal, using codebook gain as energy information, said codebook gain being output as the result of the encoding at said first step; and
a third step of, if the speech section is detected at said second step, extracting as features spectrum coefficients of frames corresponding to the speech section output as the result of the encoding, comparing the extracted features with the features of the registered words stored during said training step, and selecting the registered word having the feature most similar to that of the input speech, wherein said third step comprises a different pre-selection step to reduce a number of registered words for the comparison prior to selection of the registered word having the feature most similar to that of the input speech.
-
-
24. In a telephone modulating an input speech and having a built-in vocoder for encoding a modulated speech signal, a speech recognition method comprising:
-
a training step of, if a user enters a telephone number and a speech corresponding to said telephone number, performing the encoding at said vocoder, detecting only a speech section using information output as a result of the encoding, and extracting and storing a feature of the detected speech section;
a recognition step of, if an input speech is received, performing encoding at said vocoder, detecting only a speech section using information output as a result of the encoding, extracting a feature of the detected speech section, comparing the extracted feature with features of registered words stored during said training step, and selecting a registered word having a feature most similar to that of the input speech; and
a step of determining a result of the recognition to be right if a similarity of the registered word selected at said recognition step does not exceed a predetermined threshold and automatically dialing a telephone number corresponding to the recognized word, wherein the recognition step further comprises applying a different comparing and selecting criteria in a separate pre-selection process prior to selection of the registered word having the feature most similar to that of the input speech.
-
Specification