Method of speech recognition
First Claim
1. A method of speech recognition, comprising the steps of:
- generating "m" feature parameters every frame from reference speech which is spoken by at least one speaker and which represents recognition-object words, where "m" denotes a preset integer;
previously generating "n" types of standard patterns of a set of preset phonemes on the basis of speech data of a plurality of speakers, where "n" denotes a preset integer;
executing a matching between the feature parameters of the reference speech and each of the standard patterns, and generating a vector of "n" reference similarities between the feature parameters of the reference speech and each of the standard patterns every frame;
generating temporal sequences of the reference similarity vectors of respective frames, the reference similarity vector sequences corresponding to the recognition-object words respectively;
previously registering the reference similarity vector sequences as dictionary similarity vector sequences;
analyzing input speech to be recognized, and generating "m" feature parameters from the input speech;
executing a matching between the feature parameters of the input speech and the standard patterns, and generating a vector of "n" input-speech similarities between the feature parameters of the input speech and the standard patterns every frame;
generating a temporal sequence of the input-speech similarity vectors of respective frames;
collating the input-speech similarity vector sequence with the dictionary similarity-vector sequences; and
recognizing the input speech based on a result of the collating step.
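A minimal sketch of the per-frame matching step recited above, in Python. The similarity measure (exponential of negative squared Euclidean distance) and the plain-list data shapes are illustrative assumptions; the claim does not fix either.

```python
import math

def similarity_vector(frame, standard_patterns):
    # One m-dimensional feature frame matched against n phoneme standard
    # patterns (each an m-dimensional reference vector) yields an
    # n-dimensional similarity vector for that frame.
    sims = []
    for pattern in standard_patterns:
        d2 = sum((f - p) ** 2 for f, p in zip(frame, pattern))
        sims.append(math.exp(-d2))  # 1.0 for a perfect match
    return sims

def similarity_vector_sequence(frames, standard_patterns):
    # One similarity vector per frame gives the temporal sequence that
    # the claim registers as a dictionary similarity vector sequence.
    return [similarity_vector(frame, standard_patterns) for frame in frames]
```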
Abstract
A set of "m" feature parameters is generated every frame from reference speech which is spoken by at least one speaker and which represents recognition-object words, where "m" denotes a preset integer. A set of "n" types of standard patterns is previously generated on the basis of speech data of a plurality of speakers, where "n" denotes a preset integer. Matching between the feature parameters of the reference speech and each of the standard patterns is executed to generate a vector of "n" reference similarities between the feature parameters of the reference speech and each of the standard patterns every frame. The reference similarity vectors of respective frames are arranged into temporal sequences corresponding to the recognition-object words respectively. The reference similarity vector sequences are previously registered as dictionary similarity vector sequences. Input speech to be recognized is analyzed to generate "m" feature parameters from the input speech. Matching between the feature parameters of the input speech and the standard patterns is executed to generate a vector of "n" input-speech similarities between the feature parameters of the input speech and the standard patterns every frame. The input-speech similarity vectors of respective frames are arranged into a temporal sequence. The input-speech similarity vector sequence is collated with the dictionary similarity vector sequences to recognize the input speech.
Claims (35)
1. (Independent claim; reproduced in full above under "First Claim".) Dependent claims 2-23 refer back to claim 1.
24. A method of speech recognition, comprising the steps of:
previously setting a set of words in consideration of phonetic environments;
making at least one speaker speak the word set, and obtaining "m" feature parameters therefrom every frame;
previously generating "n" types of standard patterns of a set of preset phonemes from speech data generated by many speakers;
executing a matching between the feature parameters and each of the standard patterns to obtain a vector of "n" similarities every frame;
generating a temporal sequence pattern from the similarity vectors;
extracting speech fragments from the temporal sequence pattern, and registering the speech fragments as a speech fragment dictionary;
generating a connection sequence of the speech fragments or a temporal sequence pattern of similarity vectors for each of speech-recognition object words, wherein the temporal sequence pattern of the similarity vectors is made by combining the speech fragments in the speech fragment dictionary;
storing the connection sequence of the speech fragments or the temporal sequence pattern of the similarity vectors into a recognition-object dictionary for each of the recognition-object words;
analyzing input speech to obtain "m" feature parameters every frame;
executing a matching between the input-speech feature parameters and each of the standard patterns to obtain a temporal sequence of vectors of "n" similarities;
performing one of first and second collating steps, wherein said first collating step comprises collating the input-speech temporal similarity vector sequence with each of the temporal sequence patterns of the similarities which are registered in respective items of the recognition-object dictionary, and said second collating step comprises collating the input-speech temporal similarity vector sequence with each of the temporal sequence patterns of the similarities which are generated according to the connection sequences of the speech fragments; and
recognizing the input speech based on a result of the collating step as performed.
Dependent claims 25-33 refer back to claim 24.
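The speech-fragment mechanism of claim 24 (registering fragments and combining them into word patterns for the recognition-object dictionary) can be sketched as follows; the fragment names and the list-of-frames representation are illustrative assumptions.

```python
def word_pattern_from_fragments(connection_sequence, fragment_dictionary):
    # Build a word's similarity-vector temporal sequence pattern by
    # concatenating registered speech fragments along the time axis.
    # Each fragment is a list of n-dimensional similarity vectors.
    pattern = []
    for fragment_name in connection_sequence:
        pattern.extend(fragment_dictionary[fragment_name])
    return pattern
```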
34. A method of speech recognition, comprising the steps of:
extracting feature parameters from input speech representing one of preset words;
calculating a set of input-speech similarities between the input-speech feature parameters and standard patterns of a set of preset phonemes;
collating the set of the input-speech similarities with sets of predetermined reference similarities which correspond to the preset words respectively; and
recognizing the input speech based on a result of said collating step.
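Claim 34 drops the temporal alignment and collates a single set of similarities per word. A sketch under the assumption that the collation measure is Euclidean distance, which the claim leaves open:

```python
import math

def collate(input_similarities, reference_sets):
    # Compare one set of n input-speech similarities with each preset
    # word's predetermined reference set; the nearest word wins.
    best_word, best_dist = None, float("inf")
    for word, reference in reference_sets.items():
        d = math.dist(input_similarities, reference)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word
```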
-
35. A method of speech recognition, comprising the steps of:
extracting feature parameters from input speech representing one of preset words;
calculating a set of input-speech similarities between the input-speech feature parameters and standard patterns of a set of preset phonemes;
calculating a set of time-domain variations in the input-speech similarities;
collating the set of the input-speech similarities with sets of predetermined reference similarities which correspond to the preset words respectively;
collating the set of the time-domain variations in the input-speech similarities with sets of predetermined reference time-domain variations which correspond to the preset words respectively; and
recognizing the input speech based on results of said two collating steps.
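The "time-domain variations" of claim 35 can be read as frame-to-frame first differences of the similarity vectors; a sketch under that assumption:

```python
def time_domain_variations(similarity_sequence):
    # First difference of consecutive n-dimensional similarity vectors;
    # for T frames this yields T-1 variation vectors.
    return [[b - a for a, b in zip(prev, curr)]
            for prev, curr in zip(similarity_sequence, similarity_sequence[1:])]
```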
Specification