Speech coding apparatus having speaker dependent prototypes generated from nonuser reference data
First Claim
1. A speech coding apparatus comprising:
means for measuring the value of at least one feature of an utterance during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values;
means for storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value, each prototype vector signal having a unique identification value;
means for comparing the closeness of the feature value of a feature vector signal to the parameter values of the prototype vector signals to obtain prototype match scores for the feature vector signal and each prototype vector signal; and
means for outputting at least the identification value of the prototype vector signal having the best prototype match score as a coded representation signal of the feature vector signal;
characterized in that the apparatus further comprises:
means for storing a plurality of reference feature vector signals, each reference feature vector signal representing the value of at least one feature of one or more utterances of one or more speakers in a reference set of speakers during each of a plurality of successive time intervals;
means for storing a plurality of measured training feature vector signals, each measured training feature vector signal representing the value of at least one feature of one or more utterances of a speaker not in the reference set during each of a plurality of successive time intervals;
means for transforming at least one reference feature vector signal into a synthesized training feature vector signal; and
means for generating the prototype vector signals from both the measured training vector signals and from the synthesized training vector signal.
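The measuring, comparing and outputting elements of the claim describe labelling each feature vector with the identification value of its closest prototype. A minimal sketch of that step follows, assuming Euclidean distance as the closeness measure (the claim does not name one) and using hypothetical names (`code_utterance`, `P1` to `P3`) and made-up values purely for illustration.

```python
import math


def code_utterance(feature_vectors, prototypes):
    """Return, for each feature vector, the ID of the closest prototype.

    prototypes maps an identification value to a parameter vector.
    Closeness is taken to be Euclidean distance (an assumption);
    the smallest distance counts as the best match score.
    """
    labels = []
    for fv in feature_vectors:
        best_id = min(prototypes, key=lambda pid: math.dist(fv, prototypes[pid]))
        labels.append(best_id)
    return labels


# Hypothetical prototypes and feature vectors, one vector per time interval.
prototypes = {"P1": (0.0, 0.0), "P2": (1.0, 1.0), "P3": (4.0, 0.0)}
features = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.3)]

labels = code_utterance(features, prototypes)
print(labels)  # ['P1', 'P2', 'P3']
```

The resulting list of identification values is the coded representation of the utterance in this sketch.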
Abstract
A speech coding apparatus and method for use in a speech recognition apparatus and method. The value of at least one feature of an utterance is measured during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values. A plurality of prototype vector signals, each having at least one parameter value and a unique identification value, are stored. The feature value of a feature vector signal is compared with the parameter values of the prototype vector signals to obtain prototype match scores for the feature vector signal and each prototype vector signal. The identification value of the prototype vector signal having the best prototype match score is output as a coded representation signal of the feature vector signal. Speaker-dependent prototype vector signals are generated from both synthesized training vector signals and measured training vector signals. The synthesized training vector signals are transformed reference feature vector signals representing the values of features of one or more utterances of one or more speakers in a reference set of speakers. The measured training feature vector signals represent the values of features of one or more utterances of a new speaker/user not in the reference set.
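The last part of the abstract, generating speaker-dependent prototypes from measured and synthesized training vectors, can be sketched as follows. The text here does not say how the transformation or the prototype estimation is done, so both are stand-ins: a single global mean shift moves reference-speaker vectors toward the new speaker, and each prototype is the mean of the pooled vectors for one class. The class labels `A` and `B`, the function names and the data are all hypothetical.

```python
from statistics import fmean


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(fmean(column) for column in zip(*vectors))


def generate_prototypes(measured_by_class, reference_by_class):
    """Build one prototype per class from measured plus synthesized vectors.

    Assumed transformation: shift every reference vector by the difference
    between the new speaker's overall mean and the reference overall mean.
    """
    all_measured = [v for vs in measured_by_class.values() for v in vs]
    all_reference = [v for vs in reference_by_class.values() for v in vs]
    new_mean = mean_vector(all_measured)
    ref_mean = mean_vector(all_reference)
    offset = tuple(n - r for n, r in zip(new_mean, ref_mean))

    prototypes = {}
    for cls in sorted(set(measured_by_class) | set(reference_by_class)):
        # Synthesized training vectors: transformed reference vectors.
        synthesized = [
            tuple(x + o for x, o in zip(v, offset))
            for v in reference_by_class.get(cls, [])
        ]
        pooled = list(measured_by_class.get(cls, [])) + synthesized
        prototypes[cls] = mean_vector(pooled)
    return prototypes


# Reference speakers supply plenty of data; the new speaker supplies little.
reference = {"A": [(0.0, 0.0), (2.0, 0.0)], "B": [(4.0, 4.0), (6.0, 4.0)]}
measured = {"A": [(1.0, 1.0), (3.0, 1.0)], "B": [(8.0, 7.0)]}

prototypes = generate_prototypes(measured, reference)
print(prototypes["A"])  # (2.0, 1.0)
```

In this toy data class `B` has only one measured vector, so the synthesized vectors contribute most of its prototype. That is the situation the pooling is meant to help with: little training data from the new speaker.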
54 Citations
39 Claims
1. A speech coding apparatus as set out in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
10. A speech coding method comprising:
measuring the value of at least one feature of an utterance during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values;
storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value, each prototype vector signal having a unique identification value;
comparing the closeness of the feature value of a feature vector signal to the parameter values of the prototype vector signals to obtain prototype match scores for the feature vector signal and each prototype vector signal; and
outputting at least the identification value of the prototype vector signal having the best prototype match score as a coded representation signal of the feature vector signal;
characterized in that the method further comprises:
storing a plurality of reference feature vector signals, each reference feature vector signal representing the value of at least one feature of one or more utterances of one or more speakers in a reference set of speakers during each of a plurality of successive time intervals;
storing a plurality of measured training feature vector signals, each measured training feature vector signal representing the value of at least one feature of one or more utterances of a speaker not in the reference set during each of a plurality of successive time intervals;
transforming at least one reference feature vector signal into a synthesized training feature vector signal; and
generating the prototype vector signals from both the measured training vector signals and from the synthesized training vector signal.
Dependent claims: 11, 12, 13, 14, 15, 16.
17. A speech recognition apparatus comprising:
means for measuring the value of at least one feature of an utterance during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values;
means for storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value, each prototype vector signal having a unique identification value;
means for comparing the closeness of the feature value of each feature vector signal to the parameter values of the prototype vector signals to obtain prototype match scores for each feature vector signal and each prototype vector signal;
means for outputting at least the identification values of the prototype vector signals having the best prototype match score for each feature vector signal as a sequence of coded representations of the utterance;
means for generating a match score for each of a plurality of speech units, each match score comprising an estimate of the closeness of a match between a model of the speech unit and the sequence of coded representations of the utterance, each speech unit comprising one or more speech subunits;
means for identifying one or more best candidate speech units having the best match scores; and
means for outputting at least one speech subunit of one or more of the best candidate speech units;
characterized in that the apparatus further comprises:
means for storing a plurality of reference feature vector signals, each reference feature vector signal representing the value of at least one feature of one or more utterances of one or more speakers in a reference set of speakers during each of a plurality of successive time intervals;
means for storing a plurality of measured training feature vector signals, each measured training feature vector signal representing the value of at least one feature of one or more utterances of a speaker not in the reference set during each of a plurality of successive time intervals;
means for transforming at least one reference feature vector signal into a synthesized training feature vector signal; and
means for generating the prototype vector signals from both the measured training vector signals and from the synthesized training vector signal.
Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32.
33. A speech recognition method comprising:
measuring the value of at least one feature of an utterance during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values;
storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value, each prototype vector signal having a unique identification value;
comparing the closeness of the feature value of each feature vector signal to the parameter values of the prototype vector signals to obtain prototype match scores for each feature vector signal and each prototype vector signal;
outputting at least the identification values of the prototype vector signals having the best prototype match score for each feature vector signal as a sequence of coded representations of the utterance;
generating a match score for each of a plurality of speech units, each match score comprising an estimate of the closeness of a match between a model of the speech unit and the sequence of coded representations of the utterance, each speech unit comprising one or more speech subunits;
identifying one or more best candidate speech units having the best match scores; and
outputting at least one speech subunit of one or more of the best candidate speech units;
characterized in that the method further comprises:
storing a plurality of reference feature vector signals, each reference feature vector signal representing the value of at least one feature of one or more utterances of one or more speakers in a reference set of speakers during each of a plurality of successive time intervals;
storing a plurality of measured training feature vector signals, each measured training feature vector signal representing the value of at least one feature of one or more utterances of a speaker not in the reference set during each of a plurality of successive time intervals;
transforming at least one reference feature vector signal into a synthesized training feature vector signal; and
generating the prototype vector signals from both the measured training vector signals and from the synthesized training vector signal.
Dependent claims: 34, 35, 36, 37, 38, 39.
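The recognition claims add a scoring stage on top of the coded sequence: each candidate speech unit gets a match score against the sequence of prototype identification values, and the best candidates are selected. The sketch below is only an illustration of that flow. The claims do not specify the form of the speech unit model, so each model here is assumed to be a simple probability table over prototype labels that ignores label order, and the match score is a log-likelihood. All names (`unit_match_score`, `recognize`, the units `yes` and `no`) and all probabilities are hypothetical.

```python
import math


def unit_match_score(labels, label_probs, floor=1e-6):
    """Log-likelihood of the label sequence under one unit's label table.

    label_probs maps a prototype ID to its probability for this unit
    (an assumed, order-free stand-in for a real speech unit model).
    Labels missing from the table receive a small floor probability.
    """
    return sum(math.log(label_probs.get(label, floor)) for label in labels)


def recognize(labels, unit_models, n_best=1):
    """Return the n_best speech units with the highest match scores."""
    scored = sorted(
        ((unit_match_score(labels, model), unit) for unit, model in unit_models.items()),
        reverse=True,
    )
    return [unit for _, unit in scored[:n_best]]


# Hypothetical models for two speech units.
unit_models = {
    "yes": {"P1": 0.7, "P2": 0.2, "P3": 0.1},
    "no": {"P1": 0.1, "P2": 0.2, "P3": 0.7},
}

coded_utterance = ["P1", "P1", "P2"]
best = recognize(coded_utterance, unit_models)
print(best)  # ['yes']
```

Here each speech unit is a single word, so returning the unit also returns its only subunit; with multi-word units the final step would output one or more of the words in the best candidates.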
Specification