Speech coding apparatus having speaker dependent prototypes generated from nonuser reference data

US 5,278,942 A
Filed: 12/05/1991
Issued: 01/11/1994
Est. Priority Date: 12/05/1991
Status: Expired due to Fees

Abstract

A speech coding apparatus and method for use in a speech recognition apparatus and method. The value of at least one feature of an utterance is measured during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values. A plurality of prototype vector signals, each having at least one parameter value and a unique identification value, are stored. The closeness of the feature value of a feature vector signal to the parameter values of the prototype vector signals is compared to obtain prototype match scores for the feature vector signal and each prototype vector signal. The identification value of the prototype vector signal having the best prototype match score is output as a coded representation signal of the feature vector signal. Speaker-dependent prototype vector signals are generated from both synthesized training vector signals and measured training vector signals. The synthesized training vector signals are transformed reference feature vector signals representing the values of features of one or more utterances of one or more speakers in a reference set of speakers. The measured training feature vector signals represent the values of features of one or more utterances of a new speaker/user not in the reference set.
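
The coding step described in the abstract can be illustrated with a minimal sketch. It assumes Euclidean distance as the prototype match score (smaller is better); the abstract does not fix a particular closeness measure, and the names `prototype_match_score`, `code_utterance`, and the prototype IDs `P1` to `P3` are hypothetical.

```python
import math

def prototype_match_score(feature_vector, prototype_params):
    # Assumption: closeness is plain Euclidean distance, so a smaller
    # score means a better match.
    return math.dist(feature_vector, prototype_params)

def code_utterance(feature_vectors, prototypes):
    """Replace each feature vector by the ID of its closest prototype.

    prototypes maps a unique identification value to a parameter list.
    """
    labels = []
    for fv in feature_vectors:
        best_id = min(
            prototypes,
            key=lambda pid: prototype_match_score(fv, prototypes[pid]),
        )
        labels.append(best_id)
    return labels

# Illustrative data only: three prototypes and a three-frame utterance.
prototypes = {"P1": [0.0, 0.0], "P2": [1.0, 1.0], "P3": [4.0, 0.0]}
utterance = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4]]
labels = code_utterance(utterance, prototypes)
print(labels)
```

Each frame of the utterance is thereby reduced to a single identification value, which is the coded representation the later recognition stages consume.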

39 Claims

1. A speech coding apparatus comprising:
- means for measuring the value of at least one feature of an utterance during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values;
  
  means for storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value, each prototype vector signal having a unique identification value;
  
  means for comparing the closeness of the feature value of a feature vector signal to the parameter values of the prototype vector signals to obtain prototype match scores for the feature vector signal and each prototype vector signal; and
  
  means for outputting at least the identification value of the prototype vector signal having the best prototype match score as a coded representation signal of the feature vector signal;
  
  characterized in that the apparatus further comprises:
  
  means for storing a plurality of reference feature vector signals, each reference feature vector signal representing the value of at least one feature of one or more utterances of one or more speakers in a reference set of speakers during each of a plurality of successive time intervals;
  
  means for storing a plurality of measured training feature vector signals, each measured training feature vector signal representing the value of at least one feature of one or more utterances of a speaker not in the reference set during each of a plurality of successive time intervals;
  
  means for transforming at least one reference feature vector signal into a synthesized training feature vector signal; and
  
  means for generating the prototype vector signals from both the measured training vector signals and from the synthesized training vector signal.
  - 2. A speech coding apparatus as claimed in claim 1, characterized in that the transforming means applies a nonlinear transformation to the reference feature vector signal to produce the synthesized training feature vector signal.
  - 3. A speech coding apparatus as claimed in claim 2, characterized in that the nonlinear transformation is a piecewise linear transformation.
  - 4. A speech coding apparatus as claimed in claim 3, characterized in that the nonlinear transformation maps the reference feature vector signals to the training feature vector signals.
  - 5. A speech coding apparatus as claimed in claim 3, characterized in that a first subset of the reference feature vector signals has a mean, a first subset of the training feature vector signals has a mean, and the nonlinear transformation maps the mean of the first subset of the reference feature vector signals to the mean of the first subset of the training feature vector signals.
  - 6. A speech coding apparatus as claimed in claim 5, characterized in that the first subset of the reference feature vector signals has a variance, the first subset of the training feature vector signals has a variance, and the nonlinear transformation maps the variance of the first subset of the reference feature vector signals to the variance of the first subset of the training feature vector signals.
  - 7. A speech coding apparatus as claimed in claim 5, characterized in that a subgroup of the first subset of the reference feature vector signals has a mean, a subgroup of the first subset of the training feature vector signals has a mean, and the nonlinear transformation maps the mean of the subgroup of the first subset of the reference feature vector signals to the mean of the subgroup of the first subset of the training feature vector signals.
  - 8. A speech coding apparatus as claimed in claim 5, characterized in that the means for storing a plurality of prototype vector signals comprises electronic read/write memory.
  - 9. A speech coding apparatus as claimed in claim 8, characterized in that the measuring means comprises a microphone.
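
The mean and variance mapping of claims 5 and 6 can be sketched for a single subset as follows. This is an illustration under stated assumptions, not the patent's implementation: it treats each feature dimension independently, uses a per-dimension scale and shift, and the function names `fit_affine_map` and `apply_affine_map` are hypothetical.

```python
from statistics import fmean, pstdev

def fit_affine_map(reference_subset, training_subset):
    """Fit a per-dimension scale and shift that carries the reference
    subset's mean and variance onto those of the training subset."""
    params = []
    for ref_col, trn_col in zip(zip(*reference_subset), zip(*training_subset)):
        mu_r, mu_t = fmean(ref_col), fmean(trn_col)
        sd_r, sd_t = pstdev(ref_col), pstdev(trn_col)
        # Fall back to a pure shift if the reference dimension is constant.
        scale = sd_t / sd_r if sd_r > 0 else 1.0
        params.append((mu_r, mu_t, scale))
    return params

def apply_affine_map(vector, params):
    # y = mu_t + scale * (x - mu_r), applied dimension by dimension.
    return [mu_t + scale * (x - mu_r)
            for x, (mu_r, mu_t, scale) in zip(vector, params)]

# Illustrative data: reference speakers vs. a few frames from the new speaker.
reference = [[1.0, 10.0], [3.0, 14.0]]
training = [[2.0, 0.0], [6.0, 1.0], [4.0, 0.5]]

params = fit_affine_map(reference, training)
synthesized = [apply_affine_map(v, params) for v in reference]
```

After the mapping, the synthesized vectors have the same per-dimension mean and variance as the measured training vectors of the new speaker, so they can stand in for training data the new speaker never recorded.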

10. A speech coding method comprising:
- measuring the value of at least one feature of an utterance during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values;
  
  storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value, each prototype vector signal having a unique identification value;
  
  comparing the closeness of the feature value of a feature vector signal to the parameter values of the prototype vector signals to obtain prototype match scores for the feature vector signal and each prototype vector signal; and
  
  outputting at least the identification value of the prototype vector signal having the best prototype match score as a coded representation signal of the feature vector signal;
  
  characterized in that the method further comprises:
  
  storing a plurality of reference feature vector signals, each reference feature vector signal representing the value of at least one feature of one or more utterances of one or more speakers in a reference set of speakers during each of a plurality of successive time intervals;
  
  storing a plurality of measured training feature vector signals, each measured training feature vector signal representing the value of at least one feature of one or more utterances of a speaker not in the reference set during each of a plurality of successive time intervals;
  
  transforming at least one reference feature vector signal into a synthesized training feature vector signal; and
  
  generating the prototype vector signals from both the measured training vector signals and from the synthesized training vector signal.
  - 11. A speech coding method as claimed in claim 10, characterized in that the transforming step applies a nonlinear transformation to the reference feature vector signal to produce the synthesized training feature vector signal.
  - 12. A speech coding method as claimed in claim 11, characterized in that the nonlinear transformation is a piecewise linear transformation.
  - 13. A speech coding method as claimed in claim 12, characterized in that the nonlinear transformation maps the reference feature vector signals to the training feature vector signals.
  - 14. A speech coding method as claimed in claim 12, characterized in that a first subset of the reference feature vector signals has a mean, a first subset of the training feature vector signals has a mean, and the nonlinear transformation maps the mean of the first subset of the reference feature vector signals to the mean of the first subset of the training feature vector signals.
  - 15. A speech coding method as claimed in claim 14, characterized in that the first subset of the reference feature vector signals has a variance, the first subset of the training feature vector signals has a variance, and the nonlinear transformation maps the variance of the first subset of the reference feature vector signals to the variance of the first subset of the training feature vector signals.
  - 16. A speech coding method as claimed in claim 14, characterized in that a subgroup of the first subset of the reference feature vector signals has a mean, a subgroup of the first subset of the training feature vector signals has a mean, and the nonlinear transformation maps the mean of the subgroup of the first subset of the reference feature vector signals to the mean of the subgroup of the first subset of the training feature vector signals.
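
The final step of claim 10, generating prototypes from both measured and synthesized training vectors, can be sketched by pooling the two sets and clustering them. The claims do not name a clustering procedure; plain k-means with a naive deterministic initialization is assumed here purely for illustration, and `generate_prototypes` is a hypothetical name.

```python
import math

def generate_prototypes(measured, synthesized, k, iterations=10):
    """Cluster the pooled training vectors and return one prototype
    (cluster centroid) per cluster, keyed by a unique ID."""
    pooled = measured + synthesized
    # Assumption: seed the centroids with the first k pooled vectors.
    centroids = [list(v) for v in pooled[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in pooled:
            nearest = min(range(k), key=lambda i: math.dist(v, centroids[i]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return {f"P{i + 1}": c for i, c in enumerate(centroids)}

# Illustrative data: two measured frames from the new speaker plus four
# synthesized frames derived from reference speakers.
measured = [[0.0, 0.0], [5.0, 5.0]]
synthesized = [[0.2, 0.0], [5.0, 4.8], [0.0, 0.2], [4.8, 5.0]]
prototypes = generate_prototypes(measured, synthesized, k=2)
```

Because the synthesized vectors outnumber the measured ones in this toy data, they dominate the centroids, which mirrors the motivation of supplementing a small amount of new-speaker data with transformed reference data.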

17. A speech recognition apparatus comprising:
- means for measuring the value of at least one feature of an utterance during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values;
  
  means for storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value, each prototype vector signal having a unique identification value;
  
  means for comparing the closeness of the feature value of each feature vector signal to the parameter values of the prototype vector signals to obtain prototype match scores for each feature vector signal and each prototype vector signal;
  
  means for outputting at least the identification values of the prototype vector signals having the best prototype match score for each feature vector signal as a sequence of coded representations of the utterance;
  
  means for generating a match score for each of a plurality of speech units, each match score comprising an estimate of the closeness of a match between a model of the speech unit and the sequence of coded representations of the utterance, each speech unit comprising one or more speech subunits;
  
  means for identifying one or more best candidate speech units having the best match scores; and
  
  means for outputting at least one speech subunit of one or more of the best candidate speech units;
  
  characterized in that the apparatus further comprises:
  
  means for storing a plurality of reference feature vector signals, each reference feature vector signal representing the value of at least one feature of one or more utterances of one or more speakers in a reference set of speakers during each of a plurality of successive time intervals;
  
  means for storing a plurality of measured training feature vector signals, each measured training feature vector signal representing the value of at least one feature of one or more utterances of a speaker not in the reference set during each of a plurality of successive time intervals;
  
  means for transforming at least one reference feature vector signal into a synthesized training feature vector signal; and
  
  means for generating the prototype vector signals from both the measured training vector signals and from the synthesized training vector signal.
  - 18. A speech recognition apparatus as claimed in claim 17, characterized in that the transforming means applies a nonlinear transformation to the reference feature vector signal to produce the synthesized training feature vector signal.
  - 19. A speech recognition apparatus as claimed in claim 18, characterized in that the nonlinear transformation is a piecewise linear transformation.
  - 20. A speech recognition apparatus as claimed in claim 19, characterized in that the nonlinear transformation maps the reference feature vector signals to the training feature vector signals.
  - 21. A speech recognition apparatus as claimed in claim 19, characterized in that a first subset of the reference feature vector signals has a mean, a first subset of the training feature vector signals has a mean, and the nonlinear transformation maps the mean of the first subset of the reference feature vector signals to the mean of the first subset of the training feature vector signals.
  - 22. A speech recognition apparatus as claimed in claim 21, characterized in that the first subset of the reference feature vector signals has a variance, the first subset of the training feature vector signals has a variance, and the nonlinear transformation maps the variance of the first subset of the reference feature vector signals to the variance of the first subset of the training feature vector signals.
  - 23. A speech recognition apparatus as claimed in claim 21, characterized in that a subgroup of the first subset of the reference feature vector signals has a mean, a subgroup of the first subset of the training feature vector signals has a mean, and the nonlinear transformation maps the mean of the subgroup of the first subset of the reference feature vector signals to the mean of the subgroup of the first subset of the training feature vector signals.
  - 24. A speech recognition apparatus as claimed in claim 21, characterized in that the means for storing a plurality of prototype vector signals comprises electronic read/write memory.
  - 25. A speech recognition apparatus as claimed in claim 24, characterized in that the measuring means comprises a microphone.
  - 26. A speech recognition apparatus as claimed in claim 25, characterized in that the speech subunit output means comprises a video display.
  - 27. A speech recognition apparatus as claimed in claim 26, characterized in that the video display comprises a cathode ray tube.
  - 28. A speech recognition apparatus as claimed in claim 26, characterized in that the video display comprises a liquid crystal display.
  - 29. A speech recognition apparatus as claimed in claim 26, characterized in that the video display comprises a printer.
  - 30. A speech recognition apparatus as claimed in claim 25, characterized in that the speech subunit output means comprises an audio generator.
  - 31. A speech recognition apparatus as claimed in claim 30, characterized in that the audio generator comprises a loudspeaker.
  - 32. A speech recognition apparatus as claimed in claim 30, characterized in that the audio generator comprises a headphone.
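
The match-score and candidate-selection elements of claim 17 can be illustrated with a deliberately crude sketch. It assumes each speech unit model is just a probability table over prototype IDs and scores a label sequence by summed log probability; the claims do not specify the form of the model, and a real recognizer would use a richer sequence model. All names and numbers below are hypothetical.

```python
import math

def unit_match_score(labels, label_probs, floor=1e-6):
    # Summed log probability of the label sequence under a unigram
    # label model; unseen labels get a small floor probability.
    return sum(math.log(label_probs.get(lab, floor)) for lab in labels)

def best_candidates(labels, unit_models, n_best=1):
    """Return the n_best speech units with the highest match scores."""
    ranked = sorted(
        unit_models,
        key=lambda unit: unit_match_score(labels, unit_models[unit]["label_probs"]),
        reverse=True,
    )
    return ranked[:n_best]

# Illustrative models: each speech unit lists its subunits and a
# probability for each prototype ID.
unit_models = {
    "yes": {"subunits": ["y", "eh", "s"],
            "label_probs": {"P1": 0.7, "P2": 0.2, "P3": 0.1}},
    "no": {"subunits": ["n", "ow"],
           "label_probs": {"P1": 0.1, "P2": 0.3, "P3": 0.6}},
}
coded_utterance = ["P1", "P1", "P2"]

best = best_candidates(coded_utterance, unit_models)
first_subunit = unit_models[best[0]]["subunits"][0]
```

The last line corresponds to outputting at least one speech subunit of a best candidate speech unit, for example to a display or audio generator.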

33. A speech recognition method comprising:
- measuring the value of at least one feature of an utterance during each of a series of successive time intervals to produce a series of feature vector signals representing the feature values;
  
  storing a plurality of prototype vector signals, each prototype vector signal having at least one parameter value, each prototype vector signal having a unique identification value;
  
  comparing the closeness of the feature value of each feature vector signal to the parameter values of the prototype vector signals to obtain prototype match scores for each feature vector signal and each prototype vector signal;
  
  outputting at least the identification values of the prototype vector signals having the best prototype match score for each feature vector signal as a sequence of coded representations of the utterance;
  
  generating a match score for each of a plurality of speech units, each match score comprising an estimate of the closeness of a match between a model of the speech unit and the sequence of coded representations of the utterance, each speech unit comprising one or more speech subunits;
  
  identifying one or more best candidate speech units having the best match scores; and
  
  outputting at least one speech subunit of one or more of the best candidate speech units;
  
  characterized in that the method further comprises:
  
  storing a plurality of reference feature vector signals, each reference feature vector signal representing the value of at least one feature of one or more utterances of one or more speakers in a reference set of speakers during each of a plurality of successive time intervals;
  
  storing a plurality of measured training feature vector signals, each measured training feature vector signal representing the value of at least one feature of one or more utterances of a speaker not in the reference set during each of a plurality of successive time intervals;
  
  transforming at least one reference feature vector signal into a synthesized training feature vector signal; and
  
  generating the prototype vector signals from both the measured training vector signals and from the synthesized training vector signal.
  - 34. A speech recognition method as claimed in claim 33, characterized in that the step of transforming applies a nonlinear transformation to the reference feature vector signal to produce the synthesized training feature vector signal.
  - 35. A speech recognition method as claimed in claim 34, characterized in that the nonlinear transformation is a piecewise linear transformation.
  - 36. A speech recognition method as claimed in claim 35, characterized in that the nonlinear transformation maps the reference feature vector signals to the training feature vector signals.
  - 37. A speech recognition method as claimed in claim 35, characterized in that a first subset of the reference feature vector signals has a mean, a first subset of the training feature vector signals has a mean, and the nonlinear transformation maps the mean of the first subset of the reference feature vector signals to the mean of the first subset of the training feature vector signals.
  - 38. A speech recognition method as claimed in claim 37, characterized in that the first subset of the reference feature vector signals has a variance, the first subset of the training feature vector signals has a variance, and the nonlinear transformation maps the variance of the first subset of the reference feature vector signals to the variance of the first subset of the training feature vector signals.
  - 39. A speech recognition method as claimed in claim 37, characterized in that a subgroup of the first subset of the reference feature vector signals has a mean, a subgroup of the first subset of the training feature vector signals has a mean, and the nonlinear transformation maps the mean of the subgroup of the first subset of the reference feature vector signals to the mean of the subgroup of the first subset of the training feature vector signals.
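
One way to read the piecewise linear transformation of claims 35 to 39 is that each subset of reference vectors gets its own linear map, so the overall transformation is nonlinear. The sketch below assumes the subsets are already known (for example, keyed by a sound class) and uses a simple per-subset mean shift; the subset names and the function names `subset_means` and `synthesize` are hypothetical.

```python
from statistics import fmean

def subset_means(vectors_by_subset):
    """Per-dimension mean of each subset of vectors."""
    return {name: [fmean(col) for col in zip(*vecs)]
            for name, vecs in vectors_by_subset.items()}

def synthesize(reference_by_subset, training_by_subset):
    """Shift each reference subset so that its mean lands on the mean of
    the matching training subset. Each subset has its own shift, which
    makes the overall map piecewise linear."""
    ref_means = subset_means(reference_by_subset)
    trn_means = subset_means(training_by_subset)
    synthesized = {}
    for name, vecs in reference_by_subset.items():
        if name not in trn_means:
            continue  # no new-speaker data for this subset, so skip it
        shift = [t - r for t, r in zip(trn_means[name], ref_means[name])]
        synthesized[name] = [[x + s for x, s in zip(v, shift)] for v in vecs]
    return synthesized

# Illustrative data with two hypothetical subsets.
reference = {"vowel": [[1.0, 2.0], [3.0, 4.0]],
             "fricative": [[10.0, 0.0], [12.0, 2.0]]}
training = {"vowel": [[2.0, 2.0]],
            "fricative": [[9.0, 3.0], [9.0, 5.0]]}

synthesized = synthesize(reference, training)
```

Splitting a subset further into subgroups, as in claim 39, would amount to running the same procedure with finer-grained keys.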

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Bellegarda, Jerome R.; Nahamoo, David; Picheny, Michael A.; Gopalakrishnan, Ponani S.; Bahl, Lalit R.; Nadas, Arthur J.; De Souza, Peter V.
Primary Examiner(s): Fleming, Michael R.
Assistant Examiner(s): Doerrler, Michelle
Application Number: US07/802,678
Time in Patent Office: 768 Days
Field of Search: 381/29-45, 395/2
US Class Current: 704/200
CPC Class Codes:
G10L 15/02 Feature extraction for spee...
G10L 15/063 Training
