Neural network acoustic and visual speech recognition system training method and apparatus
First Claim
1. A training system for a speech recognition system comprising:
- (a) a speech recognition system for recognizing utterances belonging to a pre-established set of allowable candidate utterances using acoustic speech signals and selected concomitant dynamic visual facial feature motion between selected facial features associated with acoustic speech generation, comprising:
  - (i) an acoustic feature extraction apparatus for converting signals representative of dynamic acoustic speech into a corresponding dynamic acoustic feature vector set of signals,
  - (ii) a dynamic visual feature extraction apparatus for converting signals representative of the selected concomitant dynamic facial feature motion associated with acoustic speech generation into a corresponding dynamic visual feature vector set of signals, and
  - (iii) a time delay neural network classifying apparatus with an input-to-output transfer characteristic controlled by a set of adjustable synaptic weights for generating an output response vector representing a conditional probability distribution of the allowable candidate speech utterances by accepting and operating on a set of corresponding time-delayed dynamic acoustic and visual feature vector pairs that are respectively supplied by the acoustic and visual feature extraction apparatus to a set of inputs; and
- (b) a control system comprising a control processor and an associated memory coupled to the speech recognition system for initializing parameters, for controlling the speech recognition system, for storing acoustic and visual output exemplar vectors, for computing output errors, and for adjusting the time delay neural network classifying apparatus synaptic weights based on the computed errors in accordance with a prescribed training procedure.
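The time delay neural network classifying apparatus of element (a)(iii) can be illustrated with a minimal sketch: sliding time-delayed windows over paired acoustic/visual feature vectors, pooling over time, and emitting a probability distribution over the candidate utterances. This is an assumption-laden toy in NumPy, not the claimed apparatus; all dimensions, the single hidden layer, and the pooling choice are illustrative only.

```python
# Minimal TDNN sketch (assumed architecture, NOT the patented apparatus).
# All sizes below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

T, D_ACOUSTIC, D_VISUAL = 20, 14, 5      # frames and feature sizes (assumed)
DELAY, HIDDEN, N_UTTERANCES = 3, 8, 10   # time-delay window, hidden units, vocabulary

# Paired feature vectors, standing in for the acoustic and visual
# feature extraction apparatus outputs.
acoustic = rng.standard_normal((T, D_ACOUSTIC))
visual = rng.standard_normal((T, D_VISUAL))
features = np.concatenate([acoustic, visual], axis=1)   # (T, D)

# Adjustable synaptic weights controlling the transfer characteristic.
W1 = rng.standard_normal((DELAY * features.shape[1], HIDDEN)) * 0.1
W2 = rng.standard_normal((HIDDEN, N_UTTERANCES)) * 0.1

def tdnn_forward(x, W1, W2):
    """Slide a DELAY-frame window over time, average the per-window
    hidden activations, and map them to utterance probabilities."""
    windows = np.stack([x[t:t + DELAY].ravel()
                        for t in range(len(x) - DELAY + 1)])
    hidden = np.tanh(windows @ W1)      # (T-DELAY+1, HIDDEN)
    scores = hidden.mean(axis=0) @ W2   # pool over time -> (N_UTTERANCES,)
    e = np.exp(scores - scores.max())
    return e / e.sum()                  # conditional probability distribution

probs = tdnn_forward(features, W1, W2)
```

The softmax output plays the role of the claimed "output response vector representing a conditional probability distribution of the allowable candidate speech utterances."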
Abstract
The apparatus for the recognition of speech includes an acoustic preprocessor, a visual preprocessor, and a speech classifier that operates on the acoustic and visual preprocessed data. The acoustic preprocessor comprises a log mel spectrum analyzer that produces an equal mel bandwidth log power spectrum. The visual processor detects the motion of a set of fiducial markers on the speaker's face and extracts a set of normalized distance vectors describing lip and mouth movement. The speech classifier uses a multilevel time-delay neural network operating on the preprocessed acoustic and visual data to form an output probability distribution that indicates the probability of each candidate utterance having been spoken, based on the acoustic and visual data. The training system includes the speech recognition apparatus and a control processor with an associated memory. Noisy acoustic input training data together with visual data is used to generate acoustic and visual feature training vectors for processing by the speech classifier. A control computer adjusts the synaptic weights of the speech classifier based upon the noisy input training data and exemplar output vectors for producing a robustly trained classifier based on the analogous visual counterpart of the Lombard effect.
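The acoustic preprocessor's "equal mel bandwidth log power spectrum" can be sketched as follows: compute the FFT power spectrum of a frame, then sum it through triangular filters whose edges are equally spaced on the mel scale, and take the log. This is a generic mel-filterbank sketch under assumed parameters (sample rate, FFT size, filter count); the patent's exact analyzer design is not reproduced here.

```python
# Generic log mel power spectrum sketch (assumed parameters, not the
# patent's analyzer). Filters are triangular with equal mel bandwidth.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrum(frame, sample_rate=8000, n_fft=256, n_filters=14):
    """Log power in triangular filters spaced at equal mel bandwidth."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2            # power spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Filter edges equally spaced on the mel scale (equal mel bandwidth).
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0),
                                  hz_to_mel(sample_rate / 2.0),
                                  n_filters + 2))
    out = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0.0, None)    # rising edge
        down = np.clip((hi - freqs) / (hi - mid), 0.0, None)  # falling edge
        weights = np.clip(np.minimum(up, down), 0.0, 1.0)     # triangle
        out[i] = np.log(weights @ power + 1e-10)              # log mel power
    return out

frame = np.sin(2 * np.pi * 440 * np.arange(256) / 8000)       # 440 Hz test tone
mel_vec = log_mel_spectrum(frame)
```

Each frame of speech thus yields one fixed-length log-mel vector, which is the kind of dynamic acoustic feature vector the classifier consumes frame by frame.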
88 Citations
10 Claims
1. A training system for a speech recognition system, set forth in full under First Claim above. Dependent claims: 2, 3, 4, 5, 6.
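The dynamic visual feature extraction recited in claim 1, which the Abstract describes as normalized distance vectors derived from fiducial markers on the speaker's face, can be sketched as follows. The marker layout and the choice of marker pairs below are assumptions for illustration only.

```python
# Hedged sketch of visual feature extraction: pairwise distances between
# fiducial markers, normalized to remove overall face scale. The marker
# positions and pair choices are invented for this example.
import numpy as np

def visual_feature_vector(markers, pairs):
    """markers: (M, 2) fiducial positions for one video frame.
    pairs: (i, j) marker index pairs whose separations describe lip and
    mouth shape. Returns the distances normalized by their Euclidean norm."""
    markers = np.asarray(markers, dtype=float)
    d = np.array([np.linalg.norm(markers[i] - markers[j]) for i, j in pairs])
    return d / np.linalg.norm(d)   # scale-normalized distance vector

# Illustrative frame: four markers (left/right mouth corner, upper/lower lip).
frame = [(0.0, 0.0), (4.0, 0.0), (2.0, 1.0), (2.0, -1.0)]
pairs = [(0, 1), (2, 3), (0, 2), (1, 3)]   # width, opening, two diagonals
v = visual_feature_vector(frame, pairs)
```

Tracking these vectors frame by frame produces the "corresponding dynamic visual feature vector set of signals" paired with the acoustic features at the classifier inputs.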
7. A method for training a speech recognition system for recognizing utterances belonging to a pre-established set of allowable candidate utterances using acoustic speech signals and selected concomitant dynamic visual facial feature motion between selected facial features associated with acoustic speech generation, the speech recognition system comprising an acoustic feature extraction apparatus for converting signals representative of dynamic acoustic speech into a corresponding dynamic acoustic feature vector set of signals, a dynamic visual feature extraction apparatus for converting signals representative of the selected concomitant dynamic facial feature motion associated with acoustic speech generation into a corresponding dynamic visual feature vector set of signals, a time delay neural network classifying apparatus with an input-to-output transfer characteristic controlled by a set of adjustable synaptic weights for generating an output response vector representing a conditional probability distribution of the allowable candidate speech utterances by accepting and operating on a set of corresponding time-delayed dynamic acoustic and visual feature vector pairs that are respectively supplied to a set of inputs by the acoustic and visual feature extraction apparatus, and a control system for controlling the speech recognition system, for storing dynamic visual and dynamic acoustic feature training vectors, for applying the feature training vectors to the time delay neural network classifying apparatus and computing output errors, and for adjusting the time delay neural network classifying apparatus synaptic weights based on the computed output errors, the method comprising:
(a) initializing the time delay neural network classifying apparatus synaptic weights;
(b) applying a corresponding pair of dynamic acoustic and visual feature training vectors to the set of inputs of the time delay neural network classifying apparatus and generating an output response vector;
(c) comparing the output response vector with an exemplar output response vector, stored in the control system associated memory, corresponding to the training vectors applied to the neural network classifier, and producing an error measure; and
(d) adjusting the time delay neural network classifying apparatus synaptic weights in accordance with a prescribed algorithm.

Dependent claims: 8, 9, 10.
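Steps (a) through (d) of the training method can be sketched with a toy single-layer stand-in for the TDNN: initialize the weights, apply a training-vector pair, compare the output response vector with a stored exemplar, and adjust the weights. Gradient descent on a cross-entropy error is used here as one example of a "prescribed algorithm"; the network, dimensions, and learning rate are all assumptions.

```python
# Hedged sketch of training steps (a)-(d) with a toy single-layer
# softmax classifier standing in for the TDNN. Assumed throughout.
import numpy as np

rng = np.random.default_rng(1)
D, N_UTTERANCES, LR = 8, 4, 0.05

def forward(W, x):
    scores = x @ W
    e = np.exp(scores - scores.max())
    return e / e.sum()                     # output response vector

# (a) initialize the synaptic weights
W = rng.standard_normal((D, N_UTTERANCES)) * 0.01

x = rng.standard_normal(D)                 # paired acoustic+visual training vector
exemplar = np.zeros(N_UTTERANCES)          # stored exemplar output vector
exemplar[2] = 1.0                          # the utterance actually spoken

errors = []
for _ in range(300):
    y = forward(W, x)                      # (b) apply vectors, get response
    err = y - exemplar                     # (c) compare with the exemplar
    errors.append(-np.log(y[2] + 1e-12))   #     cross-entropy error measure
    W -= LR * np.outer(x, err)             # (d) adjust weights (gradient step
                                           #     for softmax + cross-entropy)

y = forward(W, x)
```

After training, the output distribution concentrates on the exemplar's utterance; in the full system this loop runs over many noisy acoustic/visual training pairs to exploit the visual counterpart of the Lombard effect described in the Abstract.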
Specification