Facial feature extraction method and apparatus for a neural network acoustic and visual speech recognition system
First Claim
1. A method for extracting a visual feature vector from a sequence of images, each having a plurality of horizontal raster lines, of frontal views of a speaker's face in a speech classification system, the method comprising the following steps:
a) sampling and quantizing each image at uniform intervals along each horizontal raster line of the image to produce an image represented by an array of pixels, each pixel value representing gray-scale level;
b) preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using spatial convolution techniques;
c) thresholding the preconditioned pixel image by using a threshold value for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
d) calculating a left eye area location, a right eye area location, and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
e) establishing an eye line segment as a straight line connecting the left and right eye area locations;
f) establishing a vertical axis of symmetry as a straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
g) establishing a mouth line by passing a straight line through the mouth area location, the mouth line being perpendicular to the vertical axis of symmetry;
h) selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
i) selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale values; and
j) selecting a set of pixels and associated pixel values that occur at the peaks and valleys (maximas and minimas) of the vertical and the horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector.
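The geometric construction of steps d) through g) can be sketched as follows. This is a minimal illustration assuming the three areas have already been obtained as binary masks; the function and variable names are hypothetical, not from the patent:

```python
import numpy as np

def centroid(mask: np.ndarray) -> np.ndarray:
    """Centroid (row, col) of the nonzero pixels in a binary area mask."""
    rows, cols = np.nonzero(mask)
    return np.array([rows.mean(), cols.mean()])

def face_geometry(left_eye: np.ndarray, right_eye: np.ndarray, mouth: np.ndarray):
    """From the three thresholded areas, derive the eye line midpoint,
    the axis-of-symmetry direction, and the mouth line (steps d-g)."""
    le, re, mo = centroid(left_eye), centroid(right_eye), centroid(mouth)
    eye_vec = re - le                          # eye line segment (step e)
    midpoint = (le + re) / 2.0                 # bisection point (step f)
    # axis of symmetry: unit vector perpendicular to the eye line,
    # passing through the midpoint
    axis_dir = np.array([eye_vec[1], -eye_vec[0]])
    axis_dir /= np.linalg.norm(axis_dir)
    # mouth line: through the mouth centroid, perpendicular to the axis
    # of symmetry, hence parallel to the eye line (step g)
    mouth_dir = eye_vec / np.linalg.norm(eye_vec)
    return midpoint, axis_dir, mo, mouth_dir
```

The vertical and horizontal sectional views of steps h) and i) would then be sampled along `axis_dir` from `midpoint` and along `mouth_dir` from the mouth centroid, respectively.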
Abstract
A facial feature extraction method and apparatus uses the variation in light intensity (gray scale) of a frontal view of a speaker's face. The sequence of video images is sampled and quantized into a regular array of 150×150 pixels that naturally forms a coordinate system of scan lines and pixel positions along a scan line. Left and right eye areas and a mouth area are located by thresholding the pixel gray-scale values and finding the centroids of the three areas. The line segment joining the eye-area centroids is bisected at a right angle to form an axis of symmetry. A straight line through the centroid of the mouth area, at a right angle to the axis of symmetry, constitutes the mouth line. Pixels along the mouth line and along the axis of symmetry in the vicinity of the mouth area form a horizontal and a vertical gray-scale profile, respectively. The profiles could be used as feature vectors directly, but it is more efficient to use, as visual vector components, the peaks and valleys (maxima and minima) of the profiles, which correspond to important physiological speech features such as the positions and pixel values of the lower lip, upper lip, mouth corners, and mouth area, together with their time derivatives. Time derivatives are estimated from pixel position and value changes between video image frames. A speech recognition system uses the visual feature vector in combination with a concomitant acoustic feature vector as inputs to a time-delay neural network.
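The profile-to-feature step described in the abstract can be illustrated with a short sketch (names and the toy profile are hypothetical, not from the patent): a 1-D gray-scale profile is scanned for interior local maxima and minima, and each extremum contributes its pixel position and value as feature vector elements.

```python
import numpy as np

def extrema_features(profile: np.ndarray):
    """(index, value) pairs at interior peaks and valleys (local maxima
    and minima) of a 1-D gray-scale profile."""
    d = np.diff(profile)
    # a sign change in the slope marks a local extremum
    idx = np.nonzero(d[:-1] * d[1:] < 0)[0] + 1
    return [(int(i), float(profile[i])) for i in idx]

# toy vertical profile through the mouth: bright skin, dark upper lip,
# brighter between the lips, dark lower lip, bright skin again
profile = np.array([200, 180, 80, 60, 150, 190, 70, 90, 210], float)
features = extrema_features(profile)
# -> [(3, 60.0), (5, 190.0), (6, 70.0)]: two lip valleys around a mid peak
```

Repeating this per frame and differencing extremum positions and values between frames would give the time-derivative components the abstract mentions.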
13 Claims
Claim 1, recited above under First Claim, has dependent claims 2, 3, 4, and 5.
6. An apparatus for extracting a visual feature vector from a sequence of raster-scanned video images of frontal views of a speaker's face for use in a speech classification system, the apparatus comprising:
a) analog-to-digital conversion means for sampling and quantizing each raster-scanned video image at uniform intervals along each raster scan to produce an image of pixels representing gray-scale level centered at the uniform intervals;
b) filter means for preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using spatial convolution techniques;
c) threshold means using a threshold level for thresholding the preconditioned pixel image for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
d) computing means for calculating a left eye area location, a right eye area location, and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
e) computing means for establishing an eye line as a straight line passing through the left and right eye area locations;
f) computing means for establishing a vertical axis of symmetry as the straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
g) computing means for establishing a mouth line by passing a straight line through the mouth area location, the line being perpendicular to the vertical axis of symmetry;
h) computing means for selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
i) computing means for selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale pixel values; and
j) computing means for selecting a set of pixels and associated pixel values that occur at peaks and valleys (maximas and minimas) of the vertical and horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector.
Dependent claims: 7, 8
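The filter and threshold means of elements b) and c) might be sketched as below: a smoothing convolution followed by an edge-enhancing (sharpening) convolution, then a gray-scale threshold that isolates the dark eye and mouth areas. The kernels and threshold value are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def convolve2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Minimal same-size 2-D spatial convolution with zero padding."""
    kh, kw = kernel.shape
    p = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (p[i:i + kh, j:j + kw] * kernel).sum()
    return out

SMOOTH = np.full((3, 3), 1 / 9)                                    # box smoothing
SHARPEN = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], float)   # edge enhancement

def precondition(img: np.ndarray, threshold: float = 80.0) -> np.ndarray:
    """Smooth, enhance edges, then threshold dark regions (eyes, mouth)."""
    g = convolve2d(convolve2d(img, SMOOTH), SHARPEN)
    return g < threshold    # binary mask of dark facial areas
```

A production version would use a library convolution routine; the hand-rolled loop is kept here only to make the spatial-convolution step explicit.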
9. A speech recognition system for recognizing utterances belonging to a preestablished set of allowable candidate utterances, comprising:
a) a visual feature vector extraction apparatus for forming a sequence of visual feature vectors from a sequence of raster-scanned video images of a speaker's face, comprising:
i) analog-to-digital conversion means for sampling and quantizing each raster-scanned video image at uniform intervals along each raster scan to produce an image of pixels representing gray-scale level centered at the uniform intervals;
ii) filter means for preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using convolution techniques;
iii) threshold means using a threshold level for thresholding the preconditioned pixel image for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
iv) computing means for calculating a left eye area location, a right eye area location, and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
v) computing means for establishing an eye line as a straight line passing through the left and right eye area locations;
vi) computing means for establishing a vertical axis of symmetry as the straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
vii) computing means for establishing a mouth line by passing a straight line through the mouth area location, the mouth line being perpendicular to the vertical axis of symmetry;
viii) computing means for selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
ix) computing means for selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale pixel values; and
x) computing means for selecting a set of pixels and associated pixel values that occur at peaks and valleys (maximas and minimas) of the vertical and horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector;
b) an acoustic feature vector extraction apparatus for converting signals representative of acoustic speech occurring concomitantly with the raster-scanned video images into a corresponding sequence of acoustic feature vectors; and
c) a neural network classifying apparatus for generating a conditional probability distribution of the allowable speech utterances by accepting and operating on the acoustic and visual feature vector sequences respectively supplied by the acoustic and visual feature vector extraction apparatus.
Dependent claims: 10, 11, 12, 13
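Element c) describes a time-delay neural network operating on the fused acoustic and visual sequences. A minimal single-layer sketch is given below; all dimensions, weights, and the single time-delay layer are illustrative assumptions, since the patent's network architecture is not specified in this excerpt:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def tdnn_classify(acoustic, visual, w, b, delay=3):
    """Toy time-delay layer: each output frame sees `delay` consecutive
    frames of the fused acoustic+visual sequence; the per-frame scores
    are averaged over time and softmaxed into a conditional probability
    distribution over the allowable candidate utterances."""
    fused = np.concatenate([acoustic, visual], axis=1)   # (T, d_a + d_v)
    T = fused.shape[0]
    windows = np.stack([fused[t - delay + 1:t + 1].ravel()
                        for t in range(delay - 1, T)])   # time-delay taps
    logits = windows @ w + b                             # (T-delay+1, classes)
    return softmax(logits.mean(axis=0))

# toy sizes: 8 frames, 12-dim acoustic + 10-dim visual vectors, 5 utterances
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(8, 12))
visual = rng.normal(size=(8, 10))
w = 0.1 * rng.normal(size=(3 * 22, 5))
b = np.zeros(5)
p = tdnn_classify(acoustic, visual, w, b)   # sums to 1 over 5 candidates
```

A trained system would learn `w` and `b` and typically stack several such delay layers; the sketch only shows how the two feature streams are fused and reduced to a distribution over candidate utterances.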
Specification