Facial feature extraction method and apparatus for a neural network acoustic and visual speech recognition system
First Claim
1. A method for extracting a visual feature vector from a sequence of images, each having a plurality of horizontal raster lines, of frontal views of a speaker's face in a speech classification system, the method comprising the following steps:
a) sampling and quantizing each image at uniform intervals along each horizontal raster line of the image to produce an image represented by an array of pixels, each pixel value representing gray-scale level;
b) preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using spatial convolution techniques;
c) thresholding the preconditioned pixel image by using a threshold value for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
d) calculating a left eye area location, a right eye area location, and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
e) establishing an eye line segment as a straight line connecting the left and right eye area locations;
f) establishing a vertical axis of symmetry as a straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
g) establishing a mouth line by passing a straight line through the mouth area location, the mouth line being perpendicular to the vertical axis of symmetry;
h) selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
i) selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale values; and
j) selecting a set of pixels and associated pixel values that occur at the peaks and valleys (maximas and minimas) of the vertical and the horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector.
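The geometric construction of steps d) through g) can be sketched as follows. This is a minimal illustration assuming the three areas have already been obtained as binary masks; the function and variable names are hypothetical, not from the patent:

```python
import numpy as np

def centroid(mask: np.ndarray) -> np.ndarray:
    """Centroid (row, col) of the nonzero pixels in a binary area mask."""
    rows, cols = np.nonzero(mask)
    return np.array([rows.mean(), cols.mean()])

def face_geometry(left_eye: np.ndarray, right_eye: np.ndarray, mouth: np.ndarray):
    """From the three thresholded areas, derive the eye line midpoint,
    the axis-of-symmetry direction, and the mouth line (steps d-g)."""
    le, re, mo = centroid(left_eye), centroid(right_eye), centroid(mouth)
    eye_vec = re - le                          # eye line segment (step e)
    midpoint = (le + re) / 2.0                 # bisection point (step f)
    # axis of symmetry: unit vector perpendicular to the eye line,
    # passing through the midpoint
    axis_dir = np.array([eye_vec[1], -eye_vec[0]])
    axis_dir /= np.linalg.norm(axis_dir)
    # mouth line: through the mouth centroid, perpendicular to the axis
    # of symmetry, hence parallel to the eye line (step g)
    mouth_dir = eye_vec / np.linalg.norm(eye_vec)
    return midpoint, axis_dir, mo, mouth_dir
```

The vertical and horizontal sectional views of steps h) and i) would then be sampled along `axis_dir` from `midpoint` and along `mouth_dir` from the mouth centroid, respectively.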
Abstract
A facial feature extraction method and apparatus uses the variation in light intensity (gray scale) of a frontal view of a speaker's face. The sequence of video images is sampled and quantized into a regular array of 150×150 pixels that naturally forms a coordinate system of scan lines and pixel positions along a scan line. Left and right eye areas and a mouth area are located by thresholding the pixel gray-scale values and finding the centroids of the three areas. The line segment joining the eye-area centroids is bisected at a right angle to form an axis of symmetry. A straight line through the centroid of the mouth area, at a right angle to the axis of symmetry, constitutes the mouth line. Pixels along the mouth line and along the axis of symmetry in the vicinity of the mouth area form a horizontal and a vertical gray-scale profile, respectively. The profiles could be used as feature vectors directly, but it is more efficient to use, as visual vector components, the peaks and valleys (maxima and minima) of the profiles, which correspond to important physiological speech features such as the positions and pixel values of the lower lip, upper lip, mouth corners, and mouth area, together with their time derivatives. Time derivatives are estimated from pixel position and value changes between video image frames. A speech recognition system uses the visual feature vector in combination with a concomitant acoustic feature vector as inputs to a time-delay neural network.
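The profile-to-feature step described in the abstract can be illustrated with a short sketch (names and the toy profile are hypothetical, not from the patent): a 1-D gray-scale profile is scanned for interior local maxima and minima, and each extremum contributes its pixel position and value as feature vector elements.

```python
import numpy as np

def extrema_features(profile: np.ndarray):
    """(index, value) pairs at interior peaks and valleys (local maxima
    and minima) of a 1-D gray-scale profile."""
    d = np.diff(profile)
    # a sign change in the slope marks a local extremum
    idx = np.nonzero(d[:-1] * d[1:] < 0)[0] + 1
    return [(int(i), float(profile[i])) for i in idx]

# toy vertical profile through the mouth: bright skin, dark upper lip,
# brighter between the lips, dark lower lip, bright skin again
profile = np.array([200, 180, 80, 60, 150, 190, 70, 90, 210], float)
features = extrema_features(profile)
# -> [(3, 60.0), (5, 190.0), (6, 70.0)]: two lip valleys around a mid peak
```

Repeating this per frame and differencing extremum positions and values between frames would give the time-derivative components the abstract mentions.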
13 Claims
Claim 1, recited above under First Claim, has dependent claims 2, 3, 4, and 5.
6. An apparatus for extracting a visual feature vector from a sequence of raster-scanned video images of frontal views of a speaker's face for use in a speech classification system, the apparatus comprising:
a) analog-to-digital conversion means for sampling and quantizing each raster-scanned video image at uniform intervals along each raster scan to produce an image of pixels representing gray-scale level centered at the uniform intervals;
b) filter means for preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using spatial convolution techniques;
c) threshold means using a threshold level for thresholding the preconditioned pixel image for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
d) computing means for calculating a left eye area location, a right eye area location, and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
e) computing means for establishing an eye line as a straight line passing through the left and right eye area locations;
f) computing means for establishing a vertical axis of symmetry as the straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
g) computing means for establishing a mouth line by passing a straight line through the mouth area location, the line being perpendicular to the vertical axis of symmetry;
h) computing means for selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
i) computing means for selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale pixel values; and
j) computing means for selecting a set of pixels and associated pixel values that occur at peaks and valleys (maximas and minimas) of the vertical and horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector.
Dependent claims: 7, 8
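The filter and threshold means of elements b) and c) might be sketched as below: a smoothing convolution followed by an edge-enhancing (sharpening) convolution, then a gray-scale threshold that isolates the dark eye and mouth areas. The kernels and threshold value are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def convolve2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Minimal same-size 2-D spatial convolution with zero padding."""
    kh, kw = kernel.shape
    p = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (p[i:i + kh, j:j + kw] * kernel).sum()
    return out

SMOOTH = np.full((3, 3), 1 / 9)                                    # box smoothing
SHARPEN = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], float)   # edge enhancement

def precondition(img: np.ndarray, threshold: float = 80.0) -> np.ndarray:
    """Smooth, enhance edges, then threshold dark regions (eyes, mouth)."""
    g = convolve2d(convolve2d(img, SMOOTH), SHARPEN)
    return g < threshold    # binary mask of dark facial areas
```

A production version would use a library convolution routine; the hand-rolled loop is kept here only to make the spatial-convolution step explicit.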
9. A speech recognition system for recognizing utterances belonging to a preestablished set of allowable candidate utterances, comprising:
a) a visual feature vector extraction apparatus for forming a sequence of visual feature vectors from a sequence of raster-scanned video images of a speaker's face, comprising:
i) analog-to-digital conversion means for sampling and quantizing each raster-scanned video image at uniform intervals along each raster scan to produce an image of pixels representing gray-scale level centered at the uniform intervals;
ii) filter means for preconditioning the pixel image by spatially smoothing and enhancing edges separating regions of greater and less gray-scale intensity using convolution techniques;
iii) threshold means using a threshold level for thresholding the preconditioned pixel image for determining a left eye area, a right eye area, and a mouth area, wherein the threshold value is used to define each of the left eye area, the right eye area, and the mouth area;
iv) computing means for calculating a left eye area location, a right eye area location, and a mouth area location from the left eye area, the right eye area, and the mouth area, respectively;
v) computing means for establishing an eye line as a straight line passing through the left and right eye area locations;
vi) computing means for establishing a vertical axis of symmetry as the straight line that is perpendicular to and bisects the eye line segment connecting the left and right eye area locations;
vii) computing means for establishing a mouth line by passing a straight line through the mouth area location, the mouth line being perpendicular to the vertical axis of symmetry;
viii) computing means for selecting image pixels along the axis of symmetry in the vicinity of the mouth line to form a vertical sectional view of gray-scale pixel values;
ix) computing means for selecting image pixels along the mouth line in the vicinity of the axis of symmetry to form a horizontal sectional view of gray-scale pixel values; and
x) computing means for selecting a set of pixels and associated pixel values that occur at peaks and valleys (maximas and minimas) of the vertical and horizontal gray-scale pixel value sectional views as a set of elements of a visual feature vector;
b) an acoustic feature vector extraction apparatus for converting signals representative of acoustic speech occurring concomitantly with the raster-scanned video images into a corresponding sequence of acoustic feature vectors; and
c) a neural network classifying apparatus for generating a conditional probability distribution of the allowable speech utterances by accepting and operating on the acoustic and visual feature vector sequences respectively supplied by the acoustic and visual feature vector extraction apparatus.
Dependent claims: 10, 11, 12, 13
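Element c) describes a time-delay neural network operating on the fused acoustic and visual sequences. A minimal single-layer sketch is given below; all dimensions, weights, and the single time-delay layer are illustrative assumptions, since the patent's network architecture is not specified in this excerpt:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def tdnn_classify(acoustic, visual, w, b, delay=3):
    """Toy time-delay layer: each output frame sees `delay` consecutive
    frames of the fused acoustic+visual sequence; the per-frame scores
    are averaged over time and softmaxed into a conditional probability
    distribution over the allowable candidate utterances."""
    fused = np.concatenate([acoustic, visual], axis=1)   # (T, d_a + d_v)
    T = fused.shape[0]
    windows = np.stack([fused[t - delay + 1:t + 1].ravel()
                        for t in range(delay - 1, T)])   # time-delay taps
    logits = windows @ w + b                             # (T-delay+1, classes)
    return softmax(logits.mean(axis=0))

# toy sizes: 8 frames, 12-dim acoustic + 10-dim visual vectors, 5 utterances
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(8, 12))
visual = rng.normal(size=(8, 10))
w = 0.1 * rng.normal(size=(3 * 22, 5))
b = np.zeros(5)
p = tdnn_classify(acoustic, visual, w, b)   # sums to 1 over 5 candidates
```

A trained system would learn `w` and `b` and typically stack several such delay layers; the sketch only shows how the two feature streams are fused and reduced to a distribution over candidate utterances.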
Specification