Electronic facial tracking and detection system and method and apparatus for automated speech recognition
First Claim
1. An apparatus for producing output indicating words spoken by a human speaker, said apparatus comprising:
- means for detecting sounds, converting said sounds to electrical signals, analyzing said signals to detect for words, and then producing an electrical acoustic output signal representing at least one spoken word;
- means for scanning said speaker's face and producing electrical image signals, each said signal representing an image in a sequence of video images of said speaker;
- means, responsive to each said image signal, for tracking said speaker's mouth by tracking said speaker's nostrils;
- means for analyzing portions of said image signals, said portions defined by said means for tracking, to produce a video output signal representing at least one visual manifestation of at least one spoken word;
- means for receiving and correlating said acoustic output signal and said video output signal to produce said output.
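The final clause correlates the acoustic and video output signals into a single result. As one minimal sketch of that step, assuming each recognizer independently scores candidate words, the combined output can be the word with the best weighted joint score (the names `fuse`, `acoustic_scores`, `visual_scores`, and the weight `ALPHA` are illustrative, not from the patent):

```python
# Hypothetical audio-visual fusion: each recognizer scores candidate
# words; the combined output is the word with the best weighted sum.
ALPHA = 0.7  # weight given to the acoustic channel (illustrative)

def fuse(acoustic_scores, visual_scores):
    """Combine per-word scores from the two recognizers."""
    words = set(acoustic_scores) & set(visual_scores)
    return max(
        words,
        key=lambda w: ALPHA * acoustic_scores[w] + (1 - ALPHA) * visual_scores[w],
    )

acoustic = {"yes": 0.6, "no": 0.4}
visual = {"yes": 0.3, "no": 0.9}
print(fuse(acoustic, visual))  # the strong visual evidence outweighs the audio
```

In this toy run the acoustic channel slightly favors "yes" but the much stronger visual evidence for "no" wins, which is the accuracy benefit the abstract attributes to combining the two modalities.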
Abstract
The apparatus includes circuitry for obtaining a video image of an individual's face, circuitry for electronically locating and tracking a first feature, such as the nostrils, of the facial image for use as reference coordinates and circuitry responsive to the reference coordinates for locating and tracking a second facial feature, such as the mouth, of the facial image with respect to the first feature. By tracking the location of the nostrils, the apparatus can follow the movement of the mouth, and thereby automatically recognize speech. In a preferred embodiment, the video image is grayscale encoded and the raster lines are smoothed to eliminate noise. The transitions between gray levels of the smoothed image are encoded and the resulting transition code is used to form a contour map of the image from which region parameters are computed which can be compared against stored speech templates to recognize speech. In the preferred embodiment, acoustic speech recognition is combined with visual speech recognition to improve accuracy.
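The first stages of the video pipeline described above can be sketched as follows, assuming an 8-bit grayscale raster line: smooth the line, quantize it to a few gray levels, and record the positions where the level changes (the "transition code"). The function names, the 3-sample smoother, and the 4-level quantization are illustrative assumptions, not values taken from the patent:

```python
def smooth_line(line):
    """3-sample moving average to suppress raster noise (illustrative)."""
    out = list(line)
    for i in range(1, len(line) - 1):
        out[i] = (line[i - 1] + line[i] + line[i + 1]) // 3
    return out

def quantize(line, levels=4):
    """Map 0..255 pixel values onto a small number of gray levels."""
    step = 256 // levels
    return [min(p // step, levels - 1) for p in line]

def transitions(line):
    """Encode (column, new_level) wherever the gray level changes."""
    coded, prev = [], None
    for x, level in enumerate(line):
        if level != prev:
            coded.append((x, level))
            prev = level
    return coded

raster = [10, 12, 11, 200, 210, 205, 30, 28, 27]
print(transitions(quantize(smooth_line(raster))))
```

Stacking the transition codes of successive raster lines is what would let contours of constant gray level be traced across the image, from which the region parameters mentioned in the abstract could be computed.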
94 Claims
1. An apparatus for producing output indicating words spoken by a human speaker, said apparatus comprising:
- means for detecting sounds, converting said sounds to electrical signals, analyzing said signals to detect for words, and then producing an electrical acoustic output signal representing at least one spoken word;
- means for scanning said speaker's face and producing electrical image signals, each said signal representing an image in a sequence of video images of said speaker;
- means, responsive to each said image signal, for tracking said speaker's mouth by tracking said speaker's nostrils;
- means for analyzing portions of said image signals, said portions defined by said means for tracking, to produce a video output signal representing at least one visual manifestation of at least one spoken word;
- means for receiving and correlating said acoustic output signal and said video output signal to produce said output.

View Dependent Claims (2, 3, 4, 5, 6)
7. An apparatus for producing output indicating words spoken by a human speaker, said apparatus comprising:
- an acoustic speech recognizer in operative combination with a microphone, for detecting sounds, converting said sounds to electrical signals, analyzing said electrical signals to detect for words, and then producing an electrical acoustic output signal representing an audio manifestation of spoken speech;
- means for scanning said speaker's face and producing electrical image signals, each said signal representing an image in a sequence of video images of said speaker;
- means for analyzing said image signals to produce respective location signals in response to a detection of said speaker's nostrils;
- means, activatable in response to one of said location signals, for defining a portion of one of said image signals for mouth analysis, said portion representing a region on said speaker's face located at a position determined by said speaker's nostrils;
- means for analyzing portions of said video image signals, responsive to said means for defining, to produce a video output signal, said video output signal representing a visual manifestation of speech;
- means for receiving and correlating said acoustic output signal and said video output signal to produce said output.

View Dependent Claims (8, 9, 10)
11. An apparatus for electronically detecting a speaker's facial features, the apparatus comprising:
- means for producing electric signals corresponding to a sequence of video images of the speaker; and
- means, responsive to one of the signals corresponding to one image in the sequence, for finding the speaker's nostrils and then using the nostrils to define a region of the image for analyzing the speaker's mouth.

View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47)
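Claim 11's core idea, using the nostrils to define the mouth analysis region, can be sketched as below, assuming the two nostril centers have already been located in image coordinates. The nostril midpoint gives a reference position and the inter-nostril distance gives a scale; the specific offsets and multipliers are illustrative guesses, not values from the patent:

```python
# Hypothetical nostril-referenced mouth window: the nostril midpoint
# anchors the region and the nostril spacing sets its size.
def mouth_region(left_nostril, right_nostril):
    """Return (x0, y0, x1, y1) of a mouth analysis window."""
    lx, ly = left_nostril
    rx, ry = right_nostril
    cx = (lx + rx) / 2        # midpoint between the nostrils
    cy = (ly + ry) / 2
    spacing = rx - lx         # inter-nostril distance sets the scale
    half_w = 1.5 * spacing    # mouth is wider than the nostril span
    top = cy + 0.5 * spacing  # mouth lies below the nostrils
    return (cx - half_w, top, cx + half_w, top + 2 * spacing)

print(mouth_region((100, 80), (120, 80)))
```

Because the window is derived from the nostrils in every frame, it follows the mouth as the head moves, which is why the claims track the nostrils rather than the mouth directly.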
48. A method for producing output indicating words spoken by a human speaker, said method comprising steps of:
- detecting sounds, converting said sounds to electrical signals, analyzing said signals to detect for words, and then producing an electrical acoustic output signal representing at least one spoken word;
- scanning said speaker's face and producing electrical image signals, each said signal representing an image in a sequence of video images of said speaker;
- tracking said speaker's mouth by tracking said speaker's nostrils;
- analyzing portions of said image signals, said portions defined by said means for tracking, to produce a video output signal representing at least one visual manifestation of at least one spoken word;
- receiving and correlating said acoustic output signal and said video output signal to produce said output.

View Dependent Claims (49, 50, 51, 52, 53)
54. A method for producing output indicating words spoken by a human speaker, said method comprising steps of:
- detecting sounds by means of an acoustic speech recognizer in operative combination with a microphone, converting said sounds to electrical signals, analyzing said electrical signals to detect for words, and then producing an electrical acoustic output signal representing an audio manifestation of spoken speech;
- scanning said speaker's face and producing electrical image signals, each said signal representing an image in a sequence of video images of said speaker;
- analyzing said image signals to produce respective location signals in response to a detection of said speaker's nostrils;
- defining a portion of one of said image signals for mouth analysis in response to one of said location signals, said portion representing a region on said speaker's face located at a position determined by said speaker's nostrils;
- analyzing portions of said video image signals, responsive to said step of defining, to produce a video output signal, said video output signal representing a visual manifestation of speech;
- receiving and correlating said acoustic output signal and said video output signal to produce said output.

View Dependent Claims (55, 56, 57)
58. A method for electronically detecting a speaker's facial features, the method comprising steps of:
- producing electric signals corresponding to a sequence of video images of the speaker; and
- finding the speaker's nostrils in one image in the sequence and then using the nostrils to automatically define a region of the image for analyzing the speaker's mouth.

View Dependent Claims (59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94)
Specification