Audio visual speech processing
Abstract
Recognizing and enhancing speech is accomplished by fusing audio and visual speech recognition. An audio speech recognizer determines a subset of speech elements for speech segments received from at least one audio transducer. A visual speech recognizer determines a figure of merit for at least one speech element based on at least one image received from at least one visual transducer. Speech may also be enhanced by variably filtering or editing received audio signals based on at least one visual speech parameter.
51 Claims
1. A system for recognizing speech spoken by a speaker comprising:
at least one visual transducer with a view of the speaker;
at least one audio transducer receiving the spoken speech;
an audio speech recognizer in communication with the at least one audio transducer, the audio speech recognizer determining a subset of speech elements for at least one speech segment received from the at least one audio transducer, the subset including a plurality of speech elements more likely to represent the speech segment; and
a visual speech recognizer in communication with the at least one visual transducer and the audio speech recognizer, the visual speech recognizer operative to:
(a) receive at least one image from the at least one visual transducer corresponding to a particular speech segment;
(b) receive the subset of speech elements from the audio speech recognizer corresponding to the particular speech segment; and
(c) determine a figure of merit for at least one of the subset of speech elements based on the at least one received image.
13. A method for recognizing speech from a speaker, the method comprising:
receiving a sequence of audio speech segments from the speaker;
for each of at least one of the audio speech segments, determining a subset of possible speech elements most probably spoken by the speaker during the audio speech segment;
receiving at least one image of the speaker corresponding to the audio speech segment;
extracting at least one feature from the at least one image of the speaker; and
determining the most likely speech element from the subset of speech elements based on the at least one extracted feature.
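
The steps of claim 13 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `VISEME_OPENING` template table, the mouth-opening feature, the nearest-template rule, and the toy audio scores.

```python
# Hypothetical templates: expected mouth-opening ratio for each speech element.
VISEME_OPENING = {"b": 0.0, "d": 0.3, "g": 0.4, "a": 0.9}

def audio_subset(audio_scores, n=3):
    """Return the n speech elements with the highest audio scores."""
    return sorted(audio_scores, key=audio_scores.get, reverse=True)[:n]

def extract_feature(image):
    """Toy visual feature: fraction of 'open' pixels (1s) in a binary mouth mask."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

def pick_element(subset, feature):
    """Choose the candidate whose template is closest to the observed feature."""
    return min(subset, key=lambda e: abs(VISEME_OPENING[e] - feature))

# Audio alone slightly favors "d", but the image shows closed lips.
audio_scores = {"d": 0.36, "b": 0.33, "g": 0.28, "a": 0.03}
mouth_mask = [[0, 0, 0, 0], [0, 0, 0, 0]]

subset = audio_subset(audio_scores)
best = pick_element(subset, extract_feature(mouth_mask))
```

In this sketch the visual stage only ranks elements that the audio stage already shortlisted, so it never has to discriminate among the full set of speech elements.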
25. A system for enhancing speech spoken by a speaker comprising:
at least one visual transducer with a view of the speaker;
at least one audio transducer receiving the spoken speech;
a visual speech recognizer in communication with the at least one visual transducer, the visual speech recognizer estimating at least one visual speech parameter for each segment of speech; and
a variable filter filtering output from at least one of the audio transducers, the variable filter having at least one parameter value based on the at least one estimated visual speech parameter.
35. A method of enhancing speech from a speaker comprising:
receiving a sequence of images of the speaker for a speech segment;
determining at least one visual speech parameter for the speech segment based on the sequence of images;
receiving an audio signal corresponding to the speech segment; and
variably filtering the received audio signal based on the determined at least one visual speech parameter.
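
One way to picture the variable filtering of claim 35 is a one-pole low-pass filter whose coefficient is set from a visual speech parameter. This is a sketch under stated assumptions, not the filter described in the patent: the choice of mouth opening as the parameter, the linear mapping in `filter_coefficient`, and the filter structure are all hypothetical.

```python
def filter_coefficient(mouth_opening):
    """Map a mouth-opening estimate in [0, 1] to a smoothing coefficient.

    Assumed mapping: closed lips give heavy smoothing (0.1),
    fully open lips pass the signal through unchanged (1.0).
    """
    opening = min(max(mouth_opening, 0.0), 1.0)
    return 0.1 + 0.9 * opening

def variable_filter(samples, mouth_opening):
    """Apply y[n] = a*x[n] + (1 - a)*y[n-1] with a set by the visual parameter."""
    alpha = filter_coefficient(mouth_opening)
    out, prev = [], 0.0
    for x in samples:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out

segment = [1.0, 1.0, 1.0]
open_mouth = variable_filter(segment, mouth_opening=1.0)
closed_mouth = variable_filter(segment, mouth_opening=0.0)
```

With the mouth open the segment passes through unchanged, while with the mouth closed the same samples are strongly attenuated at the onset.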
41. A method of enhancing speech from a speaker comprising:
receiving a sequence of images of the speaker for a speech segment;
determining at least one visual speech parameter for the speech segment based on the sequence of images;
receiving an audio signal corresponding to the speech segment; and
editing the received audio signal based on the determined at least one visual speech parameter.
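
As a rough illustration of the editing step in claim 41, the sketch below silences audio segments during which the lips barely move. The interpretation of "editing" as muting, the lip-height measurements, and the `threshold` value are assumptions chosen for the example; the claim itself does not specify them.

```python
def is_speaking(lip_heights, threshold=0.05):
    """Hypothetical visual speech parameter: True if lip opening varies enough."""
    return max(lip_heights) - min(lip_heights) > threshold

def edit_audio(segments, lip_tracks, threshold=0.05):
    """Replace with silence every segment whose lip track shows no speech."""
    edited = []
    for samples, lips in zip(segments, lip_tracks):
        if is_speaking(lips, threshold):
            edited.append(list(samples))
        else:
            edited.append([0.0] * len(samples))
    return edited

segments = [[0.2, -0.1], [0.5, 0.4]]                  # audio samples per segment
lip_tracks = [[0.10, 0.10, 0.11], [0.10, 0.40, 0.20]]  # lip height per image
edited = edit_audio(segments, lip_tracks)
```

Here the first segment is treated as background noise and zeroed, while the second is kept as recorded.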
46. A method of detecting speech comprising:
using at least one visual cue about a speaker to filter an audio signal containing the speech;
determining a plurality of possible speech elements for each segment of the speech from the filtered audio signal; and
selecting among the plurality of possible speech elements based on the at least one visual cue.
Specification