METHOD AND APPARATUS FOR RECOGNIZING SPEECH BY LIP READING
Abstract
A dictation device includes: an audio input device configured to receive a voice utterance including a plurality of words; a video input device configured to receive video of lip motion during the voice utterance; a memory portion; a controller configured according to instructions in the memory portion to generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion; and a transceiver for sending the first data packets to a server end device and receiving second data packets including combined dictation based upon the audio stream and the video stream from the server end device. In the combined dictation, first dictation generated based upon the audio stream has been corrected by second dictation generated based upon the video stream.
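The correction step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the `Word` type, the `combine_dictation` function, the 0.5 threshold, and the use of a low recognizer confidence as the "predetermined characteristic" are all hypothetical, and the two dictations are assumed to be already aligned word for word.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # assumed recognizer score in [0, 1]


def combine_dictation(audio_words, video_words, threshold=0.5):
    """Keep audio-based words, except flagged ones, which come from the video-based dictation."""
    combined = []
    for audio_word, video_word in zip(audio_words, video_words):
        # Assumed "predetermined characteristic": audio confidence below the threshold.
        if audio_word.confidence < threshold:
            combined.append(video_word.text)
        else:
            combined.append(audio_word.text)
    return " ".join(combined)


# Hypothetical first (audio) and second (video) dictation for one utterance.
audio = [Word("send", 0.9), Word("the", 0.95), Word("male", 0.3)]
video = [Word("send", 0.6), Word("the", 0.7), Word("mail", 0.8)]
result = combine_dictation(audio, video)
print(result)
```

Here only the third audio word falls below the threshold, so it is the only word replaced by its video-based counterpart.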
19 Claims
1. A dictation device comprising:
an audio input device configured to receive a voice utterance including a plurality of words;
a video input device configured to receive video of lip motion during the voice utterance;
a memory portion;
a controller configured according to instructions in the memory portion to generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion; and
a transceiver for sending the first data packets to a remote apparatus and receiving second data packets including combined dictation based upon the audio stream and the video stream from the remote apparatus,
wherein in the combined dictation, at least one word in first dictation generated based upon the audio stream which has a predetermined characteristic has been corrected by second dictation generated based upon the video stream,
wherein the controller is further configured to permit a user to select
(i) enabling both of the audio input device and the video input device;
(ii) disabling only the audio input device; and
(iii) disabling only the video input device.
Dependent claims: 2, 3, 19
4. A server end device comprising:
a transceiver configured to receive first data packets from and send second data packets to a remote mobile station via a connection to a network, the received first data packets including an audio stream and a video stream associated with a voice utterance of a plurality of words;
a controller coupled to the interface;
an audio based speech recognition device coupled to the controller and configured to generate first dictation based upon the audio stream received from the remote mobile station;
a video based speech recognition device coupled to the controller and configured to generate second dictation based upon the video stream received from the remote mobile station; and
a memory including instructions for configuring the controller to generate a combined dictation based upon a comparison between the first dictation and the second dictation and include the combined dictation in the second data packets to be sent,
wherein the controller is further configured to:
determine if at least one of the words in the first dictation has a predetermined characteristic; and
generate the combined dictation based upon the second dictation for the at least one of the words having the predetermined characteristic and based upon the first dictation for the other of the plurality of words,
wherein the audio stream and the video stream are combined into an MPEG stream according to an MPEG format,
wherein synchronization data of the MPEG stream is used to determine a portion of the video stream that corresponds to the at least one of the words.
Dependent claims: 5, 6, 7, 8, 18
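One way to picture the last limitation of claim 4 is a timestamp lookup: given the time span of a flagged word in the audio stream, select the video frames whose timestamps fall inside that span. The sketch below is an assumption-laden illustration, not the claimed method: the function name `frames_for_word`, the millisecond timestamps, and the 25 frames-per-second rate are hypothetical, and the synchronization data is assumed to have already been decoded into a sorted list of per-frame timestamps.

```python
from bisect import bisect_left, bisect_right


def frames_for_word(frame_times_ms, word_start_ms, word_end_ms):
    """Return indices of video frames whose timestamps lie in [word_start_ms, word_end_ms].

    frame_times_ms must be sorted in ascending order.
    """
    first = bisect_left(frame_times_ms, word_start_ms)
    last = bisect_right(frame_times_ms, word_end_ms)
    return list(range(first, last))


# Hypothetical two-second clip at 25 frames per second (one frame every 40 ms).
frame_times_ms = [i * 40 for i in range(50)]

# Suppose the flagged word was spoken between 400 ms and 600 ms.
indices = frames_for_word(frame_times_ms, 400, 600)
print(indices)
```

Only the frames returned by the lookup would then need to be passed to the video based speech recognition device for that word.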
9. A dictation device for generating text based on a voice utterance and lip movement associated with the voice utterance, comprising:
an audio input device that receives an audio signal representing the voice utterance;
a video input device that receives a video signal representative of the lip movement;
a controller configured according to instructions stored in a memory, the controller configured to:
combine the audio signal and the video signal into an MPEG stream according to an MPEG format,
generate a first dictation based on the audio signal and assign a first conversion value based on a first set of conversion criteria;
generate a second dictation based on the video signal and assign a second conversion value based on a second set of conversion criteria; and
generate a variable text conversion value based on the first conversion value and the second conversion value and generate a third dictation based on the variable text conversion value;
wherein the controller is configured to prioritize either the first conversion value or the second conversion value by a predetermined setting assigned by a user, and
wherein the controller uses synchronization data of the MPEG stream to determine a portion of the video signal that corresponds to a portion of the audio signal when generating the third dictation.
Dependent claims: 10, 11, 12, 13, 14, 15, 16, 17
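The conversion-value logic of claim 9 might be pictured as a weighted comparison. The claim does not say how the values are computed or combined, so everything in this sketch is an assumption: the conversion values are treated as per-word numeric scores, the user's priority setting is modeled as a multiplicative weight, and the "variable text conversion value" is modeled as the difference of the weighted scores. The name `third_dictation` and the `prefer` and `weight` parameters are hypothetical.

```python
def third_dictation(first, second, prefer="audio", weight=1.5):
    """Combine two word-aligned dictations given as lists of (word, conversion_value) pairs.

    prefer: "audio" or "video", the user-assigned priority setting (assumed form).
    weight: factor applied to the prioritized conversion value (assumed form).
    """
    audio_weight = weight if prefer == "audio" else 1.0
    video_weight = weight if prefer == "video" else 1.0
    words = []
    for (audio_word, audio_value), (video_word, video_value) in zip(first, second):
        # Assumed "variable text conversion value": difference of the weighted values.
        combined_value = audio_weight * audio_value - video_weight * video_value
        words.append(audio_word if combined_value >= 0 else video_word)
    return " ".join(words)


# Hypothetical first (audio) and second (video) dictation with conversion values.
first = [("meet", 0.8), ("at", 0.9), ("fife", 0.5)]
second = [("meet", 0.7), ("at", 0.7), ("five", 0.6)]

print(third_dictation(first, second, prefer="audio"))
print(third_dictation(first, second, prefer="video"))
```

With these sample values the priority setting decides the last word: prioritizing audio keeps the audio-based word, while prioritizing video substitutes the video-based one.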
Specification