Method and apparatus for recognizing speech by lip reading
Abstract
A dictation device includes: an audio input device configured to receive a voice utterance including a plurality of words; a video input device configured to receive video of lip motion during the voice utterance; a memory portion; a controller configured according to instructions in the memory portion to generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion; and a transceiver for sending the first data packets to a server end device and receiving second data packets including combined dictation based upon the audio stream and the video stream from the server end device. In the combined dictation, first dictation generated based upon the audio stream has been corrected by second dictation generated based upon the video stream.
18 Claims
1. A dictation device comprising:
an audio input device configured to receive a voice utterance including a plurality of words;
a video input device configured to receive video of lip motion during the voice utterance;
a memory portion;
a controller configured according to instructions in the memory portion to generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion; and
a transceiver for sending the first data packets to a remote apparatus and receiving second data packets including combined dictation based upon the audio stream and the video stream from the remote apparatus,
wherein in the combined dictation, at least one word in first dictation generated based upon the audio stream which has a predetermined characteristic has been corrected by second dictation generated based upon the video stream,
wherein the controller is further configured to permit a user to select (i) enabling both of the audio input device and the video input device; (ii) disabling only the audio input device; and (iii) disabling only the video input device, and
wherein the first data packets further include global positioning system (GPS) data associated with the dictation device, and the dictation device is set to use the first dictation or the second dictation based on the GPS data.
(Dependent claims: 2, 3, 4)
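The correction step claimed above can be sketched as follows. The claim does not define the "predetermined characteristic"; this sketch assumes it is a low recognition-confidence score, and all names and the 0.5 threshold are illustrative assumptions rather than anything taken from the patent.

```python
# Sketch of the claimed correction: words in the audio-based (first) dictation
# that meet a "predetermined characteristic" -- assumed here to be a low
# recognition-confidence score -- are replaced by the word recognized from the
# lip-motion video (second dictation) at the same position.

CONFIDENCE_THRESHOLD = 0.5  # assumed form of the "predetermined characteristic"

def combine_dictation(audio_words, video_words):
    """audio_words / video_words: lists of (word, confidence) pairs,
    aligned one-to-one by word position."""
    combined = []
    for (a_word, a_conf), (v_word, _v_conf) in zip(audio_words, video_words):
        if a_conf < CONFIDENCE_THRESHOLD:   # word flagged for correction
            combined.append(v_word)         # use the video-based dictation
        else:
            combined.append(a_word)         # keep the audio-based dictation
    return " ".join(combined)

audio = [("recognize", 0.9), ("beach", 0.3), ("now", 0.8)]
video = [("recognize", 0.7), ("speech", 0.6), ("now", 0.6)]
print(combine_dictation(audio, video))  # → "recognize speech now"
```

Only the flagged word is replaced; the remaining words come from the audio-based dictation, matching the claim's per-word correction rather than a wholesale substitution of one transcript for the other.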
5. A server end device comprising:
a transceiver configured to receive first data packets from and send second data packets to a remote mobile station via a connection to a network, the received first data packets including an audio stream and a video stream associated with a voice utterance of a plurality of words;
a controller coupled to the transceiver;
an audio based speech recognition device coupled to the controller and configured to generate first dictation based upon the audio stream received from the remote mobile station;
a video based speech recognition device coupled to the controller and configured to generate second dictation based upon the video stream received from the remote mobile station; and
a memory including instructions for configuring the controller to generate a combined dictation based upon a comparison between the first dictation and the second dictation and include the combined dictation in the second data packets to be sent,
wherein the controller is further configured to: determine if at least one of the words in the first dictation has a predetermined characteristic; and generate the combined dictation based upon the second dictation for the at least one of the words having the predetermined characteristic and based upon the first dictation for the other of the plurality of words,
wherein the audio stream and the video stream are combined into a Moving Picture Experts Group (MPEG) stream according to an MPEG format,
wherein synchronization data of the MPEG stream is used to determine a portion of the video stream that corresponds to the at least one of the words having the predetermined characteristic, and
wherein the first data packets include global positioning system (GPS) data associated with the remote mobile station, and the server end device is set to use the first dictation or the second dictation based on the GPS data.
(Dependent claims: 6, 7, 8, 9, 10)
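The synchronization step in claim 5 can be sketched as below. The claim only says that MPEG synchronization data locates the video portion for a flagged word; this sketch models that data as a shared timeline, where the audio recognizer reports start/end timestamps per word and the video is a frame sequence at a known rate. The function names and the 30 fps rate are assumptions for illustration.

```python
# Sketch: use a shared timeline (standing in for MPEG synchronization data,
# e.g. presentation timestamps) to find the video frames that correspond to
# a word flagged in the audio-based dictation.

FRAME_RATE = 30.0  # frames per second; assumed, not from the patent

def video_span_for_word(word_start_s, word_end_s, frame_rate=FRAME_RATE):
    """Map a word's start/end timestamps (seconds) to the (first_frame,
    last_frame) indices of the video portion covering that word."""
    first = round(word_start_s * frame_rate)
    last = round(word_end_s * frame_rate)
    return first, last

# A word flagged by the audio recognizer between t=1.2 s and t=1.8 s:
print(video_span_for_word(1.2, 1.8))  # → (36, 54)
```

The server would then run the video-based recognizer only on that frame span, which is why the claim ties the synchronization data specifically to the words having the predetermined characteristic.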
11. A dictation device for generating text based on a voice utterance and lip movement associated with the voice utterance, comprising:
an audio input device that receives an audio signal representing the voice utterance;
a video input device that receives a video signal representative of the lip movement;
a controller configured according to instructions stored in a memory, the controller configured to:
combine the audio signal and the video signal into a Moving Picture Experts Group (MPEG) stream according to an MPEG format;
generate a first dictation based on the audio signal and assign a first conversion value based on a first set of conversion criteria;
generate a second dictation based on the video signal and assign a second conversion value based on a second set of conversion criteria; and
generate a variable text conversion value based on the first conversion value and the second conversion value and generate a third dictation based on the variable text conversion value;
wherein the controller is configured to prioritize either the first conversion value or the second conversion value by a predetermined setting assigned by a user,
wherein the controller uses synchronization data of the MPEG stream to determine a portion of the video signal that corresponds to a portion of the audio signal when generating the third dictation, and
wherein the controller is configured to determine global positioning system (GPS) data associated with the dictation device, and the dictation device is set to use the first dictation or the second dictation based on the GPS data.
(Dependent claims: 12, 13, 14, 15, 16, 17, 18)
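The "variable text conversion value" of claim 11 can be sketched as a per-word comparison of the two conversion values after applying the user's priority setting. The claim does not specify the weighting scheme; the multiplicative weighting, the function names, and the example values below are all illustrative assumptions.

```python
# Sketch: the first (audio) and second (video) conversion values are weighted
# by a user-assigned priority setting, and the third dictation takes, per
# word, whichever source wins the weighted comparison.

def third_dictation(audio_words, video_words,
                    audio_priority=1.0, video_priority=1.0):
    """Each input is a list of (word, conversion_value) pairs aligned by
    position. A priority > 1.0 favors that source -- standing in for the
    claim's "predetermined setting assigned by a user"."""
    out = []
    for (a_word, a_val), (v_word, v_val) in zip(audio_words, video_words):
        # The weighted values being compared play the role of the
        # "variable text conversion value".
        if a_val * audio_priority >= v_val * video_priority:
            out.append(a_word)
        else:
            out.append(v_word)
    return " ".join(out)

audio = [("wreck", 0.4), ("a", 0.9), ("nice", 0.5), ("beach", 0.45)]
video = [("recognize", 0.6), ("a", 0.5), ("nice", 0.7), ("speech", 0.8)]
# User setting that prioritizes the video-based conversion values:
print(third_dictation(audio, video, video_priority=1.5))
# → "recognize a nice speech"
```

Unlike claim 1, where the video dictation only corrects flagged words, here both sources compete on every word, with the user's priority setting tilting the outcome.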
Specification