Correlating video images of lip movements with audio signals to improve speech recognition

US 7,587,318 B2
Filed: 09/12/2003
Issued: 09/08/2009
Est. Priority Date: 09/12/2002
Status: Expired due to Fees

Abstract

A speech recognition device can include an audio signal receiver configured to receive audio signals from a speech source, a video signal receiver configured to receive video signals from the speech source, and a processing unit configured to process the audio signals and the video signals. In addition, the speech recognition device can include a conversion unit configured to convert the audio signals and the video signals to recognizable speech, and an implementation unit configured to implement a task based on the recognizable speech.

40 Claims

1. A method of speech recognition, comprising:
- determining if video images of a speech source are detected;
  
  indicating if the video images are not detected;
  
  receiving audio signals from the speech source;
  
  receiving video signals from the speech source;
  
  detecting if the audio signals can be processed;
  
  processing the audio signals if it is detected that the audio signals can be processed;
  
  processing the video signals based on a detection that at least a portion of the audio signal cannot be processed;
  
  converting at least one of the audio signals and the video signals into recognizable information; and
  
  implementing a task based on the recognizable information.
- - 2. The method of claim 1, wherein receiving the video signals comprises:
    - receiving video images of lip movements that coincide with the audio signals.
  - 3. The method of claim 1, wherein the audio signals and the video signals are processed in parallel, the video signals coinciding with the audio signals.
  - 4. The method of claim 1, comprising:
    - storing the audio signals and the video signals; and
      
      sending the audio signals and the video signals to a destination source.
  - 5. The method of claim 1, wherein at least the receiving of the audio signals and the receiving of the video signals occurs in a mobile phone.
  - 6. The method of claim 1, wherein at least the receiving of the audio signals and the receiving of the video signals occurs in a laptop computer, a home computer, a remote controller and/or a game console.
  - 7. The method of claim 1, wherein the method occurs in a mobile phone.
  - 8. The method of claim 7, wherein the mobile phone comprises a lens and a display.
  - 9. The method of claim 7, wherein the method is part of a voice activated e-mail application.
  - 10. The method of claim 7, wherein the recognizable information comprises one or more numeric characters.
  - 11. The method of claim 7, wherein the recognizable information comprises code that is used to perform a particular function.
  - 12. The method of claim 7, wherein the recognizable information comprises at least one of text and one or more executable commands.
  - 13. The method of claim 1, wherein the method occurs in a laptop computer, a home computer, a PDA, an audio/video recording device, a remote controller and/or a game console.
  - 14. The method of claim 1, wherein the recognizable information comprises at least one of text and one or more executable commands.
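
The control flow of claim 1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the `Segment` type, the `recognize` function, and the `notify` and `run_task` callbacks are hypothetical names, and each recognizer's output is modeled as an optional string, where `None` stands for a portion that cannot be processed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Segment:
    """One time-aligned slice of input (hypothetical structure)."""
    audio_text: Optional[str]  # None models audio that cannot be processed
    lip_text: Optional[str]    # None models a slice with no detected video image


def recognize(
    segments: List[Segment],
    notify: Callable[[str], None],
    run_task: Callable[[str], None],
) -> str:
    # Determine whether any video image of the speech source is detected,
    # and indicate to the caller when none is.
    if not any(seg.lip_text is not None for seg in segments):
        notify("video image not detected")

    words: List[str] = []
    for seg in segments:
        if seg.audio_text is not None:
            # The audio for this portion can be processed: use it.
            words.append(seg.audio_text)
        elif seg.lip_text is not None:
            # This portion of the audio cannot be processed: use the video.
            words.append(seg.lip_text)

    # Convert to recognizable information and implement a task based on it.
    info = " ".join(words)
    if info:
        run_task(info)
    return info
```

Calling `recognize` with one segment whose audio is usable and one whose audio is `None` returns text assembled from both sources, with the second word taken from the lip-reading result.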

15. A speech recognition device, comprising:
- an audio signal receiver configured to receive audio signals from a speech source;
  
  a video signal receiver configured to receive video signals from the speech source;
  
  a processing unit configured to detect if the audio signals can be processed and if so, to process the audio signals and process the video signals based on the detection that at least a portion of the audio signals cannot be processed;
  
  a conversion unit configured to convert at least one of the audio signals and the video signals to recognizable information; and
  
  an implementation unit configured to implement a task based on the recognizable information, wherein the processing unit is configured to determine if the video image of a user is detected and, if the video image of the user is not detected, to indicate to the user that the video image is not detected.
- - 16. The speech recognition device of claim 15, wherein the video signal receiver is configured to receive video images of lip movements that coincide with the audio signals.
  - 17. The speech recognition device of claim 15, wherein the processing unit is configured to process the audio signals and the video signals in parallel, and wherein the video signals coincide with the audio signals.
  - 18. The speech recognition device of claim 15, comprises:
    - a storage unit for storing the audio signals and the video signals; and
      
      a transmitter for sending the audio signals and the video signals to a destination source.
  - 19. The speech recognition device of claim 15, wherein the speech recognition device is part of a mobile phone.
  - 20. The speech recognition device of claim 19, wherein the mobile phone comprises a lens and a display.
  - 21. The speech recognition device of claim 19, wherein the speech recognition device is used with a voice activated e-mail application.
  - 22. The speech recognition device of claim 19, wherein the recognizable information comprises one or more numeric characters.
  - 23. The speech recognition device of claim 19, wherein the recognizable information comprises code that is used to perform a particular function.
  - 24. The speech recognition device of claim 19, wherein at least the processing unit, the conversion unit and the implementation unit are integrated on a single chip.
  - 25. The speech recognition device of claim 19, wherein the recognizable information comprises at least one of text and one or more executable commands.
  - 26. The speech recognition device of claim 15, wherein the speech recognition device is part of a laptop computer, a home computer, a PDA, an audio/video recording device, a remote controller and/or a game console.
  - 27. The speech recognition device of claim 15, wherein the recognizable information comprises at least one of text and one or more executable commands.
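
Claim 17 has the processing unit handle coinciding audio and video signals in parallel. A minimal sketch of that arrangement, assuming both streams arrive as equal-length, time-aligned lists, might look like the following. The functions `process_audio`, `process_video`, and `process_in_parallel` are hypothetical stand-ins (the two recognizers here merely lowercase their input), and `None` marks a portion that could not be processed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def process_audio(samples: List[Optional[str]]) -> List[Optional[str]]:
    # Stand-in for an acoustic recognizer; None marks an unprocessable portion.
    return [s.lower() if s is not None else None for s in samples]


def process_video(frames: List[Optional[str]]) -> List[Optional[str]]:
    # Stand-in for a lip-movement recognizer.
    return [f.lower() if f is not None else None for f in frames]


def process_in_parallel(
    samples: List[Optional[str]],
    frames: List[Optional[str]],
) -> List[Optional[str]]:
    if len(samples) != len(frames):
        raise ValueError("audio and video must be time-aligned")

    # Run both recognizers concurrently on the coinciding streams.
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(process_audio, samples)
        video_future = pool.submit(process_video, frames)
        audio = audio_future.result()
        video = video_future.result()

    # Prefer the audio result; use the video result where audio is missing.
    return [a if a is not None else v for a, v in zip(audio, video)]
```

Threads are used here only to keep the example short; a real device might instead use dedicated hardware paths, which the claims do not specify.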

28. A system for speech recognition, comprising:
- a first receiver that receives audio signals from a speech source;
  
  a second receiver that receives video signals from the speech source;
  
  a processor that detects if the audio signals can be processed and that processes the audio signals if the audio signals can be processed, the processor processing the video signals based on the detection that at least a portion of the audio signals can not be processed;
  
  a converter that converts at least one of the audio signals and the video signals to recognizable information; and
  
  an implementor that implements a task based on the recognizable information, wherein the processor determines if the video image of a user is detected and, if the user's video image is not detected, indicates to the user that the video image is not detected.
- - 29. The system of claim 28, wherein the second receiver receives video images of lip movements that coincide with the audio signals.
  - 30. The system of claim 28, wherein the processor processes the audio signals and the video signals in parallel, and wherein the video signals coincide with the audio signals.
  - 31. The system of claim 28, comprises:
    - a storage device that stores the audio signals and the video signals; and
      
      a transmitter that transmits the audio signals and the video signals to a destination source.
  - 32. The system of claim 28, wherein the system for speech recognition is part of a mobile phone.
  - 33. The system of claim 32, wherein the mobile phone comprises a lens and a display.
  - 34. The system of claim 32, wherein the system for speech recognition is used with a voice activated e-mail application.
  - 35. The system of claim 32, wherein the recognizable information comprises one or more numeric characters.
  - 36. The system of claim 32, wherein the recognizable information comprises code that is used to perform a particular function.
  - 37. The system of claim 32, wherein at least the processor, the converter and the implementor are integrated on a single chip.
  - 38. The system of claim 32, wherein the recognizable information comprises at least one of text and one or more executable commands.
  - 39. The system of claim 28, wherein the system for speech recognition is part of a laptop computer, a home computer, a remote controller and/or a game console.
  - 40. The system of claim 28, wherein the recognizable information comprises at least one of text and one or more executable commands.
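
Several dependent claims (for example 35, 36, and 38) describe the recognizable information as numeric characters, text, or executable commands. One way to picture the "implementing a task" step is a small dispatcher like the sketch below. The `interpret` function, the `DIGITS` table, and the `commands` mapping are all hypothetical and are not taken from the patent.

```python
from typing import Callable, Dict, Tuple

# Spoken digit words mapped to numeric characters (illustrative only).
DIGITS: Dict[str, str] = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def interpret(info: str, commands: Dict[str, Callable[[], str]]) -> Tuple[str, str]:
    """Classify recognized text as numeric characters, a command, or plain text."""
    words = info.lower().split()

    # All words are digit names: treat the result as numeric characters.
    if words and all(w in DIGITS for w in words):
        return ("numeric", "".join(DIGITS[w] for w in words))

    # The phrase matches a registered command: execute it.
    key = " ".join(words)
    if key in commands:
        return ("command", commands[key]())

    # Otherwise pass the recognized text through unchanged.
    return ("text", info)
```

A phone application could register a handler under a phrase such as `"open mail"` and fall back to plain text for dictation.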

Current Assignee: Avago Technologies International Sales Pte Limited (Broadcom, Inc.)
Original Assignee: Broadcom Corporation (Broadcom, Inc.)
Inventors: Seshadri, Nambi
Primary Examiner(s): Lerner, Martin
Application Number: US10/660,780
Publication Number: US 20040117191A1
Time in Patent Office: 2,188 Days
Field of Search: 704/231, 704/236, 704/246, 704/247, 704/251, 704/252, 704/270, 704/271, 704/273, 382/116
US Class Current: 704/231
CPC Class Codes: G10L 15/25 using position of the lips,...
