Video control of speech recognition
First Claim
1. A method of controlling the operation of a speech recognition unit, comprising automatically analyzing at least one video image to detect a gesture of a user that signifies a command, and supplying the command to the speech recognition unit to control operation of the speech recognition unit.
Abstract
A method and apparatus for using video input to control speech recognition systems are disclosed. In one embodiment, gestures of a user of a speech recognition system are detected from a video input and are used to turn a speech recognition unit on and off. In another embodiment, the position of a user is detected from a video input, and the position information is supplied to a microphone array point-of-source filter to aid the filter in selecting the voice of a user who is moving about in the field of the camera supplying the video input.
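The first embodiment can be illustrated with a minimal sketch, which is not taken from the patent. It assumes the video analysis step has already labeled each frame with a gesture name. The gesture labels (`raise_hand`, `lower_hand`), the `Command` enum, and the `SpeechRecognitionUnit` stand-in are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class Command(Enum):
    START = "start"
    STOP = "stop"


# Hypothetical gesture labels; the source does not name specific gestures.
GESTURE_TO_COMMAND = {
    "raise_hand": Command.START,
    "lower_hand": Command.STOP,
}


@dataclass
class SpeechRecognitionUnit:
    """Stand-in for a real recognizer; it only tracks whether it is listening."""
    listening: bool = False

    def apply(self, command: Command) -> None:
        self.listening = command is Command.START


def detect_gesture(frame: dict) -> Optional[str]:
    """Placeholder for video analysis: each frame is a dict that may carry a label."""
    return frame.get("gesture")


def control_from_video(frames: Iterable[dict], unit: SpeechRecognitionUnit) -> List[bool]:
    """Analyze each frame, map a detected gesture to a command, and supply it to the unit.

    Returns the unit's listening state after each frame.
    """
    states = []
    for frame in frames:
        command = GESTURE_TO_COMMAND.get(detect_gesture(frame))
        if command is not None:  # unknown or absent gestures leave the unit unchanged
            unit.apply(command)
        states.append(unit.listening)
    return states
```

In a real system `detect_gesture` would be replaced by an image analysis routine. The sketch only shows the control path from a detected gesture to an on/off command.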
323 Citations
21 Claims
- 1. A method of controlling the operation of a speech recognition unit, comprising automatically analyzing at least one video image to detect a gesture of a user that signifies a command, and supplying the command to the speech recognition unit to control operation of the speech recognition unit.
- 6. A method comprising filtering with a filter an audio input signal and supplying it to a speech recognition unit, wherein the position of a user supplying speech to be recognized is automatically determined by a computer analysis of at least one video image obtained from one or more cameras having a field of view encompassing the user, and position information obtained from the analysis is used by the filter to at least in part isolate the user's voice from other sounds in the user's environment.
- 9. A method, comprising:
filtering an audio input signal and supplying it to a speech recognition unit, wherein the position of a user supplying speech to be recognized is automatically determined by a computer analysis of a video image obtained at least in part from at least one camera having a field of view encompassing the user, and position information obtained from the analysis is used by a filter to at least in part isolate the user's voice from other sounds in the user's environment; and
controlling the operation of the speech recognition unit, comprising analyzing one or more video images to detect a gesture of a user that signifies a command, and supplying that command to the speech recognition unit.
Dependent claims: 10, 11, 12, 13, 14, 15
- 16. Apparatus for controlling the operation of a speech recognition unit in response to a gesture by a user, comprising
a unit receiving at least one video image of the user, automatically analyzing the video image to detect a gesture of the user that signifies a command, and outputting the command to the speech recognition unit to control operation of the speech unit.
- 17. Apparatus comprising a filtering unit that receives an audio input signal and position information about a position of a user supplying speech as the source of the audio input signal, wherein the position of the user is automatically determined by analysis of at least one video image of the user, wherein the filter unit further outputs a filtered audio signal based on the position information to a speech recognition unit.
- 18. Apparatus comprising:
a speech recognition unit;
a unit that receives at least one video image, automatically analyzes the video image to detect a gesture of a user that signifies a command, outputs the command to the speech recognition unit, and outputs position information about the position of the user in the image that signifies that a user has made the gesture that signifies the command; and
a filtering unit that receives an audio input signal and the position information about the position of a user supplying speech as the source of the audio input signal, the filter unit further supplies a filtered audio signal to the speech recognition unit, wherein the filtered audio signal produced by the filtering unit depends on the position information.
Dependent claims: 19
a video signal analyzing unit that receives a video signal from a camera having a user in its field of view and outputs position information indicating the position of the user to the filtering unit.
- 20. An article comprising
a computer program in a machine readable medium wherein the computer program will execute on a suitable platform to control the operation of a speech recognition unit and is operative to automatically analyze at least one video image to detect a gesture of a user that signifies a command, and supply the command to the speech recognition unit.
- 21. An article comprising
a computer program embodied in a machine readable medium wherein the computer program executes on a suitable platform to analyze at least one video image obtained from one or more cameras having a field of view encompassing a user, and automatically determines information specifying the location of the user in the field of view, and supplies the position information to a filter unit which at least in part isolates the user's voice from other sounds in the user's environment in response to the position information.
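The position-based filtering recited in claims 6, 17, and 21 can be illustrated with a delay-and-sum beamformer steered by the user's horizontal position in the image. This is a sketch under stated assumptions, not the filter described in the patent. It assumes a pinhole camera co-located with a linear microphone array, a far-field source, and a sign convention in which positive bearings point toward microphones with positive `mic_x`. All function names and parameters are hypothetical.

```python
import math
from typing import List, Sequence

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def bearing_from_pixel(x_pixel: float, image_width: int, hfov_deg: float) -> float:
    """Bearing of the user (radians from the camera axis), assuming a pinhole camera."""
    focal_px = (image_width / 2) / math.tan(math.radians(hfov_deg) / 2)
    return math.atan((x_pixel - image_width / 2) / focal_px)


def steering_shifts(mic_x: Sequence[float], bearing: float, sample_rate: int) -> List[int]:
    """Per-microphone delays, in whole samples, that align a far-field source at `bearing`.

    mic_x holds microphone positions in meters along the array axis (assumed layout).
    A microphone nearer the source hears it earlier, so it receives a larger delay.
    """
    return [round(x * math.sin(bearing) * sample_rate / SPEED_OF_SOUND) for x in mic_x]


def delay_and_sum(channels: Sequence[Sequence[float]], shifts: Sequence[int]) -> List[float]:
    """Delay each channel by its shift and average; samples outside the buffer count as 0."""
    n = len(channels[0])
    out = []
    for t in range(n):
        total = 0.0
        for signal, shift in zip(channels, shifts):
            idx = t - shift
            if 0 <= idx < n:
                total += signal[idx]
        out.append(total / len(channels))
    return out
```

Sound arriving from the steered direction adds coherently, while sound from other directions is attenuated by the averaging. As the user moves, updating the bearing from each new video frame re-steers the filter.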
Specification