Speech recognition analysis via identification information
First Claim
1. In a computing system comprising a microphone array and an image sensor, a method of operating a speech recognition input system, the method comprising:
receiving speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;
receiving acoustic locational data as an output from a digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array;
receiving image data comprising visual locational information related to a location of each person located in a field of view of the image sensor;
comparing the acoustic locational data to the visual locational information to determine whether the recognized speech segment originated from a person in the field of view of the image sensor; and
adjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the image sensor.
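The comparison and adjustment steps above can be sketched in code. This is a minimal illustration only: the claim does not specify data representations, so the angle-based location comparison, the tolerance, and the penalty factor are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class RecognizedSpeech:
    """Hypothetical container for the claimed speech recognition data."""
    text: str
    confidence: float  # recognition confidence value in [0, 1]

def adjust_confidence(speech, acoustic_angle_deg, person_angles_deg,
                      tolerance_deg=15.0, penalty=0.5):
    """Compare the acoustic direction of origin with the directions of
    persons seen by the image sensor; if no person lies within
    tolerance_deg of the speech direction, lower the confidence.
    The tolerance and penalty values are illustrative, not from the patent."""
    in_view = any(abs(acoustic_angle_deg - a) <= tolerance_deg
                  for a in person_angles_deg)
    if not in_view:
        speech.confidence *= penalty
    return speech
```

For example, a segment localized at 30° with a tracked person at 28° keeps its confidence, while the same segment with the only person at 80° is penalized, reducing false positives from off-camera sound sources.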
2 Assignments
0 Petitions
Abstract
Embodiments are disclosed that relate to the use of identity information to help avoid false positive speech recognition events in a speech recognition system. One embodiment provides a method comprising receiving speech recognition data comprising a recognized speech segment, acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from a microphone array, and confidence data comprising a recognition confidence value, and also receiving image data comprising visual locational information related to a location of each person in an image. The acoustic locational data is compared to the visual locational information to determine whether the recognized speech segment originated from a person in the field of view of the image sensor, and the confidence data is adjusted depending on this determination.
265 Citations
20 Claims
1. In a computing system comprising a microphone array and an image sensor, a method of operating a speech recognition input system, the method comprising:

receiving speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;
receiving acoustic locational data as an output from a digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array;
receiving image data comprising visual locational information related to a location of each person located in a field of view of the image sensor;
comparing the acoustic locational data to the visual locational information to determine whether the recognized speech segment originated from a person in the field of view of the image sensor; and
adjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the image sensor.

- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
9. An interactive entertainment system, comprising:

a depth-sensing camera;
a microphone array comprising a plurality of microphones; and
a computing device comprising a processor and memory comprising instructions stored thereon that are executable by the processor to:
receive speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;
receive acoustic locational data as an output from a digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array;
receive image data comprising visual locational information related to a location of each person located in a field of view of the depth-sensing camera;
compare the acoustic locational data to the visual locational information to determine whether the recognized speech segment originated from a person in the field of view of the depth-sensing camera; and
adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the depth-sensing camera.

- View Dependent Claims (10, 11, 12, 13, 14, 15)
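The acoustic locational data recited above is typically derived from inter-microphone time delays. As a hedged sketch (not from the patent), a two-element far-field model estimates the angle of arrival from the delay between microphones; the speed of sound and array spacing here are assumed example values.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def angle_of_arrival(delay_s, mic_spacing_m):
    """Estimate the direction of a sound source from the time delay
    between two microphones, assuming a far-field plane wave.
    Returns the angle in degrees from the array broadside."""
    x = SPEED_OF_SOUND * delay_s / mic_spacing_m
    x = max(-1.0, min(1.0, x))  # clamp numerical noise outside [-1, 1]
    return math.degrees(math.asin(x))
```

A zero delay maps to a source directly in front of the array; larger delays map to larger off-axis angles, which can then be compared against person locations from the depth image.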
16. A hardware computer-readable storage device comprising instructions stored thereon that are executable by a computing device to:
receive speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition stage being configured to compare inputs received from a digital audio processing stage of the audio processing pipeline to a plurality of recognized speech patterns to recognize speech inputs, and the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;
receive acoustic locational data as an output from the digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from a microphone array;
receive image data comprising visual locational information related to a location of each person located in a field of view of a depth-sensing camera;
compare the acoustic locational data to the visual locational information to determine whether the recognized speech segment originated from a person in the field of view of the depth-sensing camera;
adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the depth-sensing camera;
if it is determined that the recognized speech segment originated from a person in the field of view of the depth-sensing camera, then determine whether a face of the person is facing the depth-sensing camera; and
adjust the confidence data such that the recognition confidence value has a lower value after adjusting if the face of the person is not facing the depth-sensing camera than if the face of the person is facing the depth-sensing camera.

- View Dependent Claims (17, 18, 19, 20)
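The face-direction adjustment in claim 16 can be sketched as a simple confidence scaling. The weights below are illustrative assumptions; the claim only requires that the confidence end up lower when the speaker is not facing the camera than when they are.

```python
def adjust_for_face_direction(confidence, in_field_of_view, facing_camera,
                              facing_weight=1.0, averted_weight=0.6):
    """Scale a recognition confidence value based on face direction.
    If the speaker is in view but looking away from the camera, apply
    the lower weight; the specific weight values are hypothetical."""
    if not in_field_of_view:
        return confidence  # face-direction test only applies to in-view speakers
    return confidence * (facing_weight if facing_camera else averted_weight)
```

With these example weights, an utterance recognized at confidence 0.8 from an in-view speaker looking away would be reported at 0.48, making it less likely to trigger a false positive from incidental conversation.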
Specification