SPEECH RECOGNITION ANALYSIS VIA IDENTIFICATION INFORMATION

US 20110184735A1
Filed: 01/22/2010
Published: 07/28/2011
Est. Priority Date: 01/22/2010
Status: Active Grant

Abstract

Embodiments are disclosed that relate to the use of identity information to help avoid false positive speech recognition events in a speech recognition system. One embodiment provides a method comprising receiving speech recognition data comprising a recognized speech segment, acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from a microphone array, and confidence data comprising a recognition confidence value, and also receiving image data comprising visual locational data related to a location of each person in a field of view of an image sensor. The acoustic locational data is compared to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of the image sensor, and the confidence data is adjusted depending on this determination.
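
As an illustration only, the comparison described in the abstract might be sketched as follows. The text here does not specify a coordinate representation, a matching tolerance, or an adjustment rule, so the azimuth angles, the 10-degree tolerance, the 0.5 penalty factor, and all names below are assumptions made for this sketch.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class SpeechResult:
    text: str                  # recognized speech segment
    source_azimuth_deg: float  # acoustic locational data (assumed: horizontal angle)
    confidence: float          # recognition confidence value in [0, 1]


def originated_from_visible_person(
    source_azimuth_deg: float,
    person_azimuths_deg: List[float],
    tolerance_deg: float = 10.0,  # assumed matching tolerance
) -> bool:
    """True if any person seen by the image sensor lies near the sound source."""
    return any(
        abs(source_azimuth_deg - person) <= tolerance_deg
        for person in person_azimuths_deg
    )


def adjust_confidence(
    result: SpeechResult,
    person_azimuths_deg: List[float],
    penalty: float = 0.5,  # assumed adjustment: scale the confidence down
) -> SpeechResult:
    """Lower the confidence when no visible person matches the sound source."""
    if originated_from_visible_person(result.source_azimuth_deg, person_azimuths_deg):
        return result
    return replace(result, confidence=result.confidence * penalty)


result = SpeechResult("pause", source_azimuth_deg=30.0, confidence=0.8)
matched = adjust_confidence(result, [25.0, -40.0])   # a person stands near 30 degrees
unmatched = adjust_confidence(result, [-40.0])       # nobody near the sound source
```

In this sketch a segment with no matching person is kept with a reduced confidence rather than discarded, which leaves the final accept-or-reject decision to whatever consumes the confidence data.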

174 Citations

20 Claims

1. In a computing system comprising a microphone array and an image sensor, a method of operating a speech recognition input system, the method comprising:
- receiving speech recognition data comprising a recognized speech segment, acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array, and confidence data comprising a recognition confidence value;
- receiving image data comprising visual locational information related to a location of each person located in a field of view of the image sensor;
- comparing the acoustic locational data to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of the image sensor; and
- adjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the image sensor.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 17, 18):
  - 2. The method of claim 1, wherein adjusting the confidence data comprises lowering the recognition confidence value.
  - 3. The method of claim 1, wherein adjusting the confidence data comprises determining an intended input confidence value configured to communicate a level of confidence in whether the recognized speech segment came from an active user.
  - 4. The method of claim 1, further comprising adjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a recognized speaker.
  - 5. The method of claim 1, wherein, if the recognized speech segment is determined not to have originated from a recognized speaker and is determined not to have originated from a person in the field of view of the image sensor, then adjusting the confidence data comprises rejecting the recognized speech segment.
  - 6. The method of claim 1, wherein, if it is determined that the recognized speech segment originated from a person in the field of view of the image sensor, then determining whether a face of the person is facing the image sensor, and adjusting the confidence data based upon whether the face of the person is facing the image sensor.
  - 7. The method of claim 1, further comprising receiving a speech input of a keyword before receiving the recognized speech segment, and wherein adjusting the confidence data comprises adjusting the confidence data based upon an amount of time that passed between receiving the speech input of the keyword and receiving the recognized speech segment.
  - 8. The method of claim 1, wherein the image sensor is a depth-sensing camera, and wherein receiving image data comprising visual locational information comprises receiving image data comprising information related to a distance of each person in the field of view of the depth-sensing camera.
  - 17. The computer-readable storage medium of claim 3, wherein the instructions are further executable to:
    - determine if the recognized speech segment originated from a recognized speaker; and
      
      adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a recognized speaker.
  - 18. The method of claim 17 wherein the instructions are executable to reject the recognized speech segment if the recognized speech segment is determined not to have originated from a recognized speaker and the recognized speech segment is determined not to have originated from a person in the field of view of the depth-sensing camera.

9. An interactive entertainment system, comprising:
- a depth-sensing camera;
- a microphone array comprising a plurality of microphones; and
- a computing device comprising a processor and memory comprising instructions stored thereon that are executable by the processor to:
  - receive speech recognition data comprising a recognized speech segment, acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array, and confidence data comprising a recognition confidence value;
  - receive image data comprising visual locational information related to a location of each person located in a field of view of the depth-sensing camera;
  - compare the acoustic locational data to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of the image sensor; and
  - adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the depth-sensing camera.
- Dependent claims (10, 11, 12, 13, 14, 15):
  - 10. The interactive entertainment system of claim 9, wherein the instructions are executable to adjust the confidence data by lowering the recognition confidence value.
  - 11. The interactive entertainment system of claim 9, wherein the instructions are executable to adjust the confidence data by determining and including an intended input confidence value configured to communicate a level of confidence in whether the recognized speech segment came from an active user.
  - 12. The interactive entertainment system of claim 9, wherein the instructions are further executable to:
    - determine if the recognized speech segment originated from a recognized speaker, and adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a recognized speaker.
  - 13. The interactive entertainment system of claim 12, wherein the instructions are further executable to reject the recognized speech segment if the recognized speech segment is determined not to have originated from a recognized speaker and the recognized speech segment is determined not to have originated from a person in the field of view of the depth-sensing camera.
  - 14. The interactive entertainment system of claim 9, wherein the instructions are further executable to:
    - determine that the recognized speech segment originated from a person in the field of view of the image sensor, determine whether a face of the person is facing the image sensor, and adjust the confidence data based upon whether the face of the person is facing the image sensor.
  - 15. The interactive entertainment device of claim 9, further comprising receiving a speech input of a keyword before receiving the recognized speech segment, and wherein adjusting the confidence data comprises adjusting the confidence data based upon an amount of time that passed between receiving the speech input of the keyword and receiving the recognized speech segment.

16. A computer-readable storage medium comprising instructions stored thereon that are executable by a computing device to:
- receive speech recognition data comprising a recognized speech segment, acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array, and confidence data comprising a recognition confidence value;
- receive image data comprising visual locational information related to a location of each person located in a field of view of the depth-sensing camera;
- compare the acoustic locational data to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of the image sensor;
- adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the depth-sensing camera;
- if it is determined that the recognized speech segment originated from a person in the field of view of the image sensor, then determine whether a face of the person is facing the image sensor; and
- adjust the confidence data based upon whether the face of the person is facing the image sensor.
- Dependent claims (19, 20):
  - 19. The computer-readable storage medium of claim 16, wherein the instructions are further executable to receive a speech input of a keyword before receiving the recognized speech segment, and adjust the confidence data based upon an amount of time that passed between receiving the speech input of the keyword and receiving the recognized speech segment.
  - 20. The computer-readable storage medium of claim 16, wherein the confidence data comprises a recognition confidence value, and wherein the instructions are further executable to adjust the confidence data by one or more of adjusting the recognition confidence value and including an intended input confidence value in the confidence data.
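
For illustration, the additional signals named in the dependent claims (rejection when the speaker is neither recognized nor in view, face orientation, and time elapsed since a keyword) could be combined along the following lines. The claims state only that the confidence data is adjusted based upon these factors; the multipliers, the 10-second window, the linear decay, and every name in this sketch are assumptions rather than anything the claims specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    in_field_of_view: bool                  # speech matched a person seen by the sensor
    recognized_speaker: bool                # speech matched a known speaker
    facing_sensor: Optional[bool]           # None when no person was matched
    seconds_since_keyword: Optional[float]  # None when no keyword preceded the speech


def adjusted_confidence(
    confidence: float,
    ev: Evidence,
    keyword_window_s: float = 10.0,  # assumed window over which the keyword counts
) -> Optional[float]:
    """Return an adjusted confidence value, or None to reject the segment."""
    # Reject when the speaker is neither recognized nor visible.
    if not ev.in_field_of_view and not ev.recognized_speaker:
        return None

    score = confidence
    if not ev.in_field_of_view:
        score *= 0.5  # assumed penalty: recognized speaker, but out of view
    elif ev.facing_sensor is False:
        score *= 0.7  # assumed penalty: visible, but facing away from the sensor

    if ev.seconds_since_keyword is not None:
        # Assumed linear decay: no penalty right after the keyword,
        # falling to a factor of 0.5 at or beyond the end of the window.
        elapsed = min(max(ev.seconds_since_keyword, 0.0), keyword_window_s)
        score *= 1.0 - 0.5 * (elapsed / keyword_window_s)
    return score


rejected = adjusted_confidence(0.9, Evidence(False, False, None, None))
facing_away = adjusted_confidence(0.8, Evidence(True, True, False, None))
stale_keyword = adjusted_confidence(0.8, Evidence(True, True, True, 20.0))
```

Returning `None` for rejection keeps a rejected segment distinct from one that merely scored low, which is one possible reading of "rejecting the recognized speech segment".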

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Vassigh, Ali M.; Dernis, Mitchell Stephen; Hawkins, Dax; Klein, Christian; Leyvand, Tommer; Flaks, Jason; McKay, Duncan

Granted Patent: US 8,676,581 B2
US Class Current: 704/240
CPC Class Codes
- A63F 2300/1081: Input via voice recognition
- A63F 2300/1087: comprising photodetecting m...
- A63F 2300/6072: of an input signal, e.g. pi...
- G06F 2218/22: Source localisation; Invers...
- G06V 40/161: Detection; Localisation; No...
- G10L 15/24: Speech recognition using no...
- G10L 17/00: Speaker identification or v...
- G10L 2015/228: of application context
- G10L 2021/02166: Microphone arrays; Beamforming
