Speech recognition analysis via identification information

US 8,676,581 B2
Filed: 01/22/2010
Issued: 03/18/2014
Est. Priority Date: 01/22/2010
Status: Active Grant

First Claim

Patent Images

1. In a computing system comprising a microphone array and an image sensor, a method of operating a speech recognition input system, the method comprising:

receiving speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;

receiving acoustic locational data as an output from a digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array;

receiving image data comprising visual locational information related to a location of each person located in a field of view of the image sensor;

comparing the acoustic locational data to the visual locational information to determine whether the recognized speech segment originated from a person in the field of view of the image sensor; and

adjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the image sensor.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Embodiments are disclosed that relate to the use of identity information to help avoid the occurrence of false positive speech recognition events in a speech recognition system. One embodiment provides a method comprising receiving speech recognition data comprising a recognized speech segment, acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array, and confidence data comprising a recognition confidence value, and also receiving image data comprising visual locational information related to a location of each person in an image. The acoustic locational data is compared to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of the image sensor, and the confidence data is adjusted depending on this determination.

265 Citations

20 Claims

1. In a computing system comprising a microphone array and an image sensor, a method of operating a speech recognition input system, the method comprising:
- receiving speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;
  
  receiving acoustic locational data as an output from a digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array;
  
  receiving image data comprising visual locational information related to a location of each person located in a field of view of the image sensor;
  
  comparing the acoustic locational data to the visual locational information to determine whether the recognized speech segment originated from a person in the field of view of the image sensor; and
  
  adjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the image sensor.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
- - 2. The method of claim 1, wherein adjusting the confidence data comprises adjusting the recognition confidence value such that the recognition confidence value has a lower value after adjusting if the recognized speech segment is determined not to have originated from a person in the field of view of the image sensor than if the recognized speech segment is determined to have originated from a person in the field of view of the image sensor.
  - 3. The method of claim 1, further comprising determining to reject the recognized speech segment as a speech input when the recognition confidence value is below a minimum confidence threshold.
  - 4. The method of claim 1, further comprising adjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a recognized speaker.
  - 5. The method of claim 1, wherein, if the recognized speech segment is determined not to have originated from a recognized speaker and is determined not to have originated from a person in the field of view of the image sensor, then adjusting the confidence data comprises rejecting the recognized speech segment.
  - 6. The method of claim 1, wherein, if it is determined that the recognized speech segment originated from a person in the field of view of the image sensor, then determining whether a face of the person is facing the image sensor, and adjusting the recognition confidence value such that the recognition confidence value has a lower value after adjusting if the face of the person is not facing the image sensor than if the face of the person is facing the image sensor.
  - 7. The method of claim 1, further comprising receiving a speech input of a keyword before receiving the recognized speech segment, and wherein adjusting the confidence data comprises adjusting the recognition confidence value based upon an amount of time that passed between receiving the speech input of the keyword and receiving the recognized speech segment.
  - 8. The method of claim 1, wherein the image sensor is a depth-sensing camera, and wherein receiving image data comprising visual locational information comprises receiving image data comprising information related to a distance of each person in the field of view of the depth-sensing camera.

9. An interactive entertainment system, comprising:
- a depth-sensing camera;
  
  a microphone array comprising a plurality of microphones; and
  
  a computing device comprising a processor and memory comprising instructions stored thereon that are executable by the processor to;
  
  receive speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;
  
  receive acoustic locational data as an output from a digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array;
  
  receive image data comprising visual locational information related to a location of each person located in a field of view of the depth-sensing camera;
  
  compare the acoustic locational data to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of an image sensor; and
  
  adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the depth-sensing camera.
- View Dependent Claims (10, 11, 12, 13, 14, 15)
- - 10. The interactive entertainment system of claim 9, wherein the instructions are executable to adjust the confidence data by adjusting the recognition confidence value such that the recognition confidence value has a lower value after adjusting if the recognized speech segment is determined not to have originated from a person in the field of view of the depth-sensing camera than if the recognized speech segment is determined to have originated from a person in the field of view of the depth-sensing camera.
  - 11. The interactive entertainment system of claim 9, wherein the instructions are further executable to determine to reject the recognized speech segment as a speech input when the confidence value is below a minimum confidence threshold.
  - 12. The interactive entertainment system of claim 9, wherein the instructions are further executable to:
    - determine if the recognized speech segment originated from a recognized speaker; and
      
      adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a recognized speaker.
  - 13. The interactive entertainment system of claim 12, wherein the instructions are further executable to reject the recognized speech segment if the recognized speech segment is determined not to have originated from a recognized speaker and the recognized speech segment is determined not to have originated from a person in the field of view of the depth-sensing camera.
  - 14. The interactive entertainment system of claim 9, wherein the instructions are further executable to:
    - determine that the recognized speech segment originated from a person in the field of view of the image sensor,determine whether a face of the person is facing the image sensor; and
      
      adjust the confidence data such that the recognition confidence value has a lower value after adjusting if the face of the person is not facing the image sensor than if the face of the person is facing the image sensor.
  - 15. The interactive entertainment device of claim 9, further comprising receiving a speech input of a keyword before receiving the recognized speech segment, and wherein adjusting the confidence data comprises adjusting the recognized confidence value based upon an amount of time that passed between receiving the speech input of the keyword and receiving the recognized speech segment.

16. A hardware computer-readable storage device comprising instructions stored thereon that are executable by a computing device to:
- receive speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition stage being configured to compare inputs received from a digital audio processing stage of the audio processing pipeline to a plurality of recognized speech patterns to recognize speech inputs, and the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;
  
  receive acoustic locational data as an output from a digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from a microphone array;
  
  receive image data comprising visual locational information related to a location of each person located in a field of view of a depth-sensing camera;
  
  compare the acoustic locational data to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of an image sensor;
  
  adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the depth-sensing camera;
  
  if it is determined that the recognized speech segment originated from a person in the field of view of the image sensor, then determine whether a face of the person is facing the image sensor; and
  
  adjusting the confidence data such that the recognition confidence value has a lower value after adjusting if the face of the person is not facing the image sensor than if the face of the person is facing the image sensor.
- View Dependent Claims (17, 18, 19, 20)
- - 17. The hardware computer-readable storage device of claim 16, wherein the instructions are further executable to:
    - determine if the recognized speech segment originated from a recognized speaker; and
      
      adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a recognized speaker.
  - 18. The hardware computer-readable storage device of claim 17 wherein the instructions are executable to reject the recognized speech segment if the recognized speech segment is determined not to have originated from a recognized speaker and the recognized speech segment is determined not to have originated from a person in the field of view of the depth-sensing camera.
  - 19. The hardware computer-readable storage device of claim 16, wherein the instructions are further executable to receive a speech input of a keyword before receiving the recognized speech segment, and adjust the recognized confidence value based upon an amount of time that passed between receiving the speech input of the keyword and receiving the recognized speech segment.
  - 20. The hardware computer-readable storage device of claim 16, wherein the instructions are further executable to adjust the confidence data by one or more of adjusting the recognition confidence value and including an intended input confidence value in the confidence data.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Flaks, Jason, Hawkins, Dax, Klein, Christian, Dernis, Mitchell Stephen, Leyvand, Tommer, Vassigh, Ali M., McKay, Duncan
Primary Examiner(s)
Desir, Pierre-Louis
Assistant Examiner(s)
Sirjani, Fariba

Application Number

US12/692,538
Publication Number

US 20110184735A1
Time in Patent Office

1,516 Days
Field of Search

704/240, 704/270, 704/272, 704/275
US Class Current

704/240
CPC Class Codes

A63F 2300/1081   Input via voice recognition

A63F 2300/1087   comprising photodetecting m...

A63F 2300/6072   of an input signal, e.g. pi...

G06F 2218/22   Source localisation; Invers...

G06V 40/161   Detection; Localisation; No...

G10L 15/24   Speech recognition using no...

G10L 17/00   Speaker identification or v...

G10L 2015/228   of application context

G10L 2021/02166   Microphone arrays; Beamforming

Speech recognition analysis via identification information

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

265 Citations

20 Claims

Specification

Use Cases

Quick Links

Others

Speech recognition analysis via identification information

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

265 Citations

20 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others