Integrated local and cloud based speech recognition

US 8,660,847 B2
Filed: 09/02/2011
Issued: 02/25/2014
Est. Priority Date: 09/02/2011
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A system for integrating local speech recognition with cloud-based speech recognition in order to provide an efficient natural user interface is described. In some embodiments, a computing device determines a direction associated with a particular person within an environment and generates an audio recording associated with the direction. The computing device then performs local speech recognition on the audio recording in order to detect a first utterance spoken by the particular person and to detect one or more keywords within the first utterance. The first utterance may be detected by applying voice activity detection techniques to the audio recording. The first utterance and the one or more keywords are subsequently transferred to a server which may identify speech sounds within the first utterance associated with the one or more keywords and adapt one or more speech recognition techniques based on the identified speech sounds.
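The local stage described in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes a simple frame-energy threshold as the voice activity detector, and it assumes the keywords have already been spotted by a local recognizer that is not shown. The names `detect_utterance`, `build_request`, and `UtteranceRequest`, along with the frame size and threshold values, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def frame_energy(samples: List[float], start: int, size: int) -> float:
    """Mean squared amplitude of one frame."""
    frame = samples[start:start + size]
    return sum(s * s for s in frame) / len(frame)


def detect_utterance(samples: List[float], frame_size: int = 160,
                     threshold: float = 0.01) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices spanning the first through last
    frame whose energy exceeds the threshold, or None if no frame does."""
    active = [
        i for i in range(0, len(samples) - frame_size + 1, frame_size)
        if frame_energy(samples, i, frame_size) > threshold
    ]
    if not active:
        return None
    return active[0], active[-1] + frame_size


@dataclass
class UtteranceRequest:
    """Hypothetical payload for the server: utterance audio plus keyword text."""
    audio: List[float]
    keywords: List[str]


def build_request(samples: List[float],
                  keywords: List[str]) -> Optional[UtteranceRequest]:
    """Trim the recording to the detected utterance and attach the keywords.

    The keywords are assumed to come from a local keyword spotter (not shown).
    """
    span = detect_utterance(samples)
    if span is None:
        return None
    start, end = span
    return UtteranceRequest(audio=samples[start:end], keywords=list(keywords))
```

For a recording made of 320 silent samples, 480 samples alternating between 0.5 and -0.5, and 320 more silent samples, `detect_utterance` returns `(320, 800)`, so only the voiced span would be sent to the server.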

51 Citations

19 Claims

1. A method for performing speech recognition, comprising:
- acquiring a plurality of audio signals from a plurality of microphones, each of the plurality of audio signals is associated with a different microphone of the plurality of microphones, the plurality of audio signals is associated with a first environment;
  
  determining one or more directions within the first environment, the first environment includes one or more persons, each of the one or more directions is associated with a different person of the one or more persons;
  
  acquiring one or more images of the first environment using a capture device, the plurality of audio signals are associated with the first environment during a first period of time, the one or more images are associated with the first environment during the first period of time, the one or more images include one or more depth images, the determining one or more directions includes performing skeletal tracking based on the one or more images for each of the one or more persons;
  
  generating one or more audio recordings based on the plurality of audio signals, a first audio recording of the one or more audio recordings is generated by applying audio signal processing techniques to the plurality of audio signals such that sounds originating from a first direction of the one or more directions are amplified while other sounds originating from one or more other directions are attenuated;
  
  performing local speech recognition on each of the one or more audio recordings, the performing local speech recognition includes detecting a first utterance and detecting one or more keywords within the first utterance, the first utterance is detected by applying one or more speech detection techniques to the first audio recording of the one or more audio recordings;
  
  transmitting the first utterance and the one or more keywords to a second computing device, the second computing device performs a speech recognition technique on the first utterance, the speech recognition technique detects one or more words within the first utterance; and
  
  receiving a first response from the second computing device based on the first utterance.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13):
  - 2. The method of claim 1, wherein:
    - the second computing device identifies one or more speech sounds associated with the one or more keywords as pronounced within the first utterance, the second computing device adapts the speech recognition technique based on the one or more speech sounds; and
      
      the first response includes text information associated with the one or more words detected within the first utterance by the second computing device.
  - 3. The method of claim 1, wherein:
    - the transmitting the first utterance and the one or more keywords includes transmitting an audio file associated with the first utterance and transmitting text information associated with the one or more keywords to the second computing device; and
      
      the first utterance is detected by applying one or more voice activity detection techniques to the first audio recording.
  - 4. The method of claim 1, further comprising:
    - transmitting one or more location pointers associated with the one or more keywords to the second computing device, the detecting one or more keywords within the first utterance includes determining the one or more location pointers within the first utterance.
  - 5. The method of claim 1, further comprising:
    - performing echo cancellation on the plurality of audio signals prior to the determining one or more directions.
  - 6. The method of claim 1, wherein:
    - the determining one or more directions includes performing sound source localization, the performing sound source localization includes determining an angle and a degree of confidence associated with each of the one or more persons.
  - 7. The method of claim 1, wherein:
    - the first audio recording is generated in response to determining that a first person of the one or more persons is facing a particular direction, the determining that a first person of the one or more persons is facing a particular direction includes determining that a face of the first person is facing towards the capture device.
  - 8. The method of claim 1, wherein:
    - the generating one or more audio recordings includes performing beamforming techniques for each of the one or more directions.
  - 9. The method of claim 1, further comprising:
    - determining context information associated with a first computing device, the performing local speech recognition is performed on the first computing device; and
      
      transmitting the context information to the second computing device.
  - 10. The method of claim 9, wherein:
    - the context information includes identification of a particular application running on the first computing device.
  - 11. The method of claim 1, further comprising:
    - determining context information associated with a particular person of the one or more persons; and
      
      transmitting the context information to the second computing device.
  - 12. The method of claim 11, wherein:
    - the context information includes profile information associated with the particular person.
  - 13. The method of claim 12, wherein:
    - the performing local speech recognition is performed on a mobile computing device; and
      
      the first response includes Internet search results based on the first utterance and the context information.
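
Claims 1 and 8 describe amplifying sound from one direction while attenuating sound from others. One common way to do this is delay-and-sum beamforming, sketched below. The claims do not name a specific beamformer, so this choice is an assumption, as are the function name `delay_and_sum` and the use of whole-sample, non-negative delays.

```python
from typing import List


def delay_and_sum(signals: List[List[float]], delays: List[int]) -> List[float]:
    """Align each microphone signal by its steering delay and average them.

    delays[m] is how many samples later the target's sound arrives at
    microphone m than at the earliest microphone (assumed non-negative).
    Sound from the steered direction adds coherently; sound from other
    directions is misaligned and is reduced by the averaging.
    """
    if len(signals) != len(delays):
        raise ValueError("one delay per microphone is required")
    # Only keep output samples for which every shifted signal has data.
    length = min(len(sig) - d for sig, d in zip(signals, delays))
    return [
        sum(sig[n + d] for sig, d in zip(signals, delays)) / len(signals)
        for n in range(length)
    ]
```

As a usage example, take a unit pulse that reaches a second microphone two samples after the first. Steering with delays `[0, 2]` realigns the two copies and preserves the pulse at full height, while steering the wrong way with `[2, 0]` leaves them misaligned, so each copy contributes only half its amplitude to the output.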

14. An electronic device for integrating local and cloud-based speech recognition, comprising:
- a capture device including a plurality of microphones, the capture device acquires one or more sounds from the plurality of microphones, the one or more sounds are associated with a first environment, the first environment includes one or more persons, the capture device acquires one or more images of the first environment, the one or more images include one or more depth images; and
  
  one or more processors, the one or more processors determine one or more directions within the first environment by performing skeletal tracking based on the one or more depth images for each of the one or more persons, each of the one or more directions is associated with a different person of the one or more persons, the one or more processors generate one or more audio recordings based on the one or more sounds, each of the one or more audio recordings is associated with a different direction of the one or more directions, the one or more processors detect a first utterance within a first audio recording of the one or more audio recordings by applying one or more speech detection techniques to the first audio recording, the one or more processors detect one or more keywords within the first utterance, the one or more processors transmit the first utterance and the one or more keywords to a second computing device, the second computing device performs a speech recognition technique on the first utterance based on the one or more keywords, the speech recognition technique detects one or more words within the first utterance, the one or more processors receive a first response from the second computing device based on the first utterance.
- Dependent claims (15, 16):
  - 15. The electronic device of claim 14, wherein:
    - the second computing device identifies one or more speech sounds associated with the one or more keywords as pronounced within the first utterance, the second computing device adapts the speech recognition technique based on the one or more speech sounds.
  - 16. The method of claim 15, wherein:
    - the one or more processors identify a particular application being executed on the electronic device, the one or more processors transmit identification information associated with the particular application to the second computing device, the second computing device performs an Internet search based on the identification information and the one or more words detected within the first utterance; and
      
      the first response includes Internet search results based on the identification information and the first utterance.

17. One or more storage devices containing processor readable code for programming one or more processors to perform a method for integrating local and cloud-based speech recognition comprising the steps of:
- receiving one or more sounds from a plurality of microphones, the one or more sounds are associated with a first environment during a first time period;
  
  receiving one or more depth images of the first environment, the one or more depth images are associated with the first environment during the first time period;
  
  determining one or more locations within the first environment, the first environment includes one or more persons, each of the one or more directions is associated with a different person of the one or more persons, the determining one or more locations includes performing skeletal tracking based on the one or more depth images for each of the one or more persons;
  
  generating one or more audio recordings based on the one or more sounds, each of the one or more audio recordings is associated with a different location of the one or more locations;
  
  performing local speech recognition on each of the one or more audio recordings, the performing local speech recognition includes detecting a first utterance and detecting one or more keywords within the first utterance, the first utterance is detected by applying one or more speech detection techniques to a first audio recording of the one or more audio recordings;
  
  transmitting the first utterance and the one or more keywords to a second computing device, the second computing device performs a speech recognition technique on the first utterance based on the one or more keywords, the speech recognition technique detects one or more words within the first utterance; and
  
  receiving a first response from the second computing device based on the first utterance.
- Dependent claims (18, 19):
  - 18. The one or more storage devices of claim 17, wherein:
    - the second computing device identifies one or more speech sounds associated with the one or more keywords as pronounced within the first utterance, the second computing device configures the speech recognition technique based on the one or more speech sounds.
  - 19. The one or more storage devices of claim 18, wherein:
    - the determining one or more locations includes performing sound source localization, the performing sound source localization includes determining an angle and a degree of confidence associated with each of the one or more persons; and
      
      the generating one or more audio recordings includes performing beamforming techniques for each of the one or more directions.
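
Claims 6 and 19 recite sound source localization that yields an angle and a degree of confidence. A minimal sketch for a two-microphone pair follows. It assumes a time-delay-of-arrival estimate from cross-correlation, a far-field source, and the normalized correlation peak as the confidence score; none of these choices is stated in the claims, and the function name `localize` and its parameters are hypothetical. Skeletal tracking from depth images is not covered here.

```python
import math
from typing import List, Tuple

SPEED_OF_SOUND = 343.0  # metres per second, in air at about 20 degrees C


def localize(left: List[float], right: List[float], mic_spacing_m: float,
             sample_rate_hz: float, max_lag: int) -> Tuple[float, float]:
    """Estimate (angle in degrees, confidence in [0, 1]) from two microphones.

    A positive angle means the sound reached the left microphone first.
    """
    def correlation(lag: int) -> float:
        # Sum of left[n] * right[n + lag] over the overlapping samples.
        return sum(
            left[n] * right[n + lag]
            for n in range(len(left))
            if 0 <= n + lag < len(right)
        )

    energy = math.sqrt(sum(s * s for s in left) * sum(s * s for s in right))
    if energy == 0.0:
        return 0.0, 0.0  # silence: no direction and no confidence

    # The lag with the strongest correlation is the estimated arrival delay.
    best_lag = max(range(-max_lag, max_lag + 1), key=correlation)
    confidence = max(0.0, correlation(best_lag) / energy)

    # Convert the delay to an angle; clamp so asin stays within its domain.
    ratio = SPEED_OF_SOUND * best_lag / (sample_rate_hz * mic_spacing_m)
    angle = math.degrees(math.asin(max(-1.0, min(1.0, ratio))))
    return angle, confidence
```

The resulting angle could then select the steering direction for a beamformer, and a low confidence value could be used to skip directions where no clear source is present.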

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Soemo, Thomas M.; Soong, Leo; Kim, Michael H.; Heinemann, Chad R.; Hawkins, Dax H.
Primary Examiner(s): He, Jialong
Assistant Examiner(s): Shan, Jie
Application Number: US13/224,778
Publication Number: US 20130060571A1
Time in Patent Office: 907 Days
Field of Search: 704/251
US Class Current: 704/251
CPC Class Codes:
G06F 3/011   Arrangements for interactio...
G06F 3/167   Audio in a user interface, ...
G10L 15/30   Distributed recognition, e....
