Hybridized client-server speech recognition

US 10,217,463 B2
Filed: 04/28/2017
Issued: 02/26/2019
Est. Priority Date: 02/22/2011
Status: Active Grant

Abstract

A recipient computing device can receive a speech utterance to be processed by speech recognition and segment the speech utterance into two or more speech utterance segments, each of which can be assigned to one of a plurality of available speech recognizers. A first one of the plurality of available speech recognizers can be implemented on a separate computing device accessible via a data network. A first segment can be processed by the first recognizer and the results of the processing returned to the recipient computing device, and a second segment can be processed by a second recognizer implemented at the recipient computing device.
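
The flow summarized in the abstract can be illustrated with a minimal Python sketch. It is an illustration only, not the patented implementation: the `Segment` type, the `recognize_hybrid` function, and the two recognizer callables are hypothetical names, and joining the per-segment results with spaces is an assumption about how the completed result is assembled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    index: int    # position of the segment within the utterance
    audio: bytes  # audio payload; the encoding is left unspecified here
    remote: bool  # True if assigned to the recognizer on the separate device


def recognize_hybrid(
    segments: List[Segment],
    local_recognizer: Callable[[bytes], str],
    remote_recognizer: Callable[[bytes], str],
) -> str:
    """Route each segment to its assigned recognizer and assemble one result."""
    results: Dict[int, str] = {}
    for seg in segments:
        # In a real system the remote call would travel over a data network;
        # here both recognizers are plain callables (an assumption).
        recognizer = remote_recognizer if seg.remote else local_recognizer
        results[seg.index] = recognizer(seg.audio)
    # Assemble in utterance order, regardless of the order results arrived in.
    return " ".join(results[i] for i in sorted(results))
```

Keeping an explicit `index` on each segment lets the final result be assembled in utterance order even if the remote results come back after the local ones.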

20 Claims

1. A method comprising:
- receiving, at a recipient computing device, a speech utterance;
  
  dynamically determining a confidence threshold value and an audio quality threshold value based on environmental conditions at which the recipient computing device is located, the environmental conditions comprising one or more of:
  
  a type of environment in which the recipient computing device is located, availability of noise cancelling devices at the recipient computing device, and number of microphones used by the recipient computing device;
  
  segmenting the speech utterance into two or more speech utterance segments, including performing an initial analysis on the speech utterance, to determine where to perform speech recognition processing for each of the two or more speech utterance segments, by applying to the speech utterance a dynamically adaptable acoustic model implemented at the recipient computing device, with the dynamically adaptable acoustic model adjusted based on locally available data at the recipient computing device, including a user location and time, to determine a confidence score and an audio quality metric for each of the two or more speech utterance segments;
  
  assigning, based on the initial analysis performed by the adaptable acoustic model generating the determined confidence score and audio quality metric for the each of the two or more speech utterance segments, and based on the dynamically determined confidence threshold and the audio quality threshold, a first segment from the two or more speech utterance segments to a first speech recognizer implemented on a separate computing device than the recipient computing device, and a second segment from the two or more speech utterance segments to a second speech recognizer implemented on the recipient computing device;
  
  sending the first segment from the recipient computing device to the separate computing device for processing;
  
  receiving first segment processing results back from the separate computing device, the sending and the receiving occurring via a data network;
  
  processing the second segment at the recipient computing device to generate second segment processing results; and
  
  returning a completed speech recognition result assembled from the first segment processing results and the second segment processing results.
  - 2. The method of claim 1, wherein assigning the first segment and the second segment comprises:
    - designating the first segment for processing by the first speech recognizer when at least one of the confidence score and the audio quality metric for the first segment, determined using the dynamically adaptable acoustic model, is below the respective confidence threshold value and the audio quality threshold value; and
      
      designating the second segment of the two or more speech utterance segments for processing by the second speech recognizer when another confidence score and another audio quality metric for the second segment, determined using the dynamically adaptable acoustic model, is above the respective confidence threshold value and the audio quality threshold value.
  - 3. The method of claim 1, further comprising:
    - determining an amount of an available bandwidth between the recipient computing device and the separate computing device;
      
      wherein segmenting the speech utterance further comprises:
      
      segmenting, upon determining that the available bandwidth is sufficient, the speech utterance into the two or more speech utterance segments, including performing the initial analysis for further identifying features of the speech utterance that can be more efficiently processed by the separate computing device than the recipient computing device.
  - 4. The method of claim 3, wherein identifying of the features of the speech utterance comprises:
    - determining processing speeds associated with the separate computing device and the recipient computing device, the available bandwidth, and a presence of a word or phrase capable of being efficiently modeled by a context-free grammar at the recipient computing device.
  - 5. The method of claim 3, wherein initially analyzing the speech utterance by identifying the features of the speech utterance that can be more efficiently processed by the separate computing device than the recipient computing device comprises:
    - identifying a feature of the speech utterance as commands-based speech data corresponding to the second segment to be processed at the second speech recognizer implemented at the recipient computing device; and
      
      identifying another feature of the speech utterance as additional information, including one or more of dictation data, music or video data, user acoustic profile data, or foreign language speech data, the additional information being related to the identified command-based speech data corresponding to the second segment, with the other feature of the speech utterance to be processed at the first speech recognizer implemented on the separate computing device.
  - 6. The method of claim 3, wherein identifying the features of the speech utterance comprises:
    - analyzing the speech utterance using the dynamically adaptable acoustic model implemented on one or more processors at the recipient computing device.
  - 7. The method of claim 3, wherein assigning each of the two or more speech utterance segments further comprises:
    - designating the first segment for processing by the first speech recognizer implemented on the separate computing device when the first segment is determined by the initially analyzing to include one or more words that relate to data that are more readily accessible at the separate computing device than at the recipient computing device.
  - 8. The method of claim 1, wherein sending the first segment from the recipient computing device to the separate computing device for processing comprises:
    - sending the first segment from the recipient computing device to the separate computing device for processing using an adapted language model that is one or more of:
      
      time sensitive, and location sensitive;
      
      wherein the second segment is processed by a local language model implemented at the recipient computing device.
  - 9. The method of claim 1, wherein the respective threshold values are determined based on one or more criteria defined at design time or dynamically evaluated at run time.
  - 10. The method of claim 1, wherein the recipient computing device comprises a thin client computing device or terminal, and the separate computing device comprises at least one server accessible over the data network from the thin client computing device or terminal.
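
The threshold determination of claim 1 and the assignment rule of claim 2 can be sketched as follows. This is a hedged illustration, not the patented implementation: the `Environment` and `Thresholds` types, the function names, the environment labels, and every numeric value are hypothetical. The claims also do not say what happens when a score exactly equals its threshold; this sketch keeps such a segment local as an assumption.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    kind: str               # type of environment, e.g. "car" (labels are hypothetical)
    noise_cancelling: bool  # whether a noise cancelling device is available
    microphones: int        # number of microphones in use


@dataclass
class Thresholds:
    confidence: float
    audio_quality: float


def determine_thresholds(env: Environment) -> Thresholds:
    """Pick thresholds from environmental conditions.

    All numbers are illustrative assumptions, not values from the patent.
    """
    confidence, quality = (0.8, 0.7) if env.kind == "car" else (0.6, 0.5)
    if env.noise_cancelling:
        confidence -= 0.1
        quality -= 0.1
    if env.microphones > 1:
        quality -= 0.05
    return Thresholds(confidence, quality)


def assign(confidence_score: float, audio_quality: float, t: Thresholds) -> str:
    """Apply the rule of claim 2 to one segment."""
    # Remote when at least one metric falls below its threshold.
    if confidence_score < t.confidence or audio_quality < t.audio_quality:
        return "remote"
    # Otherwise process on the recipient device (ties kept local by assumption).
    return "local"
```

For example, with the illustrative thresholds for a car without noise cancelling, a segment scoring 0.9 on both metrics stays local, while a segment whose confidence score is 0.5 is sent to the separate computing device.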

11. A recipient computing device comprising:
- at least one programmable processor;
  
  a communication unit to communicate with remote computing devices; and
  
  a computer-readable storage medium, coupled to the at least one processor and the communication unit, storing instructions that, when executed by the at least one processor, cause the at least one programmable processor to:
  
  receive a speech utterance;
  
  dynamically determine a confidence threshold value and an audio quality threshold value based on environmental conditions at which the recipient computing device is located, the environmental conditions comprising one or more of:
  
  a type of environment in which the recipient computing device is located, availability of noise cancelling devices at the recipient computing device, and number of microphones used by the recipient computing device;
  
  segment the speech utterance into two or more speech utterance segments, including to perform an initial analysis on the speech utterance, in order to determine where to perform speech recognition processing for each of the two or more speech utterance segments, so as to apply to the speech utterance a dynamically adaptable acoustic model implemented at the recipient computing device, with the dynamically adaptable acoustic model adjusted based on locally available data at the recipient computing device, including a user location and time, to determine a confidence score and an audio quality metric for each of the two or more speech utterance segments;
  
  assign, based on the initial analysis performed by the adaptable acoustic model generating the determined confidence score and audio quality metric for the each of the two or more speech utterance segments, and based on the dynamically determined confidence threshold and the audio quality threshold, a first segment from the two or more speech utterance segments to a first speech recognizer implemented on a separate computing device than the recipient computing device, and a second segment from the two or more speech utterance segments to a second speech recognizer implemented on the recipient computing device;
  
  send the first segment from the recipient computing device to the separate computing device for processing;
  
  receive first segment processing results back from the separate computing device, the sending and the receiving occurring via a data network;
  
  process the second segment at the recipient computing device to generate second segment processing results; and
  
  return a completed speech recognition result assembled from the first segment processing results and the second segment processing results.
  - 12. The recipient computing device of claim 11, wherein the instructions to cause the at least one programmable processor to assign the first segment and the second segment comprise one or more instructions to cause the at least one programmable processor to:
    - designate the first segment for processing by the first speech recognizer when at least one of the confidence score and the audio quality metric for the first segment, determined using the dynamically adaptable acoustic model, is below the respective confidence threshold value and the audio quality threshold value; and
      
      designate the second segment of the two or more speech utterance segments for processing by the second speech recognizer when another confidence score and another audio quality metric for the second segment, determined using the dynamically adaptable acoustic model, is above the respective confidence threshold value and the audio quality threshold value.
  - 13. The recipient computing device of claim 11, wherein the instructions comprise further instructions to cause the at least one programmable processor to:
    - determine an amount of an available bandwidth between the recipient computing device and the separate computing device;
      
      wherein the instruction to cause the at least one programmable processor to segment the speech utterance further comprises one or more instructions to:
      
      segment, upon determining that the available bandwidth is sufficient, the speech utterance into the two or more speech utterance segments, including to perform the initial analysis to further identify features of the speech utterance that can be more efficiently processed by the separate computing device than the recipient computing device.
  - 14. The recipient computing device of claim 13, wherein the one or more instructions to cause the at least one programmable processor to identify the features of the speech utterance comprises instructions to cause the at least one programmable processor to:
    - determine processing speeds associated with the separate computing device and the recipient computing device, the available bandwidth, and a presence of a word or phrase capable of being efficiently modeled by a context-free grammar at the recipient computing device.
  - 15. The recipient computing device of claim 13, wherein the instructions to cause the at least one programmable processor to initially analyze the speech utterance to identify the features of the speech utterance that can be more efficiently processed by the separate computing device than the recipient computing device comprises instructions to cause the at least one programmable processor to:
    - identify a feature of the speech utterance as commands-based speech data corresponding to the second segment to be processed at the second speech recognizer implemented at the recipient computing device; and
      
      identify another feature of the speech utterance as additional information, including one or more of dictation data, music or video data, user acoustic profile data, or foreign language speech data, the additional information being related to the identified command-based speech data corresponding to the second segment, with the other feature of the speech utterance to be processed at the first speech recognizer implemented on the separate computing device.
  - 16. The recipient computing device of claim 11, wherein the instructions to cause the at least one programmable processor to send the first segment from the recipient computing device to the separate computing device for processing comprises one or more instructions to cause the at least one programmable processor to:
    - send the first segment from the recipient computing device to the separate computing device for processing using an adapted language model that is one or more of:
      
      time sensitive, and location sensitive.
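
Claims 5 and 15 split an utterance into commands-based speech data, handled on the recipient device, and related additional information (such as dictation), handled on the separate device. A rough sketch of that split is shown below. It assumes, purely for illustration, that the initial analysis has already produced preliminary word labels; the `COMMANDS` vocabulary and the `split_command` function are hypothetical and do not come from the patent.

```python
from typing import List, Tuple

# Hypothetical command vocabulary that a small local grammar could cover.
COMMANDS = {"call", "play", "text"}


def split_command(words: List[str]) -> Tuple[List[str], List[str]]:
    """Return (local_words, remote_words) for a preliminary word sequence.

    A leading command word stays on the recipient device; the remaining
    words (the related additional information) go to the separate device.
    """
    if words and words[0].lower() in COMMANDS:
        return words[:1], words[1:]
    # No recognizable command: treat the whole utterance as remote material.
    return [], list(words)
```

Under these assumptions, an utterance such as "play some jazz" would keep "play" for the local recognizer and send "some jazz" to the remote recognizer.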

17. A computer program product comprising a non-transitory computer-readable storage medium storing instructions that, when executed by a computing system comprising at least one programmable processor, cause the computing system to perform operations comprising:
- receiving, at a recipient computing device, a speech utterance;
  
  dynamically determining a confidence threshold value and an audio quality threshold value based on environmental conditions at which the recipient computing device is located, the environmental conditions comprising one or more of:
  
  a type of environment in which the recipient computing device is located, availability of noise cancelling devices at the recipient computing device, and number of microphones used by the recipient computing device;
  
  segmenting the speech utterance into two or more speech utterance segments, including performing an initial analysis on the speech utterance, to determine where to perform speech recognition processing for each of the two or more speech utterance segments, by applying to the speech utterance a dynamically adaptable acoustic model implemented at the recipient computing device, with the dynamically adaptable acoustic model adjusted based on locally available data at the recipient computing device, including a user location and time, to determine a confidence score and an audio quality metric for each of the two or more speech utterance segments;
  
  assigning, based on the initial analysis performed by the adaptable acoustic model generating the determined confidence score and audio quality metric for the each of the two or more speech utterance segments, and based on the dynamically determined confidence threshold and the audio quality threshold, a first segment from the two or more speech utterance segments to a first speech recognizer implemented on a separate computing device than the recipient computing device, and a second segment from the two or more speech utterance segments to a second speech recognizer implemented on the recipient computing device;
  
  sending the first segment from the recipient computing device to the separate computing device for processing;
  
  receiving first segment processing results back from the separate computing device, the sending and the receiving occurring via a data network;
  
  processing the second segment at the recipient computing device to generate second segment processing results; and
  
  returning a completed speech recognition result assembled from the first segment processing results and the second segment processing results.
  - 18. The computer program product of claim 17, wherein assigning the first segment and the second segment comprises:
    - designating the first segment for processing by the first speech recognizer when at least one of the confidence score and the audio quality metric for the first segment, determined using the dynamically adaptable acoustic model, is below the respective confidence threshold value and the audio quality threshold value; and
      
      designating the second segment of the two or more speech utterance segments for processing by the second speech recognizer when another confidence score and another audio quality metric for the second segment, determined using the dynamically adaptable acoustic model, is above the respective confidence threshold value and the audio quality threshold value.
  - 19. The computer program product of claim 17, wherein the instructions comprise further instructions to cause the at least one programmable processor to perform further operations comprising:
    - determining an amount of an available bandwidth between the recipient computing device and the separate computing device;
      
      wherein segmenting the speech utterance further comprises:
      
      segmenting, upon determining that the available bandwidth is sufficient, the speech utterance into the two or more speech utterance segments, including performing the initial analysis for further identifying features of the speech utterance that can be more efficiently processed by the separate computing device than the recipient computing device.
  - 20. The computer program product of claim 19, wherein identifying of the features of the speech utterance comprises:
    - determining processing speeds associated with the separate computing device and the recipient computing device, the available bandwidth, and a presence of a word or phrase capable of being efficiently modeled by a context-free grammar at the recipient computing device.
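
The bandwidth check of claims 3, 13, and 19 and the factors listed in claims 4, 14, and 20 (processing speeds, available bandwidth, and whether a phrase fits a local context-free grammar) could be combined along the following lines. This is a sketch under stated assumptions, not the claimed method: the function names, the parameters, and the simple time comparison are all hypothetical.

```python
def bandwidth_is_sufficient(available_bps: float, required_bps: float) -> bool:
    """Gate segmentation on available bandwidth (the required rate is an assumption)."""
    return available_bps >= required_bps


def prefer_remote(
    segment_bytes: int,
    available_bps: float,
    local_seconds: float,
    remote_seconds: float,
    matches_local_grammar: bool,
) -> bool:
    """Estimate whether the separate device would process a segment more efficiently.

    local_seconds and remote_seconds are assumed processing-time estimates
    derived from the processing speeds of the two devices.
    """
    # A phrase that a local context-free grammar models well stays local.
    if matches_local_grammar:
        return False
    # Without usable bandwidth the segment cannot be sent at all.
    if available_bps <= 0:
        return False
    transfer_seconds = segment_bytes * 8 / available_bps
    return transfer_seconds + remote_seconds < local_seconds
```

The comparison treats upload time plus remote processing time as the cost of the remote path, so a fast server only wins when the link is fast enough to offset the transfer.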

Current Assignee: Speak With Me, Inc.
Original Assignee: Speak With Me, Inc.
Inventors: Juneja, Ajay
Primary Examiner(s): Lerner, Martin
Application Number: US15/581,269
Publication Number: US 20170229122A1
Time in Patent Office: 669 Days
Field of Search: 704/231, 704/233, 704/236, 704/270.1, 704/255
CPC Class Codes

G10L 15/005   Language recognition

G10L 15/02   Feature extraction for spee...

G10L 15/04   Segmentation; Word boundary...

G10L 15/08   Speech classification or se...

G10L 15/193   Formal grammars, e.g. finit...

G10L 15/22   Procedures used during a sp...

G10L 15/30   Distributed recognition, e....

G10L 15/32   Multiple recognisers used i...

G10L 25/00   Speech or voice analysis te...

G10L 25/60   for measuring the quality o...

H04L 43/08   Monitoring or testing based...

H04L 43/0894   Packet rate

H04L 67/01   Protocols

H04L 67/63   Routing a service request d...

H04M 1/271   controlled by voice recogni...

H04M 1/72412   using two-way short-range w...

H04M 1/72436   for text messaging, e.g. sh...

H04M 2250/14   including a card reading de...

H04M 2250/74   with voice recognition means