Low latency audio interface

US 10,079,021 B1
Filed: 12/18/2015
Issued: 09/18/2018
Est. Priority Date: 12/18/2015
Status: Active Grant

Abstract

Systems and methods for utilizing incremental processing of portions of output data to limit the time required to provide a response to a user request are provided herein. In some embodiments, portions of the user request for information can be analyzed using techniques such as automatic speech recognition (ASR), speech-to-text (STT), and natural language understanding (NLU) to determine the overall topic of the user request. Once the topic has been determined, portions of the anticipated audio output data can be synthesized independently instead of waiting for the complete response. The synthesized portions can then be provided to the electronic device in anticipation of being output through one or more speakers on the electronic device, which reduces the time needed to provide the response to the user.
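The incremental approach in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: `synthesize`, `query_skill`, and `respond` are hypothetical names, the sleeps stand in for real synthesis and lookup latency, and the "audio" is just encoded text. The sketch only shows the ordering idea, where topic-dependent segments are synthesized and sent while the lookup for the actual answer is still in flight.

```python
import asyncio


async def synthesize(text: str) -> bytes:
    """Stand-in for a TTS engine (hypothetical): returns fake audio bytes."""
    await asyncio.sleep(0.01)  # simulated synthesis latency
    return text.encode("utf-8")


async def query_skill(topic: str) -> str:
    """Stand-in for a skill interface lookup (hypothetical, slower than TTS)."""
    await asyncio.sleep(0.05)  # simulated lookup latency
    return {"weather": "sunny and 72 degrees"}.get(topic, "unavailable")


async def respond(topic: str, sent: list) -> None:
    """Append segments to `sent` in the order they would be transmitted."""
    # Start the slow lookup first so it overlaps with the synthesis below.
    lookup = asyncio.create_task(query_skill(topic))
    # The first two segments depend only on the topic, not on the answer,
    # so they can be synthesized concurrently and sent right away.
    first, second = await asyncio.gather(
        synthesize(f"The {topic}"),
        synthesize("today is"),
    )
    sent.append(first)
    sent.append(second)
    # Only the final segment needs the information from the skill.
    info = await lookup
    sent.append(await synthesize(info))


def run_demo(topic: str = "weather") -> list:
    """Run the pipeline once and return the transmitted segments as text."""
    sent = []
    asyncio.run(respond(topic, sent))
    return [chunk.decode("utf-8") for chunk in sent]
```

In this toy setup the first two segments are ready after roughly one synthesis delay, while a fully sequential design would wait for the lookup before synthesizing anything.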

35 Citations

13 Claims

1. A method comprising:
- receiving, from an electronic device, audio input data representing a request;
  
  performing speech recognition on the audio input data to obtain word data;
  
  using natural language understanding (NLU) techniques on the word data to determine a topic associated with the request;
  
  generating first audio output data including first words that are related to the topic;
  
  determining that additional processing is needed to generate information responsive to the request;
  
  generating second audio output data including second words that are related to the topic, the second words including transitional words between the first audio output data and the second audio output data, wherein the first audio output data and the second audio output data are generated at least partially in parallel;
  
  prior to all of the audio input data being received from the electronic device, sending at least a portion of the first audio output data to the electronic device;
  
  sending, at least partially in parallel to the first audio output data being sent, a communication to an interface associated with a skill to determine the information responsive to the request;
  
  receiving, from the interface, the information responsive to the request;
  
  generating third audio output data that includes the information responsive to the request and at least one additional transitional word between the second audio output data and the third audio output data, wherein the third audio output data and the second audio output data are generated at least partially in parallel;
  
  prior to all of the audio input data being received from the electronic device, sending at least a portion of the second audio output data to the electronic device, wherein sending the at least a portion of the second audio output data occurs at least partially in parallel with generating the third audio output data; and
  
  sending the third audio output data to the electronic device.
  - 2. The method of claim 1, wherein each of the first audio output data, the second audio output data, and the third audio output data are generated by evaluating a model cost associated with the first audio output data, the second audio output data, and the third audio output data, respectively, and by evaluating concatenation costs associated with transitions between the first words and the second words in the first audio output data and second audio output data, and the second words and third words in the second audio output data and the third audio output data.

3. A method, comprising:
- receiving, from an electronic device, audio input data representing a first series of words associated with a request;
  
  determining, using at least one natural language understanding (NLU) component, a topic to which the request relates;
  
  generating first audio output data representing at least a first word, the at least first word being associated with the topic;
  
  accessing an interface associated with a skill to determine information responsive to the request, the skill being associated with the topic, wherein the accessing and generating of the first audio output data are performed at least partially in parallel;
  
  prior to all of the audio input data being completely received, sending the first audio output data to the electronic device;
  
  generating second audio output data that includes at least a second word based at least in part on the received information responsive to the request, wherein the second audio output data is generated at least partially in parallel with sending the first audio output data to the electronic device;
  
  and sending the second audio output data to the electronic device.
  - 4. The method of claim 3, further comprising:
    - generating third audio output data between the first audio output data and the second audio output data, wherein the first audio output data is further based on a first transitional word between the first audio output data and the third audio output data, and the second audio output data is further based on a second transitional word between the third audio output data and the second audio output data, the first audio output data being sent to the electronic device prior to the second audio output data being generated.
  - 5. The method of claim 4, wherein third audio output data is sent to the electronic device before the information responsive to the request is determined.
  - 6. The method of claim 3, wherein the first audio output data and the second audio output data are generated by evaluating a model cost and concatenation costs associated with a transition between the first audio output data and the second audio output data.
  - 7. The method of claim 3, wherein generating the first audio output data comprises:
    - determining potential units of speech to be included in the first audio output data;
      
      configuring different ordered combinations of the potential units of speech;
      
      evaluating a plurality of concatenation costs associated with a plurality of transitions between different potential units of speech;
      
      compiling a sum of the concatenation costs for the different ordered combinations; and
      
      selecting an ordered combination having a lowest concatenation cost based on the sum such that the first audio output data is generated based on the potential units of speech in the ordered combination that is selected.
  - 8. The method of claim 3, wherein generating the second audio output data further comprises:
    - generating the second audio output data after the information responsive to the request has been received.
  - 9. The method of claim 3, further comprising:
    - determining, prior to sending the first audio output data, a model cost associated with a smoothness of the first audio output data.
  - 10. The method of claim 3, further comprising:
    - sending the electronic device sequence information for playing the first audio output data prior to playing the second audio output data.
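The sequence information in claim 10 can be pictured with a small device-side sketch. The claims do not specify a format, so the integer `sequence` field, the `AudioSegment` type, and both helper functions below are assumptions made purely for illustration: each segment carries a playback index, and the device plays only the contiguous run it has received so far.

```python
from typing import List, NamedTuple


class AudioSegment(NamedTuple):
    sequence: int  # hypothetical playback index sent alongside the audio
    audio: bytes   # synthesized audio payload (fake bytes in this sketch)


def playback_order(received: List[AudioSegment]) -> List[AudioSegment]:
    """Sort segments by sequence number, regardless of arrival order."""
    return sorted(received, key=lambda seg: seg.sequence)


def playable_prefix(received: List[AudioSegment]) -> List[AudioSegment]:
    """Return the contiguous run of segments starting at sequence 0.

    Playback stops at the first gap, so a later segment that arrives
    early is held back until the segments before it are available.
    """
    prefix = []
    for expected, seg in enumerate(playback_order(received)):
        if seg.sequence != expected:
            break
        prefix.append(seg)
    return prefix
```

For example, if segments 0 and 2 have arrived, only segment 0 is playable; once segment 1 arrives, all three can be played in order.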

11. A system comprising:
- communications circuitry that receives, from an electronic device, audio input data representing a first series of words associated with a request; and
  
  at least one processor operable to:
  
  use natural language understanding (NLU) techniques on word data to determine a topic associated with the request;
  
  generate a first audio output data representing at least a first word, the first word being associated with the topic;
  
  communicate with an interface associated with a skill to determine information responsive to the request, wherein the generation of the first audio output data is performed at least partially in parallel to the communication with the interface;
  
  prior to all portions of the audio input data being received from the electronic device, initiate, the communications circuitry to send the first audio output data to the electronic device;
  
  generate, second audio output data, that includes at least a second word based at least in part on the received information responsive to the request, wherein the second audio output data is generated at least in partially in parallel with sending the first audio output data to the electronic device; and
  
  initiate, the communications circuitry to send the second audio output data to the electronic device.
  - 12. The system of claim 11, wherein generation of the first audio output data causes the at least one processor to be further operable to:
    - arrange potential units of speech in a plurality of different sequences;
      
      evaluate a cost associated with a plurality of transitions from one potential unit to the next;
      
      sum the costs associated with all transitions for a given different sequence;
      
      and select a sequence from the plurality based at least in part on which sequence had a lowest sum.
  - 13. The system of claim 11, wherein at least a majority of the first audio output data and the second audio output data are sent to the electronic device prior to all of the audio input data being received from the electronic device.
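The selection procedure recited in claims 7 and 12 (arrange candidate units in different sequences, sum the transition costs for each sequence, pick the lowest sum) can be sketched as follows. The unit representation, the pitch-mismatch cost, and the candidate values are all assumptions for illustration; the claims do not define how a concatenation cost is computed.

```python
from itertools import product

# Each candidate unit is (label, start_pitch_hz, end_pitch_hz).
# One inner list per position in the phrase; all values are made up.
candidates = [
    [("the_a", 100, 110), ("the_b", 100, 130)],
    [("weather_a", 128, 120), ("weather_b", 112, 140)],
    [("is_a", 121, 100), ("is_b", 150, 100)],
]


def join_cost(left, right):
    """Toy concatenation cost: pitch mismatch at the joint (an assumption)."""
    return abs(left[2] - right[1])


def sequence_cost(seq):
    """Sum the concatenation costs over every adjacent pair in the sequence."""
    return sum(join_cost(a, b) for a, b in zip(seq, seq[1:]))


def select_units(options):
    """Enumerate every ordered combination and return the cheapest one."""
    return min(product(*options), key=sequence_cost)


best = select_units(candidates)
print([unit[0] for unit in best], sequence_cost(best))
```

Exhaustive enumeration grows exponentially with the number of positions, so it is only practical for short phrases; unit-selection systems typically use dynamic programming (a Viterbi-style search) to find the same minimum more efficiently.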

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Barra Chicote, Roberto; Nadolski, Adam Franciszek
Primary Examiner(s)
Wozniak, James

Application Number

US14/974,872
Time in Patent Office

1,005 Days
Field of Search

704/260, 704/257, 704/270, 704/270.1, 704/275
CPC Class Codes

G10L 13/00   Speech synthesis; Text to s...

G10L 13/027   Concept to speech synthesis...

G10L 13/04   Details of speech synthesis...

G10L 15/18   using natural language mode...

G10L 15/22   Procedures used during a sp...

G10L 15/30   Distributed recognition, e....

G10L 15/32   Multiple recognisers used i...
