Reduced latency text-to-speech system

US 9,646,601 B1
Filed: 07/26/2013
Issued: 05/09/2017
Est. Priority Date: 07/26/2013
Status: Active Grant

Abstract

In delivering text-to-speech (TTS) results to a user, the time between the user request and delivery of initial TTS results is reduced using one or more of various techniques. Caching of TTS results may be reconfigured to cache unit indices rather than full speech synthesis results. More powerful computing resources may be dedicated to early TTS processing. A user may be notified of TTS results prior to complete processing of a TTS request. Early TTS processing may be performed by a local device and then passed to a remote device.
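The abstract's idea of caching unit indices rather than full synthesis results can be illustrated with a small sketch. Everything named here (`UnitIndexCache`, the toy `units` table, the placeholder byte strings) is a hypothetical stand-in and not taken from the patent; the sketch only shows that a cache entry can hold a short list of integers while audio is rebuilt on demand from a shared unit database.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class UnitIndexCache:
    """Maps previously synthesized text to speech unit indices, not audio."""

    unit_audio: dict[int, bytes]  # unit index -> recorded audio for that unit
    entries: dict[str, list[int]] = field(default_factory=dict)

    def store(self, text: str, unit_ids: list[int]) -> None:
        # Only the list of indices is cached, never the waveform itself.
        self.entries[text.lower()] = list(unit_ids)

    def lookup(self, text: str) -> bytes | None:
        unit_ids = self.entries.get(text.lower())
        if unit_ids is None:
            return None
        # Rebuild the audio by concatenating the stored units in order.
        return b"".join(self.unit_audio[i] for i in unit_ids)


# Toy unit database with placeholder bytes standing in for audio (assumption).
units = {0: b"\x01\x02", 1: b"\x03", 2: b"\x04\x05"}
cache = UnitIndexCache(unit_audio=units)
cache.store("Good morning", [2, 0, 1])
audio = cache.lookup("good morning")
```

A design note under the same assumptions: because entries hold only indices, the cache stays small, at the cost of one lookup and concatenation pass per hit.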

21 Claims

1. A method for reducing a time for delivery of initial results of text-to-speech (TTS) processing, comprising:
- receiving a TTS request including text for TTS processing, the text comprising a first portion of the text and a second portion of the text and wherein the first portion corresponds to a beginning of the text;
  
  matching the first portion of text to text of a previously stored text sample;
  
  retrieving speech unit identifiers associated with the previously stored text sample;
  
  identifying speech units associated with the speech unit identifiers;
  
  retrieving audio corresponding to the speech units;
  
  generating first audio data by synthesizing first speech corresponding to the first portion of the text using the retrieved audio;
  
  providing the first audio data;
  
  generating second audio data by synthesizing second speech corresponding to the second portion of the text using a unit selection TTS technique; and
  
  providing the second audio data.
  - 2. The method of claim 1, further comprising allocating increased computing resources to processing of the first portion of the text than to the second portion of the text.
  - 3. The method of claim 1, further comprising:
    - estimating a time for delivery of initial results corresponding to the TTS request to a user; and
      
      determining that the time for delivery exceeds a threshold, and wherein the matching is performed based at least in part on the determining.
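The steps of claims 1 and 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: `STORED_SAMPLES`, `UNIT_AUDIO`, `LATENCY_THRESHOLD_S`, the longest-prefix matching rule, and the `unit_selection` callback are all hypothetical, and the byte strings are placeholders for audio.

```python
from __future__ import annotations

from typing import Callable, Iterator

# Hypothetical stored text samples (lowercase) -> speech unit identifiers.
STORED_SAMPLES: dict[str, list[int]] = {
    "your order": [10, 11],
    "your order has": [10, 11, 12],
}
# Hypothetical unit database: identifier -> audio bytes (placeholders).
UNIT_AUDIO: dict[int, bytes] = {10: b"yo", 11: b"or", 12: b"ha"}
LATENCY_THRESHOLD_S = 0.2  # assumed threshold, not a value from the patent


def match_beginning(text: str) -> str | None:
    """Return the longest stored sample that the text begins with."""
    lowered = text.lower()  # assumes lowercasing keeps length (true for ASCII)
    matches = [
        s for s in STORED_SAMPLES
        # Require the match to end at a word boundary or at the end of text.
        if lowered.startswith(s) and lowered[len(s):len(s) + 1] in ("", " ")
    ]
    return max(matches, key=len) if matches else None


def synthesize(
    text: str,
    estimated_latency_s: float,
    unit_selection: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Yield first audio from stored units, then audio for the remainder."""
    sample = None
    # Claim 3: only attempt the match when estimated latency exceeds a threshold.
    if estimated_latency_s > LATENCY_THRESHOLD_S:
        sample = match_beginning(text)
    if sample is not None:
        unit_ids = STORED_SAMPLES[sample]
        yield b"".join(UNIT_AUDIO[i] for i in unit_ids)  # first audio data
        text = text[len(sample):].lstrip()
    if text:
        yield unit_selection(text)  # second audio data


def fake_unit_selection(text: str) -> bytes:
    # Stand-in for a real unit selection synthesizer.
    return b"<" + text.encode("utf-8") + b">"


chunks = list(synthesize("Your order has shipped", 0.5, fake_unit_selection))
```

Using a generator lets a caller start playback of the first chunk while the slower unit selection step for the remainder is still pending.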

4. A method comprising:
- receiving a text-to-speech (TTS) request including text for TTS processing;
  
  identifying a stored text sample corresponding to a first portion of the text;
  
  retrieving speech unit identifiers associated with the stored text sample;
  
  identifying speech units associated with the speech unit identifiers;
  
  retrieving audio corresponding to the speech units;
  
  synthesizing a first portion of speech based at least in part on the retrieved audio; and
  
  synthesizing a second portion of speech based at least in part on the text.
  - 5. The method of claim 4, wherein the second portion of speech is synthesized from the text for TTS processing with at least the first portion of text removed.
  - 6. The method of claim 4, wherein the stored text sample is identified from a plurality of candidate stored text samples, and wherein the method further comprises determining a plurality of candidate stored text samples based at least in part on an application originating the TTS request, a popularity of the stored text sample, a user originating the TTS request, an intended user of results for the TTS request, a time associated with the TTS request, or a location for delivery of TTS results.
  - 7. The method of claim 4, wherein the stored text sample is identified from a plurality of candidate stored text samples, wherein the candidate stored text samples are selected based at least in part on a frequency of requests to synthesize text corresponding to the candidate stored text samples.
  - 8. The method of claim 4, further comprising allocating increased computing resources to processing of the first portion of the text than to a second portion of the text.
  - 9. The method of claim 8, wherein the allocating is based at least in part an estimated latency for delivery of the first portion of speech to a user.
  - 10. The method of claim 4, wherein the synthesizing the first portion of speech is performed on a local device.
  - 11. The method of claim 4, wherein the synthesizing the second portion of speech is performed on a remote device.
  - 12. The method of claim 4, further comprising:
    - estimating a time for delivery of initial results corresponding to the TTS request to a user; and
      
      determining that the time for delivery exceeds a threshold, and wherein the identifying the stored text sample is performed based at least in part on the determining.
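Claims 6 and 7 describe choosing candidate stored text samples by factors such as the originating application and request frequency. One way this might look is sketched below. The `CandidateTracker` class, the fixed two-word opening phrase, and the application names are assumptions made for illustration; the claims do not specify how candidates are formed.

```python
from __future__ import annotations

from collections import Counter


class CandidateTracker:
    """Counts how often each opening phrase is requested, per application."""

    def __init__(self) -> None:
        # (application, opening phrase) -> number of requests seen
        self.counts: Counter[tuple[str, str]] = Counter()

    def record(self, application: str, text: str, words: int = 2) -> None:
        # Assumption: the opening phrase is the first `words` words, lowercased.
        opening = " ".join(text.lower().split()[:words])
        self.counts[(application, opening)] += 1

    def candidates(self, application: str, top_n: int = 2) -> list[str]:
        # Keep only phrases from this application, then rank by frequency.
        per_app = Counter(
            {phrase: n for (app, phrase), n in self.counts.items() if app == application}
        )
        return [phrase for phrase, _ in per_app.most_common(top_n)]


tracker = CandidateTracker()
for request in [
    "Your order has shipped",
    "Your order has arrived",
    "Your order was delayed",
    "Welcome back, Ana",
    "Welcome back, Ben",
    "Goodbye for now",
]:
    tracker.record("shop", request)
tracker.record("news", "Today's top story")

shop_candidates = tracker.candidates("shop")
```

The phrases returned by `candidates` would then be the ones worth storing together with their speech unit identifiers.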

13. A system comprising:
- at least one processor;
  
  a memory device including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor:
  
  to receive a text-to-speech (TTS) request including text for TTS processing;
  
  to identify a stored text sample corresponding to a first portion of the text;
  
  to retrieve speech unit identifiers associated with the stored text sample;
  
  to identify speech units associated with the speech unit identifiers;
  
  to retrieve audio corresponding to the speech units;
  
  to synthesize a first portion of speech based at least in part on the retrieved audio; and
  
  to synthesize a second portion of speech based at least in part on the text.
  - 14. The system of claim 13, wherein the at least one processor is further configured to synthesize the second portion of speech from the text for TTS processing with at least the first portion of text removed.
  - 15. The system of claim 13, wherein the at least one processor is further configured:
    - to identify the stored text sample from a plurality of candidate stored text samples; and
      
      to determine a plurality of candidate stored text samples based at least in part on an application originating the TTS request, a popularity of the stored text sample, a user originating the TTS request, an intended user of results for the TTS request, a time associated with the TTS request, or a location for delivery of TTS results.
  - 16. The system of claim 13, wherein the at least one processor is further configured to identify the stored text sample from a plurality of candidate stored text samples, wherein the candidate stored text samples are selected based at least in part on a frequency of requests to synthesize text corresponding to the candidate stored text samples.
  - 17. The system of claim 13, wherein the at least one processor is further configured to allocate increased computing resources to processing of the first portion of the text than to a second portion of the text.
  - 18. The system of claim 17, wherein the at least one processor is further configured to allocate based at least in part an estimated latency for delivery of the first portion of speech to a user.
  - 19. The system of claim 13, wherein the at least one processor is further configured to synthesize the first portion of speech on a local device.
  - 20. The system of claim 13, wherein the at least one processor is further configured to synthesize the second portion of speech on a remote device.
  - 21. The system of claim 13, wherein the at least one processor is further configured:
    - to estimate a time for delivery of initial results corresponding to the TTS request to a user; and
      
      to determine that the time for delivery exceeds a threshold, and to identify the stored text sample based at least in part on the determination.

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Swietlinski, Krzysztof Franciszek; Kaszczuk, Michal Tadeusz; Osowski, Lukasz Maciej; Jedrzejczak, Jacek Jerzy
Primary Examiner(s)
Hudspeth, David
Assistant Examiner(s)
Nguyen, Timothy

Application Number

US13/951,825
Time in Patent Office

1,383 Days
Field of Search

704/260
US Class Current
CPC Class Codes

G06F 40/10   Text processing natural lan...

G10L 13/02   Methods for producing synth...

G10L 13/04   Details of speech synthesis...
