System and method for distributed voice models across cloud and device for embedded text-to-speech

US 9,218,804 B2
Filed: 09/12/2013
Issued: 12/22/2015
Est. Priority Date: 09/12/2013
Status: Expired due to Fees

1 Assignment

0 Petitions

Abstract

Systems, methods, and computer-readable storage media for intelligent caching of concatenative speech units for use in speech synthesis. A system configured to practice the method can identify a speech synthesis context, and determine, based on a local cache of text-to-speech units for a text-to-speech voice and based on the speech synthesis context, additional text-to-speech units which are not in the local cache. The system can request from a server the additional text-to-speech units, and store the additional text-to-speech units in the local cache. The system can then synthesize speech using the text-to-speech units and the additional text-to-speech units in the local cache. The system can prune the cache as the context changes, based on availability of local storage, or after synthesizing the speech. The local cache can store a core set of text-to-speech units associated with the text-to-speech voice that cannot be pruned from the local cache.
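
The caching scheme in the abstract can be sketched in Python as below. This is a minimal illustration resting on assumptions the patent does not make. Units are keyed by hypothetical string IDs (for example diphone labels such as `h-e`), audio is raw bytes, the server is any callable `fetch` that returns the requested units, and synthesis is plain byte concatenation. `UnitCache` and `synthesize` are hypothetical names. Of the pruning triggers the abstract lists, only pruning after synthesis is shown, and the core set is never pruned.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

# Hypothetical representations: a unit ID is a string such as a diphone
# label ("h-e") and a unit's audio is raw bytes. The patent fixes neither.
UnitId = str
Audio = bytes


@dataclass
class UnitCache:
    """Local cache of TTS units for one voice with an unprunable core set."""
    core: Dict[UnitId, Audio]
    extra: Dict[UnitId, Audio] = field(default_factory=dict)

    def get(self, unit: UnitId) -> Audio:
        return self.core[unit] if unit in self.core else self.extra[unit]

    def missing(self, needed: Iterable[UnitId]) -> List[UnitId]:
        # Units the context needs that are neither core nor already downloaded,
        # deduplicated while preserving first-use order.
        out: List[UnitId] = []
        for unit in needed:
            if unit not in self.core and unit not in self.extra and unit not in out:
                out.append(unit)
        return out

    def store(self, units: Dict[UnitId, Audio]) -> None:
        self.extra.update(units)

    def prune(self) -> None:
        # Drop only downloaded units; the core set always stays resident.
        self.extra.clear()


def synthesize(cache: UnitCache, needed: List[UnitId],
               fetch: Callable[[List[UnitId]], Dict[UnitId, Audio]]) -> Audio:
    """Fill cache gaps from the server, concatenate units, then prune."""
    missing = cache.missing(needed)
    if missing:
        cache.store(fetch(missing))  # request the additional units
    # Naive concatenation; real concatenative TTS selects units and smooths joins.
    speech = b"".join(cache.get(unit) for unit in needed)
    cache.prune()
    return speech
```

Keeping the core set and the downloaded units in separate dictionaries makes the "cannot be pruned" rule a structural property: `prune` simply never touches `core`.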

20 Claims

1. A method comprising:
- identifying, via a processor, a speech synthesis context;
- determining, based on a local cache of text-to-speech units for a text-to-speech voice and based on the speech synthesis context, additional text-to-speech units which are not in the local cache;
- requesting from a server the additional text-to-speech units;
- receiving the additional text-to-speech units from the server; and
- synthesizing speech using the text-to-speech units and the additional text-to-speech units.

Dependent Claims (2, 3, 4, 5, 6, 7)
- 2. The method of claim 1, further comprising:
  - storing the additional text-to-speech units in the local cache; and
  - pruning the local cache after synthesizing the speech.
- 3. The method of claim 2, wherein the local cache stores a core set of text-to-speech units associated with the text-to-speech voice that cannot be pruned from the local cache.
- 4. The method of claim 1, wherein identifying the speech synthesis context comprises:
  - receiving a request to synthesize speech.
- 5. The method of claim 1, further comprising:
  - determining parameters relating to speech synthesis; and
  - determining, based on the parameters, how many additional text-to-speech units to request.
- 6. The method of claim 1, wherein the local cache of text-to-speech units comprises speech snippets for use in concatenative synthesis.
- 7. The method of claim 1, further comprising:
  - beginning to synthesize speech using only the local cache of text-to-speech units before receiving the additional text-to-speech units; and
  - continuing to synthesize speech using the local cache of text-to-speech units and the additional text-to-speech units as the additional text-to-speech units are received and stored in the local cache.
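
Claim 7 describes starting synthesis from the local cache before the server's units have arrived. A minimal Python sketch follows, assuming the server response is modeled as a hypothetical iterator of `(unit_id, audio)` pairs and that each unit's audio can be emitted as soon as it is cached. `progressive_synthesis` is an illustrative name, not taken from the patent.

```python
from typing import Dict, Iterator, List, Tuple

UnitId = str
Audio = bytes


def progressive_synthesis(
    needed: List[UnitId],
    local: Dict[UnitId, Audio],
    arrivals: Iterator[Tuple[UnitId, Audio]],
) -> Iterator[Audio]:
    """Yield audio for `needed` in order, starting before all downloads finish.

    `arrivals` stands in for the server stream (hypothetical); each unit is
    stored in `local` as soon as it is received.
    """
    for unit in needed:
        # Consume the stream only when the next unit is not yet cached.
        while unit not in local:
            try:
                uid, audio = next(arrivals)
            except StopIteration:
                raise KeyError(f"server never delivered unit {unit!r}") from None
            local[uid] = audio  # store the received unit in the local cache
        yield local[unit]
```

Because the function is a generator, a caller can begin playback of the first chunks while later units are still in flight.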

8. A system comprising:
- a processor; and
- a computer-readable medium having instructions which, when executed by the processor, cause the processor to perform operations comprising:
  - identifying a speech synthesis context;
  - determining, based on a local cache of text-to-speech units for a text-to-speech voice and based on the speech synthesis context, additional text-to-speech units which are not in the local cache;
  - requesting from a server the additional text-to-speech units;
  - storing the additional text-to-speech units in the local cache; and
  - synthesizing speech using the text-to-speech units and the additional text-to-speech units in the local cache.

Dependent Claims (9, 10, 11, 12, 13, 14)
- 9. The system of claim 8, wherein the computer-readable medium stores further instructions which result in further operations comprising:
  - pruning the local cache after synthesizing the speech.
- 10. The system of claim 9, wherein the local cache stores a core set of text-to-speech units associated with the text-to-speech voice that cannot be pruned from the local cache.
- 11. The system of claim 8, wherein identifying the speech synthesis context comprises:
  - receiving a request to synthesize speech.
- 12. The system of claim 8, wherein the computer-readable medium stores further instructions which result in further operations comprising:
  - determining parameters relating to speech synthesis; and
  - determining, based on the parameters, how many additional text-to-speech units to request.
- 13. The system of claim 8, wherein the local cache of text-to-speech units comprises speech snippets for use in concatenative synthesis.
- 14. The system of claim 8, wherein the computer-readable medium stores further instructions which result in further operations comprising:
  - beginning to synthesize speech using only the local cache of text-to-speech units before receiving the additional text-to-speech units; and
  - continuing to synthesize speech using the local cache of text-to-speech units and the additional text-to-speech units as the additional text-to-speech units are received and stored in the local cache.
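
Claims 5, 12, and 19 leave open which parameters decide how many additional units to request. The sketch below assumes, purely for illustration, that the parameters are free local storage, average unit size, network bandwidth, and a latency budget, and it caps the request by whichever limit is tightest. `SynthesisParams` and `units_to_request` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SynthesisParams:
    # Hypothetical examples of "parameters relating to speech synthesis".
    free_storage_bytes: int
    avg_unit_bytes: int
    bandwidth_bytes_per_s: float
    latency_budget_s: float


def units_to_request(params: SynthesisParams, missing_count: int) -> int:
    """Cap the request by what fits locally and what can arrive in time."""
    if params.avg_unit_bytes <= 0:
        raise ValueError("avg_unit_bytes must be positive")
    # How many units the free storage can hold.
    fit_in_storage = params.free_storage_bytes // params.avg_unit_bytes
    # How many units can be downloaded within the latency budget.
    fit_in_time = int(
        params.bandwidth_bytes_per_s * params.latency_budget_s // params.avg_unit_bytes
    )
    return max(0, min(missing_count, fit_in_storage, fit_in_time))
```

For example, with 1,000,000 bytes free, 2,000-byte units, 50,000 bytes/s of bandwidth, and a 0.5 s budget, only 12 of 40 missing units would be requested, since the time limit (12 units) is tighter than the storage limit (500 units).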

15. A non-transitory computer-readable storage medium storing instructions which cause a processor to perform operations comprising:
- identifying, via a processor, a speech synthesis context;
- determining, based on a local cache of text-to-speech units for a text-to-speech voice and based on the speech synthesis context, additional text-to-speech units which are not in the local cache;
- requesting from a server the additional text-to-speech units;
- storing, in a storage device, the additional text-to-speech units in the local cache; and
- synthesizing speech using the text-to-speech units and the additional text-to-speech units in the local cache.

Dependent Claims (16, 17, 18, 19, 20)
- 16. The computer-readable storage medium of claim 15, wherein further instructions are stored which cause the processor to perform further operations comprising:
  - pruning the local cache after synthesizing the speech.
- 17. The computer-readable storage medium of claim 16, wherein the local cache stores a core set of text-to-speech units associated with the text-to-speech voice that cannot be pruned from the local cache.
- 18. The computer-readable storage medium of claim 15, wherein identifying the speech synthesis context comprises:
  - receiving a request to synthesize speech.
- 19. The computer-readable storage medium of claim 15, wherein further instructions are stored which cause the processor to perform further operations comprising:
  - determining parameters relating to speech synthesis; and
  - determining, based on the parameters, how many additional text-to-speech units to request.
- 20. The computer-readable storage medium of claim 15, wherein the local cache of text-to-speech units comprises speech snippets for use in concatenative synthesis.
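
Claims 18 and 20 tie the synthesis context to a request to synthesize speech and describe the cached units as speech snippets for concatenative synthesis. One common choice of snippet is the diphone, so a request can be mapped to the unit IDs it needs roughly as sketched below. The toy lexicon, the `sil` silence symbol, and `diphones_for_request` are assumptions for illustration; the patent does not specify a unit inventory.

```python
from typing import Dict, List

# Toy pronunciation lexicon (hypothetical); real systems use a full
# lexicon plus letter-to-sound rules for unknown words.
LEXICON: Dict[str, List[str]] = {
    "hello": ["h", "e", "l", "o"],
    "world": ["w", "er", "l", "d"],
}


def diphones_for_request(text: str, lexicon: Dict[str, List[str]] = LEXICON) -> List[str]:
    """Map a synthesis request to the diphone units concatenation would need.

    Raises KeyError for words missing from the lexicon.
    """
    phones = ["sil"]  # leading silence
    for word in text.lower().split():
        phones.extend(lexicon[word])
    phones.append("sil")  # trailing silence
    # A diphone spans from the middle of one phone to the middle of the next.
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]
```

The resulting list is what a device would compare against its local cache to decide which additional units to request from the server.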

Current Assignee
AT&T Intellectual Property I LP (AT&T, Inc.)
Original Assignee
AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors
Stern, Benjamin J., Beutnagel, Mark Charles, Conkie, Alistair D., Schroeter, Horst J., Stent, Amanda Joy
Primary Examiner(s)
MCFADDEN, SUSAN IRIS

Application Number
US14/025,344
Publication Number
US 20150073805A1
Time in Patent Office
831 Days
Field of Search
704/260
US Class Current
1/1
CPC Class Codes
G10L 13/04   Details of speech synthesis...
G10L 13/047   Architecture of speech synt...
G10L 13/07   Concatenation rules
