System and method for distributed voice models across cloud and device for embedded text-to-speech
Abstract
Systems, methods, and computer-readable storage media for intelligent caching of concatenative speech units for use in speech synthesis. A system configured to practice the method can identify a speech synthesis context, and determine, based on a local cache of text-to-speech units for a text-to-speech voice and based on the speech synthesis context, additional text-to-speech units which are not in the local cache. The system can request from a server the additional text-to-speech units, and store the additional text-to-speech units in the local cache. The system can then synthesize speech using the text-to-speech units and the additional text-to-speech units in the local cache. The system can prune the cache as the context changes, based on availability of local storage, or after synthesizing the speech. The local cache can store a core set of text-to-speech units associated with the text-to-speech voice that cannot be pruned from the local cache.
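The abstract says the cache keeps a core set of units that can never be pruned, and that other units can be pruned when the context changes, when local storage runs low, or after synthesis. It does not name an eviction policy. The sketch below assumes least-recently-used (LRU) eviction, string unit IDs, and raw `bytes` audio. `UnitCache`, `store`, `prune`, and `capacity` are hypothetical names used only for illustration, not the patent's implementation.

```python
from __future__ import annotations

from collections import OrderedDict


class UnitCache:
    """Hypothetical local cache of TTS units with a pinned core set.

    Assumptions: unit IDs are strings, unit audio is bytes, and non-core
    units are evicted least-recently-used first (LRU is an assumption).
    """

    def __init__(self, core_units: dict[str, bytes], capacity: int) -> None:
        # Core units are pinned: prune() never evicts them.
        self.core = dict(core_units)
        # Non-core units, ordered from least to most recently used.
        self.extra: OrderedDict[str, bytes] = OrderedDict()
        # Maximum number of non-core units kept locally.
        self.capacity = capacity

    def __contains__(self, unit_id: str) -> bool:
        return unit_id in self.core or unit_id in self.extra

    def get(self, unit_id: str) -> bytes:
        if unit_id in self.core:
            return self.core[unit_id]
        # Mark the unit as recently used (raises KeyError if absent).
        self.extra.move_to_end(unit_id)
        return self.extra[unit_id]

    def store(self, unit_id: str, audio: bytes) -> None:
        if unit_id in self.core:
            return  # already pinned locally
        self.extra[unit_id] = audio
        self.extra.move_to_end(unit_id)
        # Prune when local storage (modeled as a unit count) is exceeded.
        self.prune(self.capacity)

    def prune(self, limit: int, keep: set[str] | None = None) -> None:
        """Evict LRU non-core units until at most `limit` remain or only
        units in `keep` (e.g. those the current context needs) are left."""
        keep = keep or set()
        for unit_id in list(self.extra):
            if len(self.extra) <= limit:
                break
            if unit_id not in keep:
                del self.extra[unit_id]
```

For example, with `capacity=2`, storing units `"b"`, `"c"`, and `"d"` evicts `"b"`. Calling `prune(0)` after synthesis drops every non-core unit while the core set stays available for the next request.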
20 Claims
1. A method comprising:
identifying, via a processor, a speech synthesis context;
determining, based on a local cache of text-to-speech units for a text-to-speech voice and based on the speech synthesis context, additional text-to-speech units which are not in the local cache;
requesting from a server the additional text-to-speech units;
receiving the additional text-to-speech units from the server; and
synthesizing speech using the text-to-speech units and the additional text-to-speech units.

8. A system comprising:
a processor; and
a computer-readable medium having instructions which, when executed by the processor, cause the processor to perform operations comprising:
identifying a speech synthesis context;
determining, based on a local cache of text-to-speech units for a text-to-speech voice and based on the speech synthesis context, additional text-to-speech units which are not in the local cache;
requesting from a server the additional text-to-speech units;
storing the additional text-to-speech units in the local cache; and
synthesizing speech using the text-to-speech units and the additional text-to-speech units in the local cache.

15. A non-transitory computer-readable storage medium storing instructions which cause a processor to perform operations comprising:
identifying, via a processor, a speech synthesis context;
determining, based on a local cache of text-to-speech units for a text-to-speech voice and based on the speech synthesis context, additional text-to-speech units which are not in the local cache;
requesting from a server the additional text-to-speech units;
storing, in a storage device, the additional text-to-speech units in the local cache; and
synthesizing speech using the text-to-speech units and the additional text-to-speech units in the local cache.
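The claimed steps can be sketched end to end as follows. The claims do not define how a "speech synthesis context" maps to units. This sketch therefore assumes the context is simply the input text, and that its units are adjacent letter pairs (a stand-in for phonetic units such as diphones). `units_for_context`, `FakeServer`, and `synthesize` are hypothetical names. "Synthesis" here is plain concatenation with no join smoothing.

```python
from __future__ import annotations


def units_for_context(text: str) -> list[str]:
    """Hypothetical context-to-unit mapping: one unit per adjacent
    character pair, with "_" marking word edges."""
    s = "_" + text.lower() + "_"
    return [s[i:i + 2] for i in range(len(s) - 1)]


class FakeServer:
    """Stand-in for the remote voice server holding the full inventory."""

    def __init__(self, inventory: dict[str, bytes]) -> None:
        self.inventory = inventory
        self.requests = 0  # number of fetch round trips made

    def fetch(self, unit_ids: list[str]) -> dict[str, bytes]:
        self.requests += 1
        return {u: self.inventory[u] for u in unit_ids}


def synthesize(text: str, local_cache: dict[str, bytes],
               server: FakeServer) -> bytes:
    # Identify the speech synthesis context (assumed: the input text).
    needed = units_for_context(text)
    # Determine the required units that are not in the local cache.
    missing = sorted(set(needed).difference(local_cache))
    if missing:
        # Request the missing units from the server and receive them.
        received = server.fetch(missing)
        # Store them in the local cache, as in claims 8 and 15.
        local_cache.update(received)
    # Synthesize by concatenating the cached units in order.
    return b"".join(local_cache[u] for u in needed)
```

Because the fetched units are stored locally, a second request for the same context is served entirely from the cache without another server round trip.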
Specification