System and method for generating customized text-to-speech voices
First Claim
1. A method comprising:
when a user is determined to be new, changing a front-end which converts text into linguistic tokens and tags the linguistic tokens with a prosody, a pronunciation, a speech act, and an emotion of the text, to yield a user-specific front-end;
in response to a user request from the user to generate a custom text-to-speech voice for a vehicle, the custom text-to-speech voice being associated with a domain, automatically:
collecting, via a processor, text data associated with the domain from a pre-existing text data source, to yield collected text data;
selecting, based on the user specific front end, synthesis speech units specific to the domain from a pre-existing inventory of synthesis speech units using the collected text data;
caching, via the processor, the synthesis speech units specific to the domain as an in-domain inventory of synthesis speech units; and
generating, via the processor, the custom text-to-speech voice for a specific task in the domain utilizing the in-domain inventory of synthesis speech units.
Abstract
A system and method are disclosed for generating customized text-to-speech voices for a particular application. The method comprises generating a custom text-to-speech voice by selecting a voice for generating a custom text-to-speech voice associated with a domain, collecting text data associated with the domain from a pre-existing text data source and using the collected text data, generating an in-domain inventory of synthesis speech units by selecting speech units appropriate to the domain via a search of a pre-existing inventory of synthesis speech units, or by recording the minimal inventory for a selected level of synthesis quality. The text-to-speech custom voice for the domain is generated utilizing the in-domain inventory of synthesis speech units. Active learning techniques may also be employed to identify problem phrases wherein only a few minutes of recorded data is necessary to deliver a high quality TTS custom voice.
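The selection step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes word-level units, a toy tokenizer in place of the tagging front-end, and a simple frequency threshold as the selection criterion. The names `SpeechUnit`, `tokenize`, `select_in_domain_units`, `missing_labels`, and `synthesize` are hypothetical and do not come from the patent.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechUnit:
    """One entry in a hypothetical pre-existing inventory of synthesis units."""
    label: str     # assumed here to be a word; real systems use smaller units
    audio_id: str  # placeholder reference to recorded audio


def tokenize(text: str) -> list[str]:
    """Toy front-end: lowercase word tokens, without prosody or emotion tags."""
    tokens = (word.strip(".,!?;:").lower() for word in text.split())
    return [tok for tok in tokens if tok]


def select_in_domain_units(domain_texts, inventory, min_count=1):
    """Keep inventory units whose label occurs in the collected domain text
    at least min_count times. The result is the cached in-domain inventory."""
    counts = Counter(tok for text in domain_texts for tok in tokenize(text))
    return {
        label: unit
        for label, unit in inventory.items()
        if counts[label] >= min_count
    }


def missing_labels(domain_texts, inventory):
    """Domain tokens with no unit in the inventory: candidates for recording."""
    seen = {tok for text in domain_texts for tok in tokenize(text)}
    return sorted(seen - set(inventory))


def synthesize(text, in_domain):
    """Return audio ids to concatenate; tokens without a unit map to None."""
    return [
        in_domain[tok].audio_id if tok in in_domain else None
        for tok in tokenize(text)
    ]


# Illustrative data only: a tiny general inventory and a navigation domain.
inventory = {
    word: SpeechUnit(word, f"rec/{word}.wav")
    for word in ["turn", "left", "right", "in", "miles", "weather"]
}
domain_texts = ["Turn left in two miles.", "Turn right."]

in_domain = select_in_domain_units(domain_texts, inventory)
to_record = missing_labels(domain_texts, inventory)
audio_plan = synthesize("Turn left", in_domain)
```

In this sketch `to_record` plays the role of the "minimal inventory" that would have to be recorded when the pre-existing inventory does not cover the domain, while `in_domain` stands in for the cached in-domain inventory used at synthesis time.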
20 Claims
1. A method comprising:
when a user is determined to be new, changing a front-end which converts text into linguistic tokens and tags the linguistic tokens with a prosody, a pronunciation, a speech act, and an emotion of the text, to yield a user-specific front-end;
in response to a user request from the user to generate a custom text-to-speech voice for a vehicle, the custom text-to-speech voice being associated with a domain, automatically:
collecting, via a processor, text data associated with the domain from a pre-existing text data source, to yield collected text data;
selecting, based on the user specific front end, synthesis speech units specific to the domain from a pre-existing inventory of synthesis speech units using the collected text data;
caching, via the processor, the synthesis speech units specific to the domain as an in-domain inventory of synthesis speech units; and
generating, via the processor, the custom text-to-speech voice for a specific task in the domain utilizing the in-domain inventory of synthesis speech units.
Dependent claims: 2 through 18.
19. A system comprising:
a processor; and
a computer readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
when a user is determined to be new, changing a front-end which converts text into linguistic tokens and tags the linguistic tokens with a prosody, a pronunciation, a speech act, and an emotion of the text, to yield a user-specific front-end;
in response to a user request from the user to generate a custom text-to-speech voice for a vehicle, the custom text-to-speech voice being associated with a domain, automatically:
collecting, via a processor, text data associated with the domain from a pre-existing text data source, to yield collected text data;
selecting, based on the user specific front end, synthesis speech units specific to the domain from a pre-existing inventory of synthesis speech units using the collected text data;
caching, via the processor, the synthesis speech units specific to the domain as an in-domain inventory of synthesis speech units; and
generating, via the processor, the custom text-to-speech voice for a specific task in the domain utilizing the in-domain inventory of synthesis speech units.
20. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising:
when a user is determined to be new, changing a front-end which converts text into linguistic tokens and tags the linguistic tokens with a prosody, a pronunciation, a speech act, and an emotion of the text, to yield a user-specific front-end;
in response to a user request from the user to generate a custom text-to-speech voice for a vehicle, the custom text-to-speech voice being associated with a domain, automatically:
collecting, via the computing device, text data associated with the domain from a pre-existing text data source, to yield collected text data;
selecting, based on the user specific front end, synthesis speech units specific to the domain from a pre-existing inventory of synthesis speech units using the collected text data;
caching, via the computing device, the synthesis speech units specific to the domain as an in-domain inventory of synthesis speech units; and
generating, via the computing device, the custom text-to-speech voice for a specific task in the domain utilizing the in-domain inventory of synthesis speech units.
Specification