Systems and methods for selecting from multiple phonetic transcriptions for text-to-speech synthesis
Abstract
A system and method for generating synthetic speech operate in a computer-implemented Text-To-Speech (TTS) system. The system comprises at least a speaker database previously created from user recordings, a Front-End system that receives an input text, and a TTS engine. The Front-End system generates multiple phonetic transcriptions for each word of the input text, and the TTS engine uses a cost function to select the phonetic transcription that is most appropriate for searching the speaker database for the speech segments to be concatenated and synthesized.
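The selection step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent text: the `Segment` type, the pitch-difference `join_cost`, the dictionary-shaped speaker database, and the example pronunciations. The sketch scores each candidate transcription by the cheapest way of joining stored segments for its phones (a simple dynamic-programming search) and picks the transcription with the lowest score.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stored speech segment: a phone label and one pitch value (Hz)."""
    phone: str
    pitch: float


def join_cost(left: Segment, right: Segment) -> float:
    # Assumed cost: pitch mismatch at the join. A real system would likely
    # combine several acoustic features; this is only a stand-in.
    return abs(left.pitch - right.pitch)


def concatenative_cost(transcription, database) -> float:
    """Lowest total join cost over all segment sequences for this transcription.

    Returns math.inf if the transcription is empty or any phone has no
    stored segment in the database.
    """
    costs = None      # costs[i]: best cost of a path ending at prev_cands[i]
    prev_cands = []
    for phone in transcription:
        cands = database.get(phone, [])
        if not cands:
            return math.inf
        if costs is None:
            costs = [0.0] * len(cands)
        else:
            costs = [
                min(c + join_cost(p, s) for c, p in zip(costs, prev_cands))
                for s in cands
            ]
        prev_cands = cands
    return min(costs) if costs else math.inf


def select_preferred(transcriptions, database):
    """Pick the transcription whose concatenative cost score is lowest."""
    return min(transcriptions, key=lambda t: concatenative_cost(t, database))


# Hypothetical speaker database and two whole-word pronunciations of one word.
database = {
    "iy": [Segment("iy", 120.0)],
    "ay": [Segment("ay", 180.0)],
    "dh": [Segment("dh", 125.0), Segment("dh", 150.0)],
    "er": [Segment("er", 130.0)],
}
candidates = [("iy", "dh", "er"), ("ay", "dh", "er")]
preferred = select_preferred(candidates, database)
```

With this toy database the first pronunciation is selected, because its segments join with a smaller pitch mismatch. The claims only require that the selection be based at least in part on the concatenative cost score, so a fuller implementation could weigh that score alongside other criteria.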
19 Claims
1. At least one computer readable storage device storing instructions that, when executed on at least one processor, perform a method of selecting a preferred phonetic transcription for use in text-to-speech synthesizing an input text, the method comprising:
generating a plurality of phonetic transcriptions for at least one word of the input text to be synthesized, each of the plurality of phonetic transcriptions corresponding to a respective pronunciation that is of the at least one word as a whole, and is different from at least one other pronunciation corresponding to at least one other of the plurality of phonetic transcriptions;
computing at least one concatenative cost score for each one of the plurality of phonetic transcriptions to create a plurality of concatenative cost scores, the at least one concatenative cost score for each one of the plurality of phonetic transcriptions indicating at least one cost of concatenating selected speech segments from a plurality of stored speech segments associated with the respective one of the plurality of phonetic transcriptions; and
selecting the preferred phonetic transcription from the plurality of phonetic transcriptions for use in text-to-speech synthesizing the at least one word based, at least in part, on the at least one concatenative cost score associated with the preferred phonetic transcription.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system for selecting a preferred phonetic transcription for use in synthesizing speech from an input text, the system comprising:
at least one storage medium storing a plurality of speech segments that may be concatenated to synthesize speech;
at least one input to receive the input text; and
at least one computer coupled to the at least one input and capable of accessing the at least one storage medium, the at least one computer programmed to:
generate a plurality of phonetic transcriptions for at least one word of the input text to be synthesized, each of the plurality of phonetic transcriptions corresponding to a respective pronunciation that is of the at least one word as a whole, and is different from at least one other pronunciation corresponding to at least one other of the plurality of phonetic transcriptions;
compute at least one concatenative cost score for each one of the plurality of phonetic transcriptions to create a plurality of concatenative cost scores, the at least one concatenative cost score for each one of the plurality of phonetic transcriptions indicating at least one cost of concatenating selected speech segments from the stored plurality of speech segments associated with the respective one of the plurality of phonetic transcriptions; and
select the preferred phonetic transcription from the plurality of phonetic transcriptions for use in text-to-speech synthesizing the at least one word based, at least in part, on the at least one concatenative cost score associated with the preferred phonetic transcription.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19
Specification