Interactive debugging and tuning of methods for CTTS voice building
First Claim
1. A system for debugging and tuning synthesized audio, comprising:
means for receiving a user-supplied text with a visual user interface;
means for generating synthesized audio generated from concatenated phonetic units, the synthesized audio being a voice rendering of the user-supplied text;
means for displaying the waveform corresponding to synthesized audio generated from concatenated phonetic units;
means for displaying parameters corresponding to at least one of the phonetic units, the parameters including configuration parameters comprising at least one weight for adjusting at least one search cost function, the at least one weight comprising at least one of a pitch cost weight and a duration cost weight;
means for displaying an original recording containing a selected phonetic unit;
means for receiving an editing input from the user; and
means for adjusting the parameters in accordance with the editing input by adjusting and storing in a text-to-speech engine configuration file at least one configuration parameter, wherein adjusting includes repositioning a phonetic alignment marker;
means for highlighting in the display of the original recording at least one user-selected phonetic unit;
means for correcting elements of a text-to-speech segment dataset of parameters corresponding to a segment of the synthesized audio identified as being problematic;
means for generating a new synthesized waveform corresponding to one or more adjusted parameters; and
wherein the system continues to regenerate new synthesized waveforms until a desired synthesized output is generated.
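As an illustration of how a pitch cost weight and a duration cost weight could steer a search cost function, the following is a minimal sketch only. The claim does not specify the cost formula; the `Unit` fields, the relative-error costs, and the function names `target_cost` and `pick_unit` are all assumptions made for this example, not the patented engine's design.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A candidate phonetic unit (hypothetical fields for illustration)."""
    name: str
    pitch_hz: float
    duration_ms: float


def target_cost(unit, want_pitch_hz, want_duration_ms,
                pitch_weight, duration_weight):
    """Weighted sum of relative pitch and duration mismatches (assumed form)."""
    pitch_cost = abs(unit.pitch_hz - want_pitch_hz) / want_pitch_hz
    duration_cost = abs(unit.duration_ms - want_duration_ms) / want_duration_ms
    return pitch_weight * pitch_cost + duration_weight * duration_cost


def pick_unit(candidates, want_pitch_hz, want_duration_ms,
              pitch_weight, duration_weight):
    """Return the candidate with the lowest weighted cost."""
    return min(
        candidates,
        key=lambda u: target_cost(u, want_pitch_hz, want_duration_ms,
                                  pitch_weight, duration_weight),
    )


candidates = [
    Unit("a", pitch_hz=100.0, duration_ms=200.0),  # good duration, pitch off
    Unit("b", pitch_hz=120.0, duration_ms=100.0),  # good pitch, duration off
]

# Equal weights favor the unit with the smaller overall mismatch.
balanced = pick_unit(candidates, 120.0, 200.0,
                     pitch_weight=1.0, duration_weight=1.0)
# Raising the pitch weight shifts the choice toward the pitch-matched unit.
pitch_heavy = pick_unit(candidates, 120.0, 200.0,
                        pitch_weight=5.0, duration_weight=0.1)
```

Under these assumptions, changing only the weights changes which unit is selected, which is the kind of effect a user tuning the configuration parameters would listen for in the regenerated waveform.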
Abstract
A method, a system, and an apparatus for identifying and correcting sources of problems in synthesized speech which is generated using a concatenative text-to-speech (CTTS) technique. The method can include the step of displaying a waveform corresponding to synthesized speech generated from concatenated phonetic units. The synthesized speech can be generated from text input received from a user. The method further can include the step of displaying parameters corresponding to at least one of the phonetic units. The method can include the step of displaying the original recordings containing selected phonetic units. An editing input can be received from the user and the parameters can be adjusted in accordance with the editing input.
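One adjustment named in the claims is repositioning a phonetic alignment marker. The sketch below shows one plausible way such an edit could be applied to a list of boundary times. The data layout (a sorted list of times in seconds) and the function name `reposition_marker` are assumptions for illustration; the source does not describe how markers are stored.

```python
def reposition_marker(boundaries, index, new_time):
    """Return a copy of `boundaries` with one marker moved to `new_time`.

    `boundaries` is assumed to be a sorted list of marker times in seconds.
    The moved marker must stay strictly between its neighbors so that the
    phonetic units on either side keep a positive duration.
    """
    lower = boundaries[index - 1] if index > 0 else 0.0
    upper = boundaries[index + 1] if index + 1 < len(boundaries) else float("inf")
    if not lower < new_time < upper:
        raise ValueError(
            f"marker {index} must stay between {lower} and {upper}, got {new_time}"
        )
    updated = list(boundaries)  # leave the caller's list untouched
    updated[index] = new_time
    return updated


markers = [0.10, 0.25, 0.40]
moved = reposition_marker(markers, 1, 0.30)
```

Returning a new list rather than editing in place makes it simple to compare the synthesized output before and after the edit, or to discard the change.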
7 Claims
1. A system for debugging and tuning synthesized audio, comprising:
means for receiving a user-supplied text with a visual user interface;
means for generating synthesized audio generated from concatenated phonetic units, the synthesized audio being a voice rendering of the user-supplied text;
means for displaying the waveform corresponding to synthesized audio generated from concatenated phonetic units;
means for displaying parameters corresponding to at least one of the phonetic units, the parameters including configuration parameters comprising at least one weight for adjusting at least one search cost function, the at least one weight comprising at least one of a pitch cost weight and a duration cost weight;
means for displaying an original recording containing a selected phonetic unit;
means for receiving an editing input from the user; and
means for adjusting the parameters in accordance with the editing input by adjusting and storing in a text-to-speech engine configuration file at least one configuration parameter, wherein adjusting includes repositioning a phonetic alignment marker;
means for highlighting in the display of the original recording at least one user-selected phonetic unit;
means for correcting elements of a text-to-speech segment dataset of parameters corresponding to a segment of the synthesized audio identified as being problematic;
means for generating a new synthesized waveform corresponding to one or more adjusted parameters; and
wherein the system continues to regenerate new synthesized waveforms until a desired synthesized output is generated.
2. A machine-readable storage having stored thereon a computer program having a plurality of code sections, the code sections executable by a machine for causing the machine to perform the steps of:
(a) receiving a user-supplied text with a visual user interface;
(b) generating synthesized audio generated from concatenated phonetic units, the synthesized audio being a voice rendering of the user-supplied text;
(c) displaying a waveform corresponding to the synthesized audio generated from concatenated phonetic units;
(d) displaying parameters corresponding to at least one of the phonetic units, the parameters including configuration parameters comprising at least one weight for adjusting at least one search cost function, the at least one weight comprising at least one of a pitch cost weight and a duration cost weight;
(e) displaying an original recording containing a selected phonetic unit;
(f) receiving an editing input from the user;
(g) adjusting at least one configuration parameter in accordance with the editing input and storing the at least one configuration parameter in a text-to-speech engine configuration file, wherein adjusting includes repositioning a phonetic alignment marker;
(h) highlighting in the display of the original recording at least one user-selected phonetic unit;
(i) correcting elements of a text-to-speech segment dataset of parameters corresponding to a segment of the synthesized audio identified as being problematic;
(j) generating a new synthesized waveform corresponding to one or more adjusted parameters; and
(k) repeating steps (b)-(j) until a desired synthesized output is generated.
(Dependent claims: 3, 4, 5, 6, 7)
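The repeat-until-acceptable flow of steps (b) through (k) can be sketched as a simple loop. This is a hedged outline only: the JSON configuration format, the function name `tune`, the `synthesize` and `get_edit` callables, and the `max_rounds` limit are all hypothetical stand-ins, since the claim does not define a file format or programming interface.

```python
import json
from pathlib import Path


def tune(text, config_path, synthesize, get_edit, max_rounds=10):
    """Regenerate audio until the user accepts it (illustrative sketch).

    synthesize(text, config) -> waveform      # hypothetical engine call
    get_edit(waveform, config) -> dict | None # hypothetical UI callback;
                                              # None means "output accepted"
    """
    config = json.loads(Path(config_path).read_text())
    for _ in range(max_rounds):
        waveform = synthesize(text, config)   # steps (b) and (j)
        edit = get_edit(waveform, config)     # steps (c) through (f)
        if edit is None:
            return waveform, config           # desired output reached, step (k)
        config.update(edit)                   # step (g): adjust parameters
        # Step (g) also stores the adjusted parameters in the engine
        # configuration file (assumed here to be JSON).
        Path(config_path).write_text(json.dumps(config, indent=2))
    raise RuntimeError("no acceptable output within max_rounds")
```

The `max_rounds` guard is a design choice of this sketch, added so that an automated caller cannot loop forever; an interactive tool would more likely let the user decide when to stop.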
Specification