System and method for correcting errors when generating a TTS voice
Abstract
Disclosed herein are various innovations associated with a toolkit for generating a text-to-speech (TTS) voice for use in a spoken dialog system. In each case, the invention may take the form of a system, a computer-readable medium, or a method for generating the TTS voice. An embodiment of the invention relates to a method of enabling human workers to find errors when developing a TTS voice. The method comprises: presenting a graphical user interface in which, after a first pass of automatic speech recognition (ASR) of a speech corpus is complete, the interface presents to a worker a graphical representation of an alignment of the ASR results, the associated words and phonemes, and the audio; receiving a graphical input from the worker associated with a selection of a word or phoneme; and presenting the audio associated with the selected word or phoneme.
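The alignment and selection steps described in the abstract can be illustrated with a minimal sketch. It assumes each phoneme from the ASR pass carries start and end times in seconds, and that the recording is available as a flat sequence of samples. The names `PhonemeSegment`, `WordSegment`, `sample_range`, and `select_audio` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


@dataclass(frozen=True)
class PhonemeSegment:
    """One aligned phoneme (hypothetical structure, not from the patent)."""
    label: str      # phoneme symbol
    start_s: float  # start time within the recording, in seconds
    end_s: float    # end time within the recording, in seconds


@dataclass(frozen=True)
class WordSegment:
    """One aligned word, made up of its aligned phonemes."""
    text: str
    phonemes: Tuple[PhonemeSegment, ...]

    @property
    def start_s(self) -> float:
        return self.phonemes[0].start_s

    @property
    def end_s(self) -> float:
        return self.phonemes[-1].end_s


def sample_range(start_s: float, end_s: float, sample_rate_hz: int) -> Tuple[int, int]:
    """Convert a time span in seconds to a half-open range of sample indices."""
    return round(start_s * sample_rate_hz), round(end_s * sample_rate_hz)


def select_audio(
    samples: Sequence[int],
    selection: Union[WordSegment, PhonemeSegment],
    sample_rate_hz: int,
) -> Sequence[int]:
    """Return the samples covered by the selected word or phoneme."""
    first, last = sample_range(selection.start_s, selection.end_s, sample_rate_hz)
    return samples[first:last]
```

In a real tool, the returned slice would be handed to an audio playback routine when the worker clicks a word or phoneme. Playback is omitted here because the source does not describe the audio backend.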
18 Claims
1. A method of enabling human workers to find errors when developing a text-to-speech (TTS) voice, the method comprising:
presenting via a processor a graphical user interface, wherein after a first pass of automatic speech recognition (ASR) of a speech corpus is complete, the interface presents to a worker a graphical representation of an alignment of the ASR results, associated words and phonemes and the audio;
color-coding via the processor each word based on a composition of the color-coding associated with each phoneme;
receiving via the processor a graphical input from the worker associated with a selection of a word or phoneme; and
presenting via the processor the audio associated with the selected word or phoneme.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A tangible computer-readable storage medium storing instructions for controlling a computing device to enable human workers to find errors when developing a text-to-speech (TTS) voice, the instructions comprising:
presenting a graphical user interface wherein after a first pass of automatic speech recognition (ASR) of a speech corpus is complete, the interface presents to a worker a graphical representation of an alignment of the ASR results, associated words and phonemes and the audio;
color-coding each word based on a composition of the color-coding associated with each phoneme;
receiving a graphical input from the worker associated with a selection of a word or phoneme; and
presenting the audio associated with the selected word or phoneme.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A computing device for enabling human workers to find errors when developing a text-to-speech (TTS) voice, the computing device comprising:
a processor;
a module configured to control the processor to present a graphical user interface wherein after a first pass of automatic speech recognition (ASR) of a speech corpus is complete, the interface presents to a worker a graphical representation of an alignment of the ASR results, associated words and phonemes and the audio;
a module configured to control the processor to color-code each word based on a composition of the color-coding associated with each phoneme;
a module configured to control the processor to receive a graphical input from the worker associated with a selection of a word or phoneme; and
a module configured to control the processor to present the audio associated with the selected word or phoneme.
Dependent claims: 18
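The color-coding element shared by the independent claims (each word colored by a composition of its phonemes' colors) can be sketched as follows. The claims do not say how phoneme colors are assigned or how they are composed, so two assumptions are made here purely for illustration: a phoneme's color is derived from a confidence score in [0, 1], and a word's color is the duration-weighted mean of its phoneme colors. `phoneme_color` and `word_color` are hypothetical names.

```python
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]


def phoneme_color(confidence: float) -> RGB:
    """Map a confidence in [0, 1] to a red-to-green color.

    Assumption: the source does not define the phoneme color scheme;
    red means low confidence and green means high confidence here.
    """
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range scores
    return (round(255 * (1.0 - c)), round(255 * c), 0)


def word_color(phoneme_colors: Sequence[RGB], durations_s: Sequence[float]) -> RGB:
    """Compose a word color as the duration-weighted mean of phoneme colors.

    Assumption: weighted averaging is one possible "composition";
    the claims do not specify the rule.
    """
    if not phoneme_colors or len(phoneme_colors) != len(durations_s):
        raise ValueError("need one duration per phoneme color, and at least one phoneme")
    total = sum(durations_s)
    if total <= 0:
        raise ValueError("total duration must be positive")
    red, green, blue = (
        round(sum(color[ch] * d for color, d in zip(phoneme_colors, durations_s)) / total)
        for ch in range(3)
    )
    return (red, green, blue)
```

Under these assumptions, a word whose longest phoneme is low-confidence shifts toward red, which is one way a worker's eye could be drawn to likely alignment errors before listening to the audio.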
Specification