SPEECH RECOGNITION ASSISTED EVALUATION ON TEXT-TO-SPEECH PRONUNCIATION ISSUE DETECTION
Abstract
Pronunciation issues for synthesized speech are automatically detected using human recordings as a reference within a Speech Recognition Assisted Evaluation (SRAE) framework that includes a Text-To-Speech (TTS) flow and a Speech Recognition (SR) flow. A pronunciation issue detector evaluates results obtained at multiple levels of the TTS flow and the SR flow (e.g., phone, word, and signal level), using the corresponding human recordings as the reference for the synthesized speech, and outputs possible pronunciation issues. A signal level may be used to determine similarities and differences between the recordings and the TTS output. A model level checker may provide results to the pronunciation issue detector to check the similarity of the TTS and SR phone sets, including their mapping relations. Results from a comparison of the SR output and the recordings may also be evaluated by the pronunciation issue detector. The pronunciation issue detector outputs a list of potential pronunciation issue candidates.
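The signal-level comparison mentioned in the abstract can be illustrated with a small sketch. The abstract does not name a particular similarity measure, so dynamic time warping (DTW) is swapped in here purely as an assumption. The function name `dtw_distance` and the idea of feeding it per-frame feature vectors extracted elsewhere are hypothetical choices for illustration.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Length-normalised DTW cost between two sequences of feature vectors.

    Assumption: each element is a tuple of floats (one acoustic frame),
    already extracted from the recording or the TTS output elsewhere.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # cost[i][j] = cheapest alignment of the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of seq_a repeated against seq_b
                cost[i][j - 1],      # frame of seq_b repeated against seq_a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m] / (n + m)


# Toy one-dimensional "features": the second sequence is a time-stretched
# copy of the first, the third differs in content.
recording = [(0.0,), (1.0,), (2.0,)]
tts_same = [(0.0,), (1.0,), (1.0,), (2.0,)]
tts_diff = [(0.0,), (5.0,), (2.0,)]
print(dtw_distance(recording, tts_same), dtw_distance(recording, tts_diff))
```

A low cost suggests the synthesized segment tracks the human recording closely even if the timing differs. A high cost marks a segment worth passing to the detector as additional evidence.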
20 Claims
1. A method for determining pronunciation issues, comprising:
receiving text comprising sentences for a Text-To-Speech (TTS) component and a recording of the text that is used as a reference for the text;
receiving synthesized speech generated by the TTS component using the text as input to the TTS component;
evaluating results received by an evaluation performed at a text level by determining a similarity of the synthesized speech to the recording;
evaluating results obtained from a Speech Recognition (SR) component related to different inputs to the SR component comprising the synthesized speech and the recording; and
generating a list that includes a ranking of pronunciation issue candidates based on the evaluations.
Dependent claims: 2, 3, 4, 5, 6, 7, 8

9. A computer-readable medium storing computer-executable instructions for determining pronunciation issues, comprising:
receiving text comprising sentences for a Text-To-Speech (TTS) component and a recording of the text that is used as a reference for the text;
receiving synthesized speech generated by the TTS component using the text as input to the TTS component;
evaluating results received by an evaluation performed at a text level by determining a similarity of the synthesized speech to the recording;
evaluating results obtained from a Speech Recognition (SR) component related to different inputs to the SR component comprising the synthesized speech and the recording;
evaluating results from a signal level evaluation of the text and the recording; and
generating a list that includes a ranking of pronunciation issue candidates based on the evaluations.
Dependent claims: 10, 11, 12, 13, 14

15. A system for determining pronunciation issues, comprising:
a processor and memory;
an operating environment executing using the processor;
text comprising sentences and a recording that corresponds to the text;
a Text-To-Speech (TTS) component configured to generate synthesized speech using the text;
a Speech Recognition (SR) component configured to recognize speech; and
a pronunciation issue detector that is configured to perform actions comprising;
receiving the synthesized speech generated by the TTS component;
evaluating results received by an evaluation performed at a text level by determining a similarity of the synthesized speech to the recording;
evaluating results obtained from the SR component related to different inputs to the SR component comprising the synthesized speech and the recording;
evaluating results from a signal level evaluation of the text and the recording; and
generating a list that includes a ranking of pronunciation issue candidates based on the evaluations.
Dependent claims: 16, 17, 18, 19, 20
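The SR-based evaluation and ranking steps recited in the independent claims can be sketched roughly as follows. This is only one possible reading, under stated assumptions: the SR component is represented by transcripts that have already been produced, word alignment uses Python's `difflib.SequenceMatcher`, and a word counts as a candidate when SR misses it for the synthesized speech but recognizes it from the human recording. The helper names `unmatched_indices` and `rank_candidates` and the count-based ranking are hypothetical and do not come from the claims.

```python
from collections import Counter
from difflib import SequenceMatcher


def unmatched_indices(reference, hypothesis):
    """Indices of reference words that have no aligned match in the hypothesis."""
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))
    return set(range(len(reference))) - matched


def rank_candidates(items):
    """Rank words by how often SR misses them for TTS audio only.

    items: iterable of (text, sr_of_recording, sr_of_tts) string triples,
    where the last two are SR transcripts (assumed to be given).
    """
    counts = Counter()
    for text, sr_of_recording, sr_of_tts in items:
        ref = text.lower().split()
        rec_errors = unmatched_indices(ref, sr_of_recording.lower().split())
        tts_errors = unmatched_indices(ref, sr_of_tts.lower().split())
        # Words missed for both inputs more likely reflect an SR weakness
        # than a TTS pronunciation issue, so they are excluded here.
        for i in tts_errors - rec_errors:
            counts[ref[i]] += 1
    # Most frequent candidates first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


items = [
    ("please read the record aloud", "please read the record aloud", "please red the record aloud"),
    ("read the note", "read the note", "red the note"),
    ("the quick fox", "the quack fox", "the quack fox"),
    ("record it", "record it", "wreck it"),
]
print(rank_candidates(items))  # [('read', 2), ('record', 1)]
```

Using the recording's transcript as a baseline is the design choice that matters here: it filters out words the recognizer struggles with regardless of speaker, so the ranked list concentrates on words where the synthesized speech itself is the likely cause.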
Specification