SPEECH RECOGNITION ASSISTED EVALUATION ON TEXT-TO-SPEECH PRONUNCIATION ISSUE DETECTION

US 20140257815A1
Filed: 03/05/2013
Published: 09/11/2014
Est. Priority Date: 03/05/2013
Status: Active Grant

Abstract

Pronunciation issues for synthesized speech are automatically detected using human recordings as a reference within a Speech Recognition Assisted Evaluation (SRAE) framework including a Text-To-Speech (TTS) flow and a Speech Recognition (SR) flow. A pronunciation issue detector evaluates results obtained at multiple levels of the TTS flow and the SR flow (e.g., phone, word, and signal level) by using the corresponding human recordings as the reference for the synthesized speech, and outputs possible pronunciation issues. A signal level may be used to determine similarities/differences between the recordings and the TTS output. A model level checker may provide results to the pronunciation issue detector to check the similarities of the TTS and the SR phone set including mapping relations. Results from a comparison of the SR output and the recordings may also be evaluated by the pronunciation issue detector. The pronunciation issue detector outputs a list of potential pronunciation issue candidates.
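The word-level comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `mismatched_words` and `candidate_issues` are hypothetical, the SR transcripts are assumed to be available as plain strings, and the alignment uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever alignment the SR flow produces. The idea shown is only that a word the recognizer misses in the synthesized speech but recovers from the human recording is a plausible pronunciation issue candidate.

```python
from difflib import SequenceMatcher


def mismatched_words(reference, hypothesis):
    """Return (index, word) pairs of reference words the hypothesis does not reproduce."""
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        # Each block marks a run of reference words that also appears in the hypothesis.
        matched.update(range(block.a, block.a + block.size))
    return [(i, w) for i, w in enumerate(reference) if i not in matched]


def candidate_issues(text, sr_of_tts, sr_of_recording):
    """Flag words missed in the SR output of the TTS audio but not of the recording."""
    ref = text.lower().split()
    tts_miss = set(mismatched_words(ref, sr_of_tts.lower().split()))
    rec_miss = set(mismatched_words(ref, sr_of_recording.lower().split()))
    # Assumption: words missed for both inputs point to SR errors, so they are dropped.
    return sorted(tts_miss - rec_miss)


if __name__ == "__main__":
    print(candidate_issues(
        "please read the live record",
        "please red the live record",   # hypothetical SR output for the TTS audio
        "please read the live record",  # hypothetical SR output for the recording
    ))
```

With the sample inputs above, only the word "read" (index 1) is flagged, because it is misrecognized in the synthesized speech and recognized correctly in the recording.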

20 Claims

1. A method for determining pronunciation issues, comprising:
- receiving text comprising sentences for a Text-To-Speech (TTS) component and a recording of the text that is used as a reference for the text;
  
  receiving synthesized speech generated by the TTS component using the text as input to the TTS component;
  
  evaluating results received by an evaluation performed at a text level by determining a similarity of the synthesized speech to the recording;
  
  evaluating results obtained from a Speech Recognition (SR) component related to different inputs to the SR component comprising the synthesized speech and the recording; and
  
  generating a list that includes a ranking of pronunciation issue candidates based on the evaluations.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1, further comprising evaluating results from a signal level evaluation of phone sequences of the text using a phone sequence determined from the TTS component and an SR phone sequence of the recording.
  - 3. The method of claim 1, wherein the evaluation at the text level comprises performing evaluations for a word sequence and a phone sequence of each sentence within the text.
  - 4. The method of claim 1, wherein the evaluation at the text level comprises performing a similarity measurement of a phone sequence of each sentence in the text and a corresponding phone sequence of each sentence in the recording.
  - 5. The method of claim 1, further comprising performing a model level check for an acoustic model that determines a similarity of a TTS phone set and an SR phone set including determining a mapping relation between the TTS acoustic model and the SR acoustic model.
  - 6. The method of claim 1, wherein the evaluation performed at the text level comprises determining a similarity using an equation as defined by:
  - 7. The method of claim 1, wherein generating the list that includes the ranking of pronunciation issue candidates comprises filtering out mismatched words for judgment labels based on at least one of the evaluations using the synthesized speech and the recording.
  - 8. The method of claim 1, wherein the results received by the evaluation performed at the text level and the results obtained from the SR component are received by a pronunciation issue detector that is configured to perform the evaluations and to generate the list.
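Claims 4 and 6 refer to a similarity measurement between phone sequences, but the equation of claim 6 is not reproduced in this text. The Python sketch below therefore uses a generic stand-in, a length-normalized Levenshtein (edit) distance, purely to illustrate what a per-sentence phone sequence similarity might look like. The function names and the normalization are assumptions, not the claimed equation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit cost for every edit)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_similarity(tts_phones, ref_phones):
    """Similarity in [0, 1]; 1.0 means the two phone sequences are identical."""
    longest = max(len(tts_phones), len(ref_phones))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - edit_distance(tts_phones, ref_phones) / longest


if __name__ == "__main__":
    # Hypothetical phone sequences for the word "read" (past vs. present tense).
    print(phone_similarity(["r", "eh", "d"], ["r", "iy", "d"]))
```

A sentence whose TTS phone sequence scores well below 1.0 against the phone sequence recognized from the recording would be a natural candidate for closer inspection.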

9. A computer-readable medium storing computer-executable instructions for determining pronunciation issues, comprising:
- receiving text comprising sentences for a Text-To-Speech (TTS) component and a recording of the text that is used as a reference for the text;
  
  receiving synthesized speech generated by the TTS component using the text as input to the TTS component;
  
  evaluating results received by an evaluation performed at a text level by determining a similarity of the synthesized speech to the recording;
  
  evaluating results obtained from a Speech Recognition (SR) component related to different inputs to the SR component comprising the synthesized speech and the recording;
  
  evaluating results from a signal level evaluation of the text and the recording; and
  
  generating a list that includes a ranking of pronunciation issue candidates based on the evaluations.
- Dependent Claims (10, 11, 12, 13, 14)
  - 10. The computer-readable medium of claim 9, wherein the signal level evaluation of the text comprises evaluating a similarity of the recording of phone sequences of the text using a phone sequence determined from the TTS component and an SR phone sequence of the recording.
  - 11. The computer-readable medium of claim 9, wherein the evaluation at the text level comprises performing a similarity measurement of a phone sequence of each sentence in the text and a corresponding phone sequence of each sentence in the recording.
  - 12. The computer-readable medium of claim 9, further comprising performing a model level check for an acoustic model that determines a similarity of a TTS phone set and an SR phone set including determining a mapping relation between the TTS acoustic model and the SR acoustic model.
  - 13. The computer-readable medium of claim 9, wherein the evaluation performed at the text level comprises determining a similarity using an equation as defined by:
  - 14. The computer-readable medium of claim 9, wherein generating the list that includes the ranking of pronunciation issue candidates comprises filtering out mismatched words for judgment labels based on at least one of the evaluations using the synthesized speech and the recording.
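The model level check recited in claim 12 (and in claims 5 and 18) compares the TTS phone set with the SR phone set through a mapping relation. A minimal Python sketch of such a check follows. The function name `check_phone_sets`, the dictionary-based mapping, and the coverage ratio are all illustrative assumptions; the claims do not specify how the mapping is represented or scored.

```python
def check_phone_sets(tts_phones, sr_phones, mapping):
    """Return (coverage, unmapped) for a TTS phone set against an SR phone set.

    mapping: dict from TTS phone symbols to SR phone symbols (hypothetical format).
    Phones absent from the mapping are assumed to keep the same symbol.
    """
    mapped = {p: mapping.get(p, p) for p in tts_phones}
    # TTS phones whose mapped symbol does not exist in the SR phone set.
    unmapped = sorted(p for p, q in mapped.items() if q not in sr_phones)
    covered = len(tts_phones) - len(unmapped)
    coverage = covered / len(tts_phones) if tts_phones else 1.0
    return coverage, unmapped


if __name__ == "__main__":
    tts = {"aa", "ax", "r"}   # hypothetical TTS phone set
    sr = {"aa", "ah", "r"}    # hypothetical SR phone set
    print(check_phone_sets(tts, sr, {"ax": "ah"}))
    print(check_phone_sets(tts, sr, {}))
```

Phones reported as unmapped indicate where phone-level comparisons between the TTS flow and the SR flow would be unreliable, which is one reason to run this check before the sentence-level evaluations.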

15. A system for determining pronunciation issues, comprising:
- a processor and memory;
  
  an operating environment executing using the processor;
  
  text comprising sentences and a recording that corresponds to the text;
  
  a Text-To-Speech (TTS) component configured to generate synthesized speech using the text;
  
  a Speech Recognition (SR) component configured to recognize speech; and
  
  a pronunciation issue detector that is configured to perform actions comprising:
  
  receiving the synthesized speech generated by the TTS component;
  
  evaluating results received by an evaluation performed at a text level by determining a similarity of the synthesized speech to the recording;
  
  evaluating results obtained from the SR component related to different inputs to the SR component comprising the synthesized speech and the recording;
  
  evaluating results from a signal level evaluation of the text and the recording; and
  
  generating a list that includes a ranking of pronunciation issue candidates based on the evaluations.
- Dependent Claims (16, 17, 18, 19, 20)
  - 16. The system of claim 15, wherein the signal level evaluation of the text comprises evaluating a similarity of the recording of phone sequences of the text using a phone sequence determined from the TTS component and an SR phone sequence of the recording.
  - 17. The system of claim 15, wherein the evaluation at the text level comprises performing a similarity measurement of a phone sequence of each sentence in the text and a corresponding phone sequence of each sentence in the recording.
  - 18. The system of claim 15, further comprising performing a model level check for an acoustic model that determines a similarity of a TTS phone set and an SR phone set including determining a mapping relation between the TTS acoustic model and the SR acoustic model.
  - 19. The system of claim 15, wherein the evaluation performed at the text level comprises determining a similarity using an equation as defined by:
  - 20. The system of claim 15, wherein generating the list that includes the ranking of pronunciation issue candidates comprises filtering out mismatched words for judgment labels based on at least one of the evaluations using the synthesized speech and the recording.
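The final step of each independent claim generates a ranked list of pronunciation issue candidates from the separate evaluations. One way this could be sketched in Python is a weighted score per word, shown below. The `Candidate` fields, the weights, and the linear combination are assumptions made for illustration; the claims do not state how the evaluations are combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    phone_similarity: float  # text level: 1.0 means the phone sequences agree
    sr_mismatch: bool        # SR level: word missed in the SR output of the TTS audio
    signal_distance: float   # signal level: hypothetical acoustic distance in [0, 1]


def rank_candidates(candidates, weights=(0.4, 0.4, 0.2)):
    """Sort candidates so the most suspicious words come first."""
    w_phone, w_sr, w_signal = weights  # illustrative weights, not from the patent

    def score(c):
        return (w_phone * (1.0 - c.phone_similarity)
                + w_sr * (1.0 if c.sr_mismatch else 0.0)
                + w_signal * c.signal_distance)

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    ranked = rank_candidates([
        Candidate("the", 1.0, False, 0.1),
        Candidate("read", 0.67, True, 0.5),
        Candidate("live", 0.5, False, 0.3),
    ])
    print([c.word for c in ranked])
```

In this example "read" ranks first because all three evaluations point to a problem, followed by "live" and then "the".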

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Zhao, Pei; He, Lei; Geng, Zhe; Yan, Bo; Leung, Yiu-Ming

Granted Patent: US 9,293,129 B2
US Class Current: 704/260
CPC Class Codes:
- G10L 13/08 Text analysis or generation...
- G10L 13/086 Detection of language
