Speech Recognition and Text-to-Speech Learning System

US 20170287465A1
Filed: 03/31/2016
Published: 10/05/2017
Est. Priority Date: 03/31/2016
Status: Active Grant

Abstract

An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence, and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.

20 Claims

1. A text-to-speech learning system, the system comprising:
- at least one processor; and
  
  memory, operatively connected to the at least one processor and storing instructions that, when executed by the at least one processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising:
  
  generating a first pronunciation sequence from a speech input of a training pair;
  
  generating a second pronunciation sequence from a text input of the training pair;
  
  determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and
  
  generating a pronunciation sequence conversion model based on the pronunciation sequence difference.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The text-to-speech learning system of claim 1, wherein the method further comprises extracting an audio signal vector from the speech input of the training pair, and wherein the first pronunciation sequence is generated based on the extracted audio signal vector.
  - 3. The text-to-speech learning system of claim 1, wherein the pronunciation sequence conversion model comprises a recursive neural network.
  - 4. The text-to-speech learning system of claim 1, wherein determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence comprises aligning the first pronunciation sequence with the second pronunciation sequence.
  - 5. The text-to-speech learning system of claim 1, wherein the first pronunciation sequence comprises a sequence of pronunciation signals.
  - 6. The text-to-speech learning system of claim 1, wherein the pronunciation sequence conversion model is generated based on a plurality of pronunciation sequence differences from a plurality of training pairs, the plurality of pronunciation sequence differences including the pronunciation sequence difference.
  - 7. The text-to-speech learning system of claim 6, wherein the plurality of pronunciation sequence differences comprises at least one pronunciation sequence difference generated from a training pair selected from a text-to-speech training corpus and at least one pronunciation sequence difference generated from a training pair selected from a speech-recognition training corpus.
  - 8. The text-to-speech learning system of claim 1, wherein the pronunciation sequence generator model is configured to be used by a text-to-speech system to synthesize speech.
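
The steps of claims 1, 4, and 6 can be illustrated with a minimal Python sketch. It assumes both pronunciation sequences are already available as lists of phoneme labels; producing them from speech and from text is outside the sketch. The function names `pronunciation_difference` and `build_conversion_model` are hypothetical, alignment is done with the standard library's `difflib.SequenceMatcher`, and the "conversion model" is a plain lookup table standing in for the neural network mentioned in claim 3, not the patented implementation.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def pronunciation_difference(spoken, canonical):
    """Align two phoneme sequences and return the spans where they differ.

    spoken    -- first pronunciation sequence (derived from the speech input)
    canonical -- second pronunciation sequence (derived from the text input)
    Each difference is a (canonical_span, spoken_span) pair of tuples.
    """
    matcher = SequenceMatcher(a=canonical, b=spoken, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tuple(canonical[i1:i2]), tuple(spoken[j1:j2])))
    return diffs


def build_conversion_model(training_pairs):
    """Build a lookup table from differences observed over many training pairs.

    For each canonical span, keep the spoken span seen most often.
    Pure insertions (empty canonical span) are skipped in this toy model.
    """
    counts = defaultdict(Counter)
    for spoken, canonical in training_pairs:
        for src, dst in pronunciation_difference(spoken, canonical):
            if src:
                counts[src][dst] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


# Illustrative training pair: the speaker says "aa" where the text predicts "ey".
pairs = [
    (["t", "ah", "m", "aa", "t", "ow"], ["t", "ah", "m", "ey", "t", "ow"]),
]
model = build_conversion_model(pairs)
```

A table like this only captures context-free substitutions; a sequence model would be needed to capture rewrites that depend on neighboring phonemes.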

9. A speech recognition learning system, the system comprising:
- at least one processor; and
  
  memory, operatively connected to the at least one processor and storing instructions that, when executed by the at least one processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising:
  
  extracting an audio signal vector from a speech input;
  
  applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector; and
  
  adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
- Dependent claims (10, 11, 12, 13, 14, 15, 16, 17, 18)
  - 10. The speech recognition learning system of claim 9, wherein the adapted acoustic model is configured to be used by a speech-recognition system to recognize speech from a user.
  - 11. The speech recognition learning system of claim 9, wherein the method further comprises generating an audio vector conversion model based on a plurality of training pairs.
  - 12. The speech recognition learning system of claim 11, wherein the method further comprises comparing an audio signal vector extracted from a speech input of a training pair of the plurality of training pairs to a second audio signal vector generated from the text input of the training pair.
  - 13. The speech recognition learning system of claim 12, wherein the method further comprises determining a difference between the extracted audio signal vector and the second audio signal vector.
  - 14. The speech recognition learning system of claim 13, wherein determining the difference between the extracted audio signal vector and the second audio signal vector comprises aligning the extracted audio signal vector with the second audio signal vector.
  - 15. The speech recognition learning system of claim 12, wherein the second audio signal vector is generated by extracting an audio signal vector from synthesized speech based on the text input of the training pair.
  - 16. The speech recognition learning system of claim 11, wherein the audio signal vector conversion model is configured to be used by a speech recognition system to recognize speech from a user.
  - 17. The speech recognition learning system of claim 9, wherein the adapted acoustic model is generated by adapting a plurality of extracted audio signal vectors from a plurality of speech inputs, the plurality of extracted audio signal vectors including the audio signal vector.
  - 18. The speech recognition learning system of claim 17, wherein the plurality of speech inputs comprise at least one speech input from a training pair in a text-to-speech training corpus and at least one speech input from a training pair in a speech-recognition training corpus.
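
The method of claims 9 and 17 (convert extracted audio signal vectors, then adapt an acoustic model on the converted vectors) can be sketched as follows. Everything here is an assumption made for illustration: the claims do not specify the form of either model, so `OffsetConversionModel` (a fixed per-dimension offset) and `GaussianMeanModel` (a single mean vector interpolated toward the sample mean) are hypothetical stand-ins, and feature extraction is taken as already done.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class OffsetConversionModel:
    """Hypothetical audio signal conversion model: adds a per-dimension offset."""
    offset: List[float]

    def convert(self, vector: Sequence[float]) -> List[float]:
        return [x + o for x, o in zip(vector, self.offset)]


@dataclass
class GaussianMeanModel:
    """Toy acoustic model: one mean vector for a single acoustic unit."""
    mean: List[float]

    def adapt(self, vectors: Sequence[Sequence[float]],
              weight: float = 0.5) -> "GaussianMeanModel":
        """Return a new model whose mean is interpolated toward the sample mean."""
        n = len(vectors)
        sample_mean = [sum(col) / n for col in zip(*vectors)]
        new_mean = [(1 - weight) * m + weight * s
                    for m, s in zip(self.mean, sample_mean)]
        return GaussianMeanModel(new_mean)


# Audio signal vectors assumed to have been extracted from two speech inputs.
extracted = [[0.0, 2.0], [2.0, 4.0]]

conversion = OffsetConversionModel(offset=[1.0, -1.0])
converted = [conversion.convert(v) for v in extracted]

# Adapt the acoustic model on the converted vectors, not the raw ones.
adapted = GaussianMeanModel(mean=[0.0, 0.0]).adapt(converted)
```

The point of the sketch is the ordering: the acoustic model never sees the raw extracted vectors, only their converted form.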

19. A method for generating a text-to-speech model and a speech-recognition model, the method comprising:
- generating a pronunciation sequence conversion model based on a plurality of pronunciation sequence differences, wherein each of the pronunciation sequence differences is associated with a training pair from a plurality of training pairs and each of the pronunciation sequence differences is generated by comparing a first pronunciation sequence generated from a speech input of a training pair associated with the pronunciation sequence difference to a second pronunciation sequence generated from the text input of the training pair associated with the pronunciation sequence difference; and
  
  adapting an acoustic model based on a plurality of converted audio signal vectors, wherein each of the converted audio signal vectors is associated with a speech input from a plurality of speech inputs and each of the converted audio signal vectors is generated by extracting an audio signal vector from the speech input associated with the converted audio signal vector and applying an audio signal vector conversion model to the extracted audio signal vector.
- Dependent claims (20)
  - 20. The method of claim 19, further comprising:
    - generating an audio vector conversion model based on a plurality of audio signal vector differences, wherein each of the audio signal vectors is associated with a training pair from the plurality of training pairs and each of the audio signal vector differences is generated by comparing an audio signal vector extracted from the speech input of the training pair associated with the audio signal vector difference and a second audio signal vector generated from the text input of the training pair associated with the audio signal vector difference.
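
One way to read claim 20 is as fitting the conversion model from per-pair vector differences. The sketch below assumes that each training pair already yields two equal-length, aligned vectors (one extracted from the speech input, one generated from the text input, for example via synthesized speech as in claim 15), and that the conversion model is a single mean offset. Both the helper names and the direction of the difference (text-derived minus speech-derived) are assumptions, not details taken from the claims.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def vector_difference(extracted: Vector, from_text: Vector) -> List[float]:
    """Per-dimension difference between the text-derived and speech-derived vectors."""
    if len(extracted) != len(from_text):
        raise ValueError("vectors must be aligned to the same length")
    return [t - e for e, t in zip(extracted, from_text)]


def fit_offset(pairs: Sequence[Tuple[Vector, Vector]]) -> List[float]:
    """Average the differences over all training pairs to get one offset vector."""
    diffs = [vector_difference(e, t) for e, t in pairs]
    return [sum(col) / len(diffs) for col in zip(*diffs)]


# Illustrative (extracted, from_text) vector pairs for two training pairs.
training_vectors = [
    ([1.0, 2.0], [2.0, 2.0]),
    ([3.0, 0.0], [4.0, 2.0]),
]
offset = fit_offset(training_vectors)
```

Adding `offset` to a newly extracted vector would then play the role of "applying the audio signal vector conversion model" in claim 19.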

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Zhao, Pei; Yao, Kaisheng; Leung, Max; Yan, Bo

Granted Patent: US 10,089,974 B2
CPC Class Codes

G10L 13/08   Text analysis or generation...

G10L 13/086   Detection of language

G10L 13/10   Prosody rules derived from ...

G10L 15/063   Training

G10L 15/07   to the speaker
