Speech recognition and text-to-speech learning system

US 10,089,974 B2
Filed: 03/31/2016
Issued: 10/02/2018
Est. Priority Date: 03/31/2016
Status: Active Grant

Abstract

An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.

19 Claims

1. A text-to-speech learning system, the system comprising:
- at least one processor; and
  
  at least one storage device, operatively connected to the at least one processor and storing:
  
  at least one training corpus comprising a plurality of training pairs that represent a varied vocabulary from one or more speakers, each training pair comprising a speech input and a text input corresponding to the speech input; and
  
  instructions that, when executed by the at least processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising:
  
  for each training pair:
  
  selecting a training pair from the at least one training corpus;
  
  generating a first pronunciation sequence from the speech input of the training pair; and
  
  generating a second pronunciation sequence from the text input of the training pair;
  
  determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and
  
  generating a pronunciation sequence conversion model based on a plurality of pronunciation sequence differences, wherein the pronunciation sequence conversion model is configured to synthesize speech by converting a pronunciation sequence generated in response to a received speech input to a target pronunciation sequence that more closely matches a pronunciation sequence extracted from the received speech input.
  - 2. The text-to-speech learning system of claim 1, wherein the method further comprises extracting an audio signal vector from the speech input of the training pair, and wherein the first pronunciation sequence is generated based on the extracted audio signal vector.
  - 3. The text-to-speech learning system of claim 1, wherein the pronunciation sequence conversion model comprises a recursive neural network.
  - 4. The text-to-speech learning system of claim 1, wherein determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence comprises aligning the first pronunciation sequence with the second pronunciation sequence.
  - 5. The text-to-speech learning system of claim 1, wherein the first pronunciation sequence comprises a sequence of pronunciation signals.
  - 6. The text-to-speech learning system of claim 1, wherein:
    - the at least one training corpus comprises a text-to-speech training corpus comprising training pairs from a particular speaker and a speech-recognition training corpus comprising training pairs from different speakers; and
      
      the plurality of pronunciation sequence differences comprises at least one pronunciation sequence difference generated from a training pair selected from the text-to-speech training corpus and at least one pronunciation sequence difference generated from a training pair selected from the speech-recognition training corpus.
  - 7. The text-to-speech learning system of claim 1, wherein a pronunciation sequence generator model is configured to be used by a text-to-speech system to synthesize speech.
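
Claims 1 and 4 describe aligning a speech-derived pronunciation sequence with a text-derived one, collecting the differences over many training pairs, and building a conversion model from them. The following is a minimal sketch of that flow, not the patented implementation. It assumes pronunciation sequences are lists of phoneme labels (the ARPAbet-style symbols are illustrative), uses Python's `difflib.SequenceMatcher` as a stand-in aligner, and swaps the neural network mentioned in claim 3 for a simple lookup table. The function names `sequence_differences`, `build_conversion_table`, and `convert` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def sequence_differences(from_speech, from_text):
    """Align two phoneme sequences; return (text_span, speech_span) pairs that differ."""
    matcher = SequenceMatcher(a=from_text, b=from_speech, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tuple(from_text[i1:i2]), tuple(from_speech[j1:j2])))
    return diffs


def build_conversion_table(training_pairs):
    """Toy stand-in for the conversion model: most frequent speech-side span per text-side span.

    training_pairs is an iterable of (speech_sequence, text_sequence) tuples.
    """
    counts = {}
    for from_speech, from_text in training_pairs:
        for text_span, speech_span in sequence_differences(from_speech, from_text):
            counts.setdefault(text_span, Counter())[speech_span] += 1
    return {span: c.most_common(1)[0][0] for span, c in counts.items()}


def convert(sequence, table):
    """Apply only the single-phoneme entries of the table to a text-derived sequence.

    Multi-phoneme spans and pure insertions in the table are ignored here;
    a single-phoneme deletion entry drops that phoneme.
    """
    out = []
    for phone in sequence:
        out.extend(table.get((phone,), (phone,)))
    return out


# Illustrative data: the speaker says "tomato" with EY where the text front end predicts AA.
pairs = [(["T", "AH", "M", "EY", "T", "OW"], ["T", "AH", "M", "AA", "T", "OW"])]
table = build_conversion_table(pairs)
print(convert(["T", "AH", "M", "AA", "T", "OW"], table))
```

A lookup table cannot use context, which is one reason a sequence model such as the network named in claim 3 would be preferred in practice; the table is used here only to keep the example short.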

8. A speech recognition learning system, the system comprising:
- at least one processor; and
  
  at least one storage device, operatively connected to the at least one processor and storing:
  
  at least one training corpus comprising a plurality of training pairs that represent a varied vocabulary from one or more speakers, each training pair comprising a speech input and a text input corresponding to the speech input; and
  
  instructions that, when executed by the at least processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising:
  
  for each training pair:
  
  receiving a training pair from the at least one training corpus;
  
  extracting an audio signal vector from the speech input of the training pair; and
  
  applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector; and
  
  adapting an acoustic model based on a plurality of converted audio signal vectors to generate an adapted acoustic model, wherein the adapted acoustic model is used to generate a pronunciation sequence during a speech recognition operation.
  - 9. The speech recognition learning system of claim 8, wherein the adapted acoustic model is configured to be used by a speech-recognition system to recognize speech from a user.
  - 10. The speech recognition learning system of claim 8, wherein the method further comprises generating an audio vector conversion model based on the plurality of training pairs.
  - 11. The speech recognition learning system of claim 10, wherein the method further comprises comparing an audio signal vector extracted from a speech input of a respective training pair of the plurality of training pairs to a second audio signal vector generated from the text input of the respective training pair.
  - 12. The speech recognition learning system of claim 11, wherein the method further comprises determining a difference between the extracted audio signal vector and the second audio signal vector.
  - 13. The speech recognition learning system of claim 12, wherein determining the difference between the extracted audio signal vector and the second audio signal vector comprises aligning the extracted audio signal vector with the second audio signal vector.
  - 14. The speech recognition learning system of claim 11, wherein the second audio signal vector is generated by extracting an audio signal vector from synthesized speech based on the text input of the respective training pair.
  - 15. The speech recognition learning system of claim 10, wherein the audio signal vector conversion model is configured to be used by a speech recognition system to recognize speech from a user.
  - 16. The speech recognition learning system of claim 8, wherein the adapted acoustic model is generated by adapting a plurality of extracted audio signal vectors from a plurality of speech inputs.
  - 17. The speech recognition learning system of claim 16, wherein:
    - the at least one training corpus comprises a text-to-speech training corpus comprising training pairs from a particular speaker and a speech-recognition training corpus comprising training pairs from different speakers; and
      
      the plurality of speech inputs comprise at least one speech input from a training pair in the text-to-speech training corpus and at least one speech input from a training pair in the speech-recognition training corpus.
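
Claim 8 describes applying an audio signal conversion model to each extracted audio signal vector and then adapting an acoustic model from the converted vectors. The sketch below illustrates that two-step shape under strong simplifying assumptions that are not taken from the patent: an audio signal vector is a plain list of floats, the conversion model is a per-dimension additive offset, the acoustic model is a dictionary of per-phoneme mean vectors, and adaptation is a fixed-weight interpolation between the old mean and the mean of the converted data. The names `apply_conversion`, `adapt_means`, and `weight` are hypothetical.

```python
def apply_conversion(vector, offset):
    """Assumed conversion model: add a per-dimension offset to the vector."""
    return [x + d for x, d in zip(vector, offset)]


def adapt_means(means, labeled_vectors, offset, weight=0.5):
    """Return a new {label: mean_vector} dict adapted toward the converted data.

    means           -- existing per-label mean vectors (left unmodified)
    labeled_vectors -- iterable of (label, audio_signal_vector) tuples
    offset          -- the assumed additive conversion model
    weight          -- interpolation weight given to the converted data
    """
    sums, counts = {}, {}
    for label, vector in labeled_vectors:
        converted = apply_conversion(vector, offset)
        if label not in sums:
            sums[label] = [0.0] * len(converted)
            counts[label] = 0
        sums[label] = [s + c for s, c in zip(sums[label], converted)]
        counts[label] += 1

    adapted = dict(means)
    for label, total in sums.items():
        data_mean = [t / counts[label] for t in total]
        prior = means.get(label, data_mean)  # unseen labels start from the data mean
        adapted[label] = [(1 - weight) * p + weight * m for p, m in zip(prior, data_mean)]
    return adapted


# Illustrative values only.
model = {"AA": [0.0, 0.0]}
data = [("AA", [1.0, 2.0]), ("AA", [3.0, 4.0])]
print(adapt_means(model, data, offset=[1.0, -1.0]))
```

Returning a new dictionary rather than updating `means` in place keeps the original model available for comparison, which mirrors the claim's distinction between the acoustic model and the adapted acoustic model.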

18. A method for generating a text-to-speech model and a speech-recognition model, the method comprising:
- generating a pronunciation sequence conversion model based on a plurality of pronunciation sequence differences, wherein each of the pronunciation sequence differences is associated with a training pair from a plurality of training pairs stored in at least one training corpus and representing a varied vocabulary from one or more speakers, and each of the pronunciation sequence differences is generated by comparing a first pronunciation sequence generated from a speech input of a training pair associated with the pronunciation sequence difference to a second pronunciation sequence generated from a text input of the training pair associated with the pronunciation sequence difference; and
  
  adapting an acoustic model based on a plurality of converted audio signal vectors, wherein each of the converted audio signal vectors is associated with a speech input from a plurality of speech inputs and the each of the converted audio signal vectors is generated by extracting an audio signal vector from the speech input associated with the converted audio signal vector and applying an audio signal vector conversion model to the extracted audio signal vector.
  - 19. The method of claim 18 further comprising:
    - generating an audio vector conversion model based on a plurality of audio signal vector differences, wherein each of the audio signal vectors is associated with a training pair from the plurality of training pairs and each of the audio signal vector differences is generated by comparing an audio signal vector extracted from the speech input of the training pair associated with the audio signal vector difference and a second audio signal vector generated from the text input of the training pair associated with the audio signal vector difference.
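
Claims 11 to 13 and 19 describe comparing an audio signal vector extracted from recorded speech with a second vector generated from the corresponding text, aligning the two, and building a conversion model from the differences. The claims do not name an alignment method. The sketch below assumes dynamic time warping (DTW) over sequences of feature frames and summarizes the aligned differences as a mean per-dimension offset. Both choices, and the names `dtw_path` and `mean_offset`, are illustrative assumptions rather than the patent's method. Both sequences are assumed to be non-empty.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(seq_a, seq_b):
    """Return the DTW alignment as a list of (index_in_a, index_in_b) pairs."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def mean_offset(recorded, synthesized):
    """Average per-dimension difference (recorded minus synthesized) along the DTW path."""
    path = dtw_path(recorded, synthesized)
    dims = len(recorded[0])
    total = [0.0] * dims
    for i, j in path:
        for k in range(dims):
            total[k] += recorded[i][k] - synthesized[j][k]
    return [t / len(path) for t in total]


# Illustrative one-dimensional frames: the recording holds the first sound one frame longer.
recorded = [[1.0], [1.0], [11.0], [21.0]]
synthesized = [[0.0], [10.0], [20.0]]
print(dtw_path(recorded, synthesized))
print(mean_offset(recorded, synthesized))
```

Alignment is needed because recorded and synthesized speech rarely have the same number of frames; without it, frame-by-frame subtraction would compare unrelated sounds.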

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Zhao, Pei; Yao, Kaisheng; Leung, Max; Yan, Bo
Primary Examiner(s): Riley, Marcus T
Application Number: US15/087,696
Publication Number: US 20170287465A1
Time in Patent Office: 915 Days
Field of Search: None

CPC Class Codes

G10L 13/08   Text analysis or generation...
G10L 13/086   Detection of language
G10L 13/10   Prosody rules derived from ...
G10L 15/063   Training
G10L 15/07   to the speaker
