Speech recognition and text-to-speech learning system
First Claim
1. A text-to-speech learning system, the system comprising:
at least one processor; and
at least one storage device, operatively connected to the at least one processor and storing:
at least one training corpus comprising a plurality of training pairs that represent a varied vocabulary from one or more speakers, each training pair comprising a speech input and a text input corresponding to the speech input; and
instructions that, when executed by the at least one processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising:
for each training pair:
selecting a training pair from the at least one training corpus;
generating a first pronunciation sequence from the speech input of the training pair; and
generating a second pronunciation sequence from the text input of the training pair;
determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and
generating a pronunciation sequence conversion model based on a plurality of pronunciation sequence differences, wherein the pronunciation sequence conversion model is configured to synthesize speech by converting a pronunciation sequence generated in response to a received speech input to a target pronunciation sequence that more closely matches a pronunciation sequence extracted from the received speech input.
Abstract
An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
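The text-to-speech method summarized above can be illustrated with a minimal sketch. The abstract does not say how the pronunciation sequence difference is computed or what form the conversion model takes, so everything below is an assumption made for illustration: pronunciation sequences are lists of phone labels, the difference is taken with a generic sequence alignment from Python's `difflib`, and the "model" is simply the most frequent replacement observed for each differing span. The function names `pronunciation_difference`, `build_conversion_model`, and `convert` are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def pronunciation_difference(from_speech, from_text):
    """Return (text_span, speech_span) pairs where the two sequences disagree."""
    matcher = SequenceMatcher(a=from_text, b=from_speech, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tuple(from_text[i1:i2]), tuple(from_speech[j1:j2])))
    return diffs


def build_conversion_model(training_pairs):
    """Assumed model form: map each differing text span to its most common speech span."""
    counts = defaultdict(Counter)
    for from_speech, from_text in training_pairs:
        for src, dst in pronunciation_difference(from_speech, from_text):
            counts[src][dst] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


def convert(model, sequence):
    """Rewrite a sequence with the learned rules, preferring the longest matching span."""
    rules = {src: dst for src, dst in model.items() if src}  # skip pure insertions
    max_len = max((len(src) for src in rules), default=0)
    out, i = [], 0
    while i < len(sequence):
        for n in range(min(max_len, len(sequence) - i), 0, -1):
            key = tuple(sequence[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            # No rule starts at this position: copy the phone unchanged.
            out.append(sequence[i])
            i += 1
    return out


# One hypothetical training pair: the speaker says "ey" where the text front end predicts "aa".
speech_seq = ["t", "ah", "m", "ey", "t", "ow"]
text_seq = ["t", "ah", "m", "aa", "t", "ow"]
model = build_conversion_model([(speech_seq, text_seq)])
converted = convert(model, ["p", "ah", "t", "aa", "t", "ow"])
```

In this toy setting a single training pair yields one substitution rule; a practical system would presumably need many pairs and a context-sensitive model rather than a lookup table.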
19 Claims
1. A text-to-speech learning system (claim text as reproduced under First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7
8. A speech recognition learning system, the system comprising:
at least one processor; and
at least one storage device, operatively connected to the at least one processor and storing:
at least one training corpus comprising a plurality of training pairs that represent a varied vocabulary from one or more speakers, each training pair comprising a speech input and a text input corresponding to the speech input; and
instructions that, when executed by the at least one processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising:
for each training pair,
receiving a training pair from the at least one training corpus;
extracting an audio signal vector from the speech input of the training pair; and
applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector; and
adapting an acoustic model based on a plurality of converted audio signal vectors to generate an adapted acoustic model, wherein the adapted acoustic model is used to generate a pronunciation sequence during a speech recognition operation.
Dependent claims: 9, 10, 11, 12, 13, 14, 15, 16, 17
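The steps recited in claim 8 (extract an audio signal vector, apply an audio signal conversion model, adapt an acoustic model from the converted vectors) might be sketched as follows. The claim does not specify the form of the conversion model or the adaptation rule, so this sketch assumes both: the conversion model is a per-dimension affine transform, the acoustic model is a table of per-phone mean vectors, and adaptation interpolates each mean toward the average of the converted vectors labelled with that phone. The phone labels on the vectors, the `weight` parameter, and the function names are all hypothetical.

```python
def convert_vector(vector, scale, bias):
    """Assumed conversion model: per-dimension affine transform of a feature vector."""
    return [s * x + b for x, s, b in zip(vector, scale, bias)]


def adapt_acoustic_model(means, labelled_vectors, weight=0.5):
    """Move each phone's mean toward the average of its converted vectors.

    means: dict mapping phone label -> mean feature vector (the unadapted model).
    labelled_vectors: iterable of (phone label, converted feature vector).
    weight: interpolation factor between the old mean (0.0) and the data mean (1.0).
    """
    sums, counts = {}, {}
    for phone, vec in labelled_vectors:
        acc = sums.setdefault(phone, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
        counts[phone] = counts.get(phone, 0) + 1

    adapted = {phone: list(mean) for phone, mean in means.items()}
    for phone, acc in sums.items():
        if phone not in adapted:
            continue  # no prior mean for this phone, so nothing to adapt
        data_mean = [x / counts[phone] for x in acc]
        adapted[phone] = [
            (1 - weight) * m + weight * d for m, d in zip(adapted[phone], data_mean)
        ]
    return adapted


# Toy data: two 2-dimensional vectors extracted from speech and labelled "aa".
scale, bias = [2.0, 1.0], [0.0, 1.0]
extracted = [("aa", [1.0, 1.0]), ("aa", [3.0, 3.0])]
converted = [(phone, convert_vector(vec, scale, bias)) for phone, vec in extracted]
adapted_model = adapt_acoustic_model({"aa": [0.0, 0.0]}, converted, weight=0.5)
```

The unadapted model is copied rather than modified in place, so the original means remain available for comparison after adaptation.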
18. A method for generating a text-to-speech model and a speech-recognition model, the method comprising:
generating a pronunciation sequence conversion model based on a plurality of pronunciation sequence differences, wherein each of the pronunciation sequence differences is associated with a training pair from a plurality of training pairs stored in at least one training corpus and representing a varied vocabulary from one or more speakers, and each of the pronunciation sequence differences is generated by comparing a first pronunciation sequence generated from a speech input of a training pair associated with the pronunciation sequence difference to a second pronunciation sequence generated from a text input of the training pair associated with the pronunciation sequence difference; and
adapting an acoustic model based on a plurality of converted audio signal vectors, wherein each of the converted audio signal vectors is associated with a speech input from a plurality of speech inputs and each of the converted audio signal vectors is generated by extracting an audio signal vector from the speech input associated with the converted audio signal vector and applying an audio signal vector conversion model to the extracted audio signal vector.
Dependent claims: 19
Specification