Speech Recognition and Text-to-Speech Learning System
First Claim
1. A text-to-speech learning system, the system comprising:
at least one processor; and
memory, operatively connected to the at least one processor and storing instructions that, when executed by the at least one processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising:
generating a first pronunciation sequence from a speech input of a training pair;
generating a second pronunciation sequence from a text input of the training pair;
determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and
generating a pronunciation sequence conversion model based on the pronunciation sequence difference.
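The comparison step of the claim can be illustrated with a small sketch. It is not the patented implementation: the claim does not say how the difference is computed, so this example assumes pronunciation sequences are lists of phoneme symbols and uses Python's standard `difflib.SequenceMatcher` to align them. The function name `pronunciation_difference`, the phoneme labels, and the direction of the comparison (text-derived to speech-derived) are all hypothetical.

```python
from difflib import SequenceMatcher


def pronunciation_difference(from_speech, from_text):
    """Return the edits that turn the text-derived sequence into the speech-derived one.

    Each edit is (operation, text_phonemes, speech_phonemes). Alignment by
    difflib is an assumption; the claim does not specify a method.
    """
    matcher = SequenceMatcher(a=from_text, b=from_speech, autojunk=False)
    return [
        (tag, tuple(from_text[i1:i2]), tuple(from_speech[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the positions where the sequences disagree
    ]


# Hypothetical training pair: one utterance and its transcript.
speech_seq = ["t", "ah", "m", "aa", "t", "ow"]  # first sequence, from the speech input
text_seq = ["t", "ah", "m", "ey", "t", "ow"]    # second sequence, from the text input

diff = pronunciation_difference(speech_seq, text_seq)
print(diff)  # [('replace', ('ey',), ('aa',))]
```

A conversion model could then be fitted to such differences, for example by learning which substitutions recur across training pairs.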
Abstract
An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
20 Claims
1. A text-to-speech learning system, the system comprising:
at least one processor; and
memory, operatively connected to the at least one processor and storing instructions that, when executed by the at least one processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising:
generating a first pronunciation sequence from a speech input of a training pair;
generating a second pronunciation sequence from a text input of the training pair;
determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and
generating a pronunciation sequence conversion model based on the pronunciation sequence difference.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A speech recognition learning system, the system comprising:
at least one processor; and
memory, operatively connected to the at least one processor and storing instructions that, when executed by the at least one processor, cause the at least one processor to perform a method for generating a pronunciation sequence conversion model, the method comprising:
extracting an audio signal vector from a speech input;
applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector; and
adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
Dependent claims: 10, 11, 12, 13, 14, 15, 16, 17, 18
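The three steps of claim 9 (extract, convert, adapt) can be sketched as a toy pipeline. Everything concrete here is an assumption made for illustration: the claim does not define the features, the form of the conversion model, or the adaptation rule. The sketch uses per-frame energy as a stand-in feature, an element-wise affine map as the conversion model, and interpolation of a single mean vector as the adaptation. The names `extract_audio_vector`, `apply_conversion`, and `adapt_mean` are hypothetical.

```python
def extract_audio_vector(samples, frame_size=4):
    """Toy feature extraction: mean energy of each non-overlapping frame."""
    frames = [
        samples[i:i + frame_size]
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return [sum(x * x for x in frame) / frame_size for frame in frames]


def apply_conversion(vector, scale, offset):
    """Hypothetical conversion model: the same affine map on every element."""
    return [scale * v + offset for v in vector]


def adapt_mean(model_mean, converted_vectors, weight=0.5):
    """Move the model's mean vector toward the mean of the converted vectors."""
    n = len(converted_vectors)
    data_mean = [sum(column) / n for column in zip(*converted_vectors)]
    return [(1 - weight) * m + weight * d for m, d in zip(model_mean, data_mean)]


# Hypothetical speech input: eight raw samples forming two frames.
samples = [1.0, -1.0, 1.0, -1.0, 2.0, -2.0, 2.0, -2.0]

vector = extract_audio_vector(samples)                       # [1.0, 4.0]
converted = apply_conversion(vector, scale=0.5, offset=0.0)  # [0.5, 2.0]
adapted = adapt_mean([0.0, 0.0], [converted], weight=0.5)    # [0.25, 1.0]
print(vector, converted, adapted)
```

In a real system the acoustic model would have far more parameters than one mean vector. The sketch only shows the order of operations: the model is adapted on converted vectors rather than on the raw extracted ones.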
19. A method for generating a text-to-speech model and a speech-recognition model, the method comprising:
generating a pronunciation sequence conversion model based on a plurality of pronunciation sequence differences, wherein each of the pronunciation sequence differences is associated with a training pair from a plurality of training pairs and each of the pronunciation sequence differences is generated by comparing a first pronunciation sequence generated from a speech input of a training pair associated with the pronunciation sequence difference to a second pronunciation sequence generated from the text input of the training pair associated with the pronunciation sequence difference; and
adapting an acoustic model based on a plurality of converted audio signal vectors, wherein each of the converted audio signal vectors is associated with a speech input from a plurality of speech inputs and each of the converted audio signal vectors is generated by extracting an audio signal vector from the speech input associated with the converted audio signal vector and applying an audio signal vector conversion model to the extracted audio signal vector.
Dependent claims: 20
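The first limitation of claim 19 builds one conversion model from many pronunciation sequence differences. The sketch below shows one possible reading and is not the claimed method. It assumes each difference has already been reduced to a single (text phoneme, speech phoneme) substitution, and it treats the model as a lookup table of substitutions seen at least `min_count` times. The table form, the threshold, and the names `build_conversion_model` and `convert` are hypothetical.

```python
from collections import Counter, defaultdict


def build_conversion_model(differences, min_count=2):
    """Keep, for each text-derived phoneme, its most frequent speech-derived
    replacement, provided it was observed at least min_count times."""
    counts = defaultdict(Counter)
    for text_phone, speech_phone in differences:
        counts[text_phone][speech_phone] += 1

    model = {}
    for text_phone, observed in counts.items():
        speech_phone, count = observed.most_common(1)[0]
        if count >= min_count:
            model[text_phone] = speech_phone
    return model


def convert(sequence, model):
    """Apply the substitution table; phonemes without a rule pass through."""
    return [model.get(phone, phone) for phone in sequence]


# Hypothetical differences collected from four training pairs.
differences = [("ey", "aa"), ("ey", "aa"), ("t", "d"), ("ey", "eh")]

model = build_conversion_model(differences)
converted = convert(["t", "ah", "m", "ey", "t", "ow"], model)
print(model)      # {'ey': 'aa'}
print(converted)  # ['t', 'ah', 'm', 'aa', 't', 'ow']
```

The threshold keeps one-off disagreements, such as the single ("t", "d") pair, from becoming rules. A statistical or neural sequence model could fill the same role.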
Specification