Low latency real-time vocal tract length normalization
Abstract
A method and system for training an automatic speech recognition system are provided. The method includes separating training data into speaker specific segments, and for each speaker specific segment, performing the following acts: generating spectral data, selecting a first warping factor and warping the spectral data, and comparing the warped spectral data with a speech model. The method also includes iteratively performing the steps of selecting another warping factor and generating another warped spectral data, comparing the other warped spectral data with the speech model, and if the other warping factor produces a closer match to the speech model, saving the other warping factor as the best warping factor for the speaker specific segment. The system includes modules configured to control a processor in the system to perform the steps of the method.
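The per-segment search described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`warp_spectrum`, `match_score`, `best_warping_factor`), the linear frequency warp, the candidate grid of 0.88 to 1.12, and the use of a negative squared distance in place of a real speech-model likelihood are all hypothetical choices made for the example.

```python
def warp_spectrum(spectrum, alpha):
    """Linearly warp the frequency axis: output bin k samples the input at k * alpha.

    Assumption: a simple linear warp with linear interpolation between bins.
    """
    n = len(spectrum)
    warped = []
    for k in range(n):
        pos = min(k * alpha, n - 1)  # clamp to the last available bin
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append(spectrum[lo] * (1.0 - frac) + spectrum[hi] * frac)
    return warped


def match_score(warped, model):
    """Hypothetical stand-in for a speech-model likelihood (higher is better)."""
    return -sum((w - m) ** 2 for w, m in zip(warped, model))


def best_warping_factor(spectrum, model, factors):
    """Try each candidate factor and keep the one whose warped spectrum
    matches the model most closely."""
    best_alpha, best_score = None, float("-inf")
    for alpha in factors:
        score = match_score(warp_spectrum(spectrum, alpha), model)
        if score > best_score:  # closer match: save as the best factor so far
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


# Assumed candidate grid: 0.88, 0.90, ..., 1.12
factors = [x / 100 for x in range(88, 114, 2)]

# Toy "model" spectrum; a segment identical to the model needs no warping.
model = [0.0, 1.0, 4.0, 9.0, 16.0, 9.0, 4.0, 1.0, 0.0, 0.0]
alpha, score = best_warping_factor(model, model, factors)
```

In this toy run the segment spectrum equals the model, so the search settles on a factor of 1.0. In a full training pipeline the search would be repeated for each speaker specific segment, and the saved factors would then be used to warp the training data.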
20 Claims
1. A method comprising:
performing, for a speaker specific segment of training data:
generating spectral data representative of the speaker specific segment, the spectral data comprising a plurality of warping factors;
selecting a first warping factor as a best warping factor from the plurality of warping factors based on a determination made during speech recognition of the speaker specific segment; and
generating a warped spectral data representation of the spectral data using the first warping factor;
iteratively carrying out, until a comparison indicates a warping factor difference below a threshold, the operations of:
generating another warped spectral data representation using another warping factor;
comparing the other warped spectral data representation to the warped spectral data representation, to yield the comparison; and
when the other warping factor produces a closer match to the warped spectral data representation, saving the other warping factor as a best warping factor for the speaker specific segment; and
training a new acoustic model using a warped spectral data representation of all the training data that is generated using the best warping factor for each of the speaker specific segments.
8. A system comprising:
a processor; and
a computer-readable storage medium having instructions stored which, when executed by the processor, result in the processor performing operations comprising:
performing, for a speaker specific segment of training data:
generating spectral data representative of the speaker specific segment, the spectral data comprising a plurality of warping factors;
selecting a first warping factor as a best warping factor from the plurality of warping factors based on a determination made during speech recognition of the speaker specific segment; and
generating a warped spectral data representation of the spectral data using the first warping factor;
iteratively carrying out, until a comparison indicates a warping factor difference below a threshold, the operations of:
generating another warped spectral data representation using another warping factor;
comparing the other warped spectral data representation to the warped spectral data representation, to yield the comparison; and
when the other warping factor produces a closer match to the warped spectral data representation, saving the other warping factor as a best warping factor for the speaker specific segment; and
training a new acoustic model using a warped spectral data representation of all the training data that is generated using the best warping factor for each of the speaker specific segments.
15. A computer-readable storage device having instructions stored which, when executed by a computing device, result in the computing device performing operations comprising:
performing, for a speaker specific segment of training data:
generating spectral data representative of the speaker specific segment, the spectral data comprising a plurality of warping factors;
selecting a first warping factor as a best warping factor from the plurality of warping factors based on a determination made during speech recognition of the speaker specific segment; and
generating a warped spectral data representation of the spectral data using the first warping factor;
iteratively carrying out, until a comparison indicates a warping factor difference below a threshold, the operations of:
generating another warped spectral data representation using another warping factor;
comparing the other warped spectral data representation to the warped spectral data representation, to yield the comparison; and
when the other warping factor produces a closer match to the warped spectral data representation, saving the other warping factor as a best warping factor for the speaker specific segment; and
training a new acoustic model using a warped spectral data representation of all the training data that is generated using the best warping factor for each of the speaker specific segments.
Specification