Constructing speech decoding network for numeric speech recognition
First Claim
1. A method for constructing a speech decoding network for recognizing digits in speech, comprising:
acquiring primary training data comprising a plurality of speech segments, and each speech segment comprising a plurality of digits;
performing acoustic feature extraction on the primary training data to obtain a plurality of feature sequences from the plurality of speech segments;
performing progressive training to obtain a tri-phoneme acoustic model based on the plurality of feature sequences and a plurality of phonemes corresponding to the digits in the speech segments in the primary training data, including:
obtaining a mono-phoneme acoustic model by training a model with the plurality of feature sequences according to divided states of a plurality of mono-phonemes corresponding to the digits in the plurality of speech segments;
decoding the primary training data with the mono-phoneme acoustic model to obtain secondary training data;
obtaining the tri-phoneme acoustic model by training a model with a plurality of feature sequences in the secondary training data according to divided states of a plurality of tri-phonemes corresponding to digits in a plurality of speech segments in the secondary training data;
acquiring a language model by modeling matching relations of the plurality of digits in the primary training data; and
constructing a speech decoding network by using the language model and the tri-phoneme acoustic model obtained by training.
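The step from mono-phonemes to tri-phonemes in the progressive training can be illustrated with a small label-expansion sketch. It is only an illustration under stated assumptions: the English digit lexicon `DIGIT_LEXICON`, the `sil` silence symbol, and the `left-center+right` label notation are hypothetical choices that do not come from the claim, which does not name a phoneme set or a labeling scheme.

```python
# Hypothetical English digit lexicon (assumption: not taken from the patent).
DIGIT_LEXICON = {
    "0": ["z", "iy", "r", "ow"],
    "1": ["w", "ah", "n"],
    "2": ["t", "uw"],
    "3": ["th", "r", "iy"],
    "4": ["f", "ao", "r"],
    "5": ["f", "ay", "v"],
    "6": ["s", "ih", "k", "s"],
    "7": ["s", "eh", "v", "ah", "n"],
    "8": ["ey", "t"],
    "9": ["n", "ay", "n"],
}

SILENCE = "sil"  # assumed boundary symbol


def digits_to_monophones(digits):
    """Map a digit string to a mono-phoneme sequence padded with silence."""
    phones = [SILENCE]
    for digit in digits:
        phones.extend(DIGIT_LEXICON[digit])
    phones.append(SILENCE)
    return phones


def monophones_to_triphones(phones):
    """Attach left and right context to every non-silence phoneme."""
    triphones = []
    for i, phone in enumerate(phones):
        if phone == SILENCE:
            # Silence is kept context-independent in this sketch.
            triphones.append(phone)
            continue
        left = phones[i - 1] if i > 0 else SILENCE
        right = phones[i + 1] if i + 1 < len(phones) else SILENCE
        triphones.append(f"{left}-{phone}+{right}")
    return triphones
```

For the digit string "12", `digits_to_monophones` yields the seven-symbol sequence `sil w ah n t uw sil`, and `monophones_to_triphones` turns each non-silence phoneme into a context-dependent label such as `w-ah+n`. Context crosses the digit boundary here (`ah-n+t`), which is one possible design choice rather than something the claim prescribes.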
Abstract
Embodiments of the present disclosure provide a method for constructing a speech decoding network for recognizing digits in speech. The method comprises: acquiring training data obtained by recording spoken digits, the training data comprising a plurality of speech segments, each speech segment comprising a plurality of spoken digits; performing acoustic feature extraction on the training data to obtain a feature sequence corresponding to each speech segment; performing progressive training, starting from a mono-phoneme acoustic model, to obtain an acoustic model; acquiring a language model; and constructing a speech decoding network from the language model and the acoustic model obtained by training.
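The feature extraction step turns each speech segment into a sequence with one feature per short time frame. The sketch below shows only that framing idea and is not the disclosed method: the frame length and hop (400 and 160 samples, that is 25 ms and 10 ms at an assumed 16 kHz sampling rate) and the single log-energy value per frame are assumptions made for illustration. Practical recognizers commonly use richer per-frame vectors such as MFCCs.

```python
import math


def frame_signal(samples, frame_len=400, hop=160):
    """Split a sample sequence into overlapping fixed-length frames."""
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def log_energy_features(samples, frame_len=400, hop=160):
    """Return one log-energy value per frame (a 1-D stand-in feature)."""
    features = []
    for frame in frame_signal(samples, frame_len, hop):
        energy = sum(x * x for x in frame)
        # Small floor avoids log(0) on silent frames.
        features.append(math.log(energy + 1e-10))
    return features
```

Calling `log_energy_features` on each speech segment gives one feature sequence per segment, which is the shape of input the later training steps consume.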
11 Citations
13 Claims
1. A method for constructing a speech decoding network for recognizing digits in speech, comprising:
acquiring primary training data comprising a plurality of speech segments, and each speech segment comprising a plurality of digits;
performing acoustic feature extraction on the primary training data to obtain a plurality of feature sequences from the plurality of speech segments;
performing progressive training to obtain a tri-phoneme acoustic model based on the plurality of feature sequences and a plurality of phonemes corresponding to the digits in the speech segments in the primary training data, including:
obtaining a mono-phoneme acoustic model by training a model with the plurality of feature sequences according to divided states of a plurality of mono-phonemes corresponding to the digits in the plurality of speech segments;
decoding the primary training data with the mono-phoneme acoustic model to obtain secondary training data;
obtaining the tri-phoneme acoustic model by training a model with a plurality of feature sequences in the secondary training data according to divided states of a plurality of tri-phonemes corresponding to digits in a plurality of speech segments in the secondary training data;
acquiring a language model by modeling matching relations of the plurality of digits in the primary training data; and
constructing a speech decoding network by using the language model and the tri-phoneme acoustic model obtained by training.
Dependent claims: 2, 3, 4, 5, 6
7. An apparatus for constructing a speech decoding network for recognizing digits in speech, comprising:
a training data acquisition module, configured to acquire primary training data comprising a plurality of speech segments, and each speech segment comprising a plurality of digits;
an acoustic feature extraction module, configured to perform acoustic feature extraction on the primary training data to obtain a plurality of feature sequences from the plurality of speech segments;
an acoustic model acquisition module, configured to perform progressive training to obtain a tri-phoneme acoustic model based on the plurality of feature sequences and a plurality of phonemes corresponding to the digits in the speech segments in the primary training data, including:
obtaining a mono-phoneme acoustic model by training a model with the plurality of feature sequences according to divided states of a plurality of mono-phonemes corresponding to the digits in the plurality of speech segments;
decoding the primary training data with the mono-phoneme acoustic model to obtain secondary training data;
obtaining the tri-phoneme acoustic model by training a model with a plurality of feature sequences in the secondary training data according to divided states of a plurality of tri-phonemes corresponding to digits in a plurality of speech segments in the secondary training data; and
a language model acquisition module, configured to acquire a language model by modeling matching relations of the plurality of digits in the primary training data and construct a speech decoding network by using the acquired language model and the tri-phoneme acoustic model obtained by training.
Dependent claims: 8, 9, 10, 11, 12
13. A non-transitory computer-readable storage medium storing machine-readable instructions that, when executed by a processor, cause the processor to perform:
acquiring primary training data comprising a plurality of speech segments, and each speech segment comprising a plurality of digits;
performing acoustic feature extraction on the primary training data to obtain a plurality of feature sequences from the plurality of speech segments;
performing progressive training to obtain a tri-phoneme acoustic model based on the plurality of feature sequences and a plurality of phonemes corresponding to the digits in the speech segments in the primary training data, including:
obtaining a mono-phoneme acoustic model by training a model with the plurality of feature sequences according to divided states of a plurality of mono-phonemes corresponding to the digits in the plurality of speech segments;
decoding the primary training data with the mono-phoneme acoustic model to obtain secondary training data;
obtaining the tri-phoneme acoustic model by training a model with a plurality of feature sequences in the secondary training data according to divided states of a plurality of tri-phonemes corresponding to digits in a plurality of speech segments in the secondary training data;
acquiring a language model by modeling matching relations of the plurality of digits in the primary training data; and
constructing a speech decoding network by using the language model and the tri-phoneme acoustic model obtained by training.
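The claims describe the language model as modeling "matching relations" of the digits in the training data without fixing a model family. One way to read this, offered purely as an assumption, is a bigram model that estimates how likely each digit is to follow another. The sketch below uses add-one smoothing; the function names and the `<s>` / `</s>` boundary tokens are hypothetical and do not appear in the claims.

```python
import math
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # assumed sentence-boundary tokens


def train_digit_bigram(transcripts, vocab="0123456789"):
    """Estimate add-one-smoothed bigram probabilities over digit strings."""
    counts = defaultdict(Counter)
    for digits in transcripts:
        tokens = [START, *digits, END]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1

    successors = list(vocab) + [END]
    model = {}
    for prev in [START, *vocab]:
        # Add-one smoothing: every successor gets one extra pseudo-count.
        total = sum(counts[prev].values()) + len(successors)
        model[prev] = {
            cur: (counts[prev][cur] + 1) / total for cur in successors
        }
    return model


def log_prob(model, digits):
    """Log-probability of a digit string under the bigram model."""
    tokens = [START, *digits, END]
    return sum(math.log(model[p][c]) for p, c in zip(tokens, tokens[1:]))
```

Trained on transcripts such as "110", "119", and "911", the model assigns a higher probability to "1" following "1" than to an unseen pair such as "1" followed by "5". In a decoding network, scores of this kind would weight the transitions between digit-level paths built from the acoustic model.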
Specification