Constructing speech decoding network for numeric speech recognition

US 10,699,699 B2
Filed: 05/30/2018
Issued: 06/30/2020
Est. Priority Date: 03/29/2016
Status: Active Grant

Abstract

Embodiments of the present disclosure provide a method for constructing a speech decoding network for recognizing spoken digits. The method comprises: acquiring training data obtained by recording spoken digits, the training data comprising a plurality of speech segments, each speech segment comprising a plurality of spoken digits; performing acoustic feature extraction on the training data to obtain a feature sequence corresponding to each speech segment; performing progressive training, starting from a mono-phoneme acoustic model, to obtain an acoustic model; acquiring a language model; and constructing a speech decoding network from the language model and the trained acoustic model.
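
As a rough illustration of the feature-extraction step named in the abstract, the following Python sketch cuts one speech segment into fixed-length frames and builds one feature vector per frame. It is a minimal sketch under stated assumptions, not the patented implementation: the function names, the 25 ms frame and 10 ms hop, the autocorrelation pitch estimator, and the log-energy value that stands in for real MFCC features are all hypothetical choices made for this example.

```python
import math


def split_into_frames(samples, frame_len, hop_len):
    """Cut a list of samples into fixed-length, possibly overlapping frames."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def estimate_pitch_hz(frame, sample_rate, min_hz=60.0, max_hz=400.0):
    """Estimate pitch as the lag with the largest unnormalized autocorrelation."""
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # A silent frame never beats a score of 0.0, so it reports a pitch of 0.0.
    return sample_rate / best_lag if best_lag else 0.0


def log_energy(frame):
    """Placeholder spectral feature (assumption); a real system would compute MFCCs."""
    return math.log(sum(x * x for x in frame) + 1e-10)


def feature_sequence(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Return one [spectral_feature, pitch_hz] vector per frame of the segment."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    return [
        [log_energy(frame), estimate_pitch_hz(frame, sample_rate)]
        for frame in split_into_frames(samples, frame_len, hop_len)
    ]
```

Calling `feature_sequence` on a list of samples returns the per-frame vectors in time order, which is the "feature sequence" shape the later training steps would consume.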

11 Citations

13 Claims

1. A method for constructing a speech decoding network for recognizing digits in speech, comprising:
- acquiring primary training data comprising a plurality of speech segments, and each speech segment comprising a plurality of digits;
  
  performing acoustic feature extraction on the primary training data to obtain a plurality of feature sequences from the plurality of speech segments;
  
  performing progressive training to obtain a tri-phoneme acoustic model based on the plurality of feature sequences and a plurality of phonemes corresponding to the digits in the speech segments in the primary training data, including:
  
  obtaining a mono-phoneme acoustic model by training a model with the plurality of feature sequences according to divided states of a plurality of mono-phonemes corresponding to the digits in the plurality of speech segments;
  
  decoding the primary training data with the mono-phoneme acoustic model to obtain secondary training data;
  
  obtaining the tri-phoneme acoustic model by training a model with a plurality of feature sequences in the secondary training data according to divided states of a plurality of tri-phonemes corresponding to digits in a plurality of speech segments in the secondary training data;
  
  acquiring a language model by modeling matching relations of the plurality of digits in the primary training data; and
  
  constructing a speech decoding network by using the language model and the tri-phoneme acoustic model obtained by training.
  - 2. The method according to claim 1, including, before the step of acquiring the primary training data:
    - recording speech segments containing the plurality of digits to obtain training data according to preset conditions, the plurality of speech segments in the primary training data being corresponding to a same person.
  - 3. The method according to claim 1, wherein performing acoustic feature extraction on the primary training data to obtain the plurality of feature sequences from the plurality of speech segments comprises:
    - segmenting a respective speech segment of the plurality of speech segments according to a preset length to obtain a plurality of speech frames for the respective speech segment;
      
      extracting Mel frequency cepstrum coefficient (MFCC) features and PITCH features from each of the plurality of speech frames included in the respective speech segment; and
      
      calculating to obtain a feature vector of each of the plurality of speech frames via the MFCC features and PITCH features, to further constitute a feature sequence corresponding to each speech segment.
  - 4. The method according to claim 1, wherein training the model with the plurality of feature sequences according to the divided states of the plurality of mono-phonemes to obtain the mono-phoneme acoustic model comprises:
    - performing state description on the plurality of mono-phonemes by using a hidden Markov model (HMM) to obtain the divided states of the plurality of mono-phonemes;
      
      modeling the plurality of feature sequences by using a Gaussian mixture model (GMM) based on the divided states of the plurality of mono-phonemes to obtain a GMM-HMM;
      
      randomly initializing parameters of the GMM-HMM, and performing iterative optimization on the parameters obtained by random initialization by using an expectation maximization algorithm; and
      
      determining that the GMM-HMM is the mono-phoneme acoustic model when the optimized parameters allow the GMM-HMM to be converged.
  - 5. The method according to claim 1, wherein training the model with the plurality of feature sequences in the secondary training data according to the divided states of the plurality of tri-phonemes to obtain the tri-phoneme acoustic model comprises:
    - performing state description on the plurality of tri-phonemes by using a hidden Markov model (HMM) to obtain the divided states of the plurality of tri-phonemes;
      
      modeling the plurality of feature sequences by using a Gaussian mixture model (GMM) to obtain a GMM-HMM based on the divided states of the plurality of tri-phonemes;
      
      performing parameter estimation on parameters of the GMM-HMM according to the secondary training data, and performing iterative optimization on the parameters obtained by parameter estimation by using an expectation maximization algorithm; and
      
      determining that the GMM-HMM is the tri-phoneme acoustic model when the optimized parameters allow the GMM-HMM to be converged.
  - 6. The method according to claim 1, wherein the matching relations comprise:
    - a matching relation between the plurality of digits in the primary training data and phone number arrangement rules, or, a matching relation between the plurality of digits in the primary training data and a predefined list of random codes.
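
Claims 4 and 5 both rely on fitting Gaussian mixtures by expectation maximization until the model converges. The sketch below shows that loop for a one-dimensional mixture only; it omits the HMM state structure entirely and is an illustration under assumptions, not the claimed training procedure. The function names, the optional `init_means` argument, the variance floor, and the log-likelihood stopping rule are hypothetical choices for this example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_em(data, n_components=2, init_means=None, max_iters=200, tol=1e-6, seed=0):
    """Fit a 1-D GMM with EM; returns (weights, means, variances, log_likelihoods)."""
    rng = random.Random(seed)
    if init_means is not None:
        means = list(init_means)
        n_components = len(means)
    else:
        # Random initialization: start each mean at a randomly chosen data point.
        means = rng.sample(data, n_components)
    grand_mean = sum(data) / len(data)
    grand_var = max(sum((x - grand_mean) ** 2 for x in data) / len(data), 1e-6)
    variances = [grand_var] * n_components
    weights = [1.0 / n_components] * n_components
    history = []
    for _ in range(max_iters):
        # E-step: responsibility of each component for each data point.
        resp, log_lik = [], 0.0
        for x in data:
            joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(log_lik)
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for k in range(n_components):
            n_k = sum(r[k] for r in resp)
            weights[k] = n_k / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / n_k
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / n_k
            variances[k] = max(var_k, 1e-6)  # floor keeps the density finite
        # Treat the model as converged once the log-likelihood stops improving.
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return weights, means, variances, history
```

Passing `init_means` mirrors the idea in claim 5 of starting from estimated parameters, while leaving it out mirrors the random start in claim 4. EM is sensitive to its starting point, so a random start may converge slowly or settle on a poor local optimum.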

7. An apparatus for constructing a speech decoding network for recognizing digits in speech, comprising:
- a training data acquisition module, configured to acquire primary training data comprising a plurality of speech segments, and each speech segment comprising a plurality of digits;
  
  an acoustic feature extraction module, configured to perform acoustic feature extraction on the primary training data to obtain a plurality of feature sequences from the plurality of speech segments;
  
  an acoustic model acquisition module, configured to perform progressive training to obtain a tri-phoneme acoustic model based on the plurality of feature sequences and a plurality of phonemes corresponding to the digits in the speech segments in the primary training data, including:
  
  obtaining a mono-phoneme acoustic model by training a model with the plurality of feature sequences according to divided states of a plurality of mono-phonemes corresponding to the digits in the plurality of speech segments;
  
  decoding the primary training data with the mono-phoneme acoustic model to obtain secondary training data;
  
  obtaining the tri-phoneme acoustic model by training a model with a plurality of feature sequences in the secondary training data according to divided states of a plurality of tri-phonemes corresponding to digits in a plurality of speech segments in the secondary training data; and
  
  a language model acquisition module, configured to acquire a language model by modeling matching relations of the plurality of digits in the primary training data and construct a speech decoding network by using the acquired language model and the tri-phoneme acoustic model obtained by training.
  - 8. The apparatus according to claim 7, wherein the apparatus further comprises:
    - a training data recording module, configured to record speech segments containing the plurality of digits according to preset conditions to obtain training data, the plurality of speech segments in the primary training data being corresponding to a same person.
  - 9. The apparatus according to claim 7, wherein the acoustic feature extraction module comprises:
    - a speech segment segmenting unit, configured to segment a respective speech segment of the plurality of speech segments according to a preset length to obtain a plurality of speech frames for the respective speech segment;
      
      a feature sequence generation unit, configured to extract Mel frequency cepstrum coefficient (MFCC) features and PITCH features from each of the plurality of speech frames included in the respective speech segment; and
      
      calculating to obtain a feature vector of each of the plurality of speech frames via the MFCC features and PITCH features, to further constitute a feature sequence corresponding to each speech segment.
  - 10. The apparatus according to claim 7, wherein the acoustic model acquisition module comprises:
    - a first state description unit, configured to perform state description on the plurality of mono-phonemes by using an HMM to obtain the divided states of the plurality of mono-phonemes;
      
      a first modeling unit, configured to model the plurality of feature sequences by using a GMM based on the divided states of the plurality of mono-phonemes to obtain a GMM-HMM; and
      
      a first training unit, configured to randomly initialize parameters of the GMM-HMM and perform iterative optimization on the parameters obtained by random initialization by using an expectation maximization algorithm, the GMM-HMM being determined to be the mono-phoneme acoustic model when the optimized parameters allow the GMM-HMM to be converged.
  - 11. The apparatus according to claim 7, wherein the acoustic model acquisition module further comprises:
    - a second state description unit, configured to perform state description on the plurality of tri-phonemes by using a hidden Markov model (HMM) to obtain the divided states of the plurality of tri-phonemes;
      
      a second modeling unit, configured to model the plurality of feature sequences by using a Gaussian mixture model (GMM) to obtain a GMM-HMM based on the divided states of the plurality of tri-phonemes; and
      
      a second training unit, configured to perform parameter estimation on parameters of the GMM-HMM according to the secondary training data, and perform iterative optimization on the parameters obtained by parameter estimation by using an expectation maximization algorithm, the GMM-HMM being determined to be the tri-phoneme acoustic model when the optimized parameters allow the GMM-HMM to be converged.
  - 12. The apparatus according to claim 7, wherein the language model is obtained by modeling matching relations of the plurality of digits in the primary training data, the matching relation comprising:
    - a matching relation between the plurality of digits in the primary training data and phone number arrangement rules, or, a matching relation between the plurality of digits in the primary training data and a predefined list of random codes.
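
The claims describe the language model only as modeling "matching relations" between the digits and phone number arrangement rules or a list of random codes; they do not name a model family. One simple way to picture such a model is a smoothed bigram model over digit strings, sketched below. The choice of a bigram model, the add-one smoothing, and every name in the code are assumptions made for illustration.

```python
import math
from collections import defaultdict

START, END = "<s>", "</s>"
DIGITS = [str(d) for d in range(10)]
VOCAB = DIGITS + [END]  # symbols that may follow any context


def train_bigram_lm(digit_strings):
    """Count how often each symbol follows each preceding symbol."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in digit_strings:
        tokens = [START] + list(text) + [END]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_prob(counts, prev, cur):
    """Add-one smoothed estimate of P(cur | prev)."""
    row = counts.get(prev, {})
    return (row.get(cur, 0) + 1) / (sum(row.values()) + len(VOCAB))


def sequence_logprob(counts, digit_string):
    """Log-probability of a whole digit string under the bigram model."""
    tokens = [START] + list(digit_string) + [END]
    return sum(math.log(bigram_prob(counts, p, c)) for p, c in zip(tokens, tokens[1:]))
```

Trained on digit strings that follow a numbering pattern, a model like this scores strings with the same pattern higher than arbitrary ones, which is the kind of constraint a decoder can use to prune unlikely digit sequences.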

13. A non-transitory computer-readable storage medium storing machine-readable instructions that, when executed by a processor, cause the processor to perform:
- acquiring primary training data comprising a plurality of speech segments, and each speech segment comprising a plurality of digits;
  
  performing acoustic feature extraction on the primary training data to obtain a plurality of feature sequences from the plurality of speech segments;
  
  performing progressive training to obtain a tri-phoneme acoustic model based on the plurality of feature sequences and a plurality of phonemes corresponding to the digits in the speech segments in the primary training data, including:
  
  obtaining a mono-phoneme acoustic model by training a model with the plurality of feature sequences according to divided states of a plurality of mono-phonemes corresponding to the digits in the plurality of speech segments;
  
  decoding the primary training data with the mono-phoneme acoustic model to obtain secondary training data;
  
  obtaining the tri-phoneme acoustic model by training a model with a plurality of feature sequences in the secondary training data according to divided states of a plurality of tri-phonemes corresponding to digits in a plurality of speech segments in the secondary training data;
  
  acquiring a language model by modeling matching relations of the plurality of digits in the primary training data; and
  
  constructing a speech decoding network by using the language model and the tri-phoneme acoustic model obtained by training.
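
The progressive training recited in the independent claims moves from mono-phoneme units to tri-phoneme units. The sketch below shows only the relabeling involved, that is, how a digit string maps to mono-phonemes and how each mono-phoneme becomes a context-dependent tri-phoneme; it does not train or decode anything. The lexicon entries, the `sil` boundary symbol, and the `left-center+right` label format (a common notation in speech toolkits) are hypothetical choices for this example and are not taken from the patent.

```python
# Hypothetical pronunciation lexicon: digit -> mono-phoneme sequence.
LEXICON = {
    "1": ["w", "ah", "n"],
    "2": ["t", "uw"],
    "3": ["th", "r", "iy"],
}


def digits_to_monophones(digit_string, lexicon=LEXICON):
    """Concatenate the mono-phoneme sequence of every digit in the string."""
    phones = []
    for digit in digit_string:
        phones.extend(lexicon[digit])
    return phones


def to_triphones(phones, boundary="sil"):
    """Label each phoneme with its left and right neighbors."""
    triphones = []
    for i, center in enumerate(phones):
        left = phones[i - 1] if i > 0 else boundary
        right = phones[i + 1] if i < len(phones) - 1 else boundary
        triphones.append(f"{left}-{center}+{right}")
    return triphones
```

Because each tri-phoneme label carries its neighbors, the same phoneme gets a separate model in each context, which is why a tri-phoneme model needs the alignments produced by the earlier mono-phoneme pass as its starting point.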

Current Assignee: Tencent Technology Company Limited (Tencent Holdings Limited)
Original Assignee: Tencent Technology Company Limited (Tencent Holdings Limited)
Inventors: Wu, Fuzhang; Qian, Binghua; Li, Wei; Li, Ke; Wu, Yongjian; Huang, Feiyue
Primary Examiner(s): Jackson, Jakieda R
Application Number: US15/993,332
Publication Number: US 20180277103A1
Time in Patent Office: 762 Days
CPC Class Codes

G10L 15/02   Feature extraction for spee...

G10L 15/04   Segmentation; Word boundary...

G10L 15/063   Training

G10L 15/142   Hidden Markov Models [HMMs]

G10L 15/144   Training of HMMs

G10L 15/187   Phonemic context, e.g. pron...

G10L 2015/025   Phonemes, fenemes or fenone...

G10L 2015/0631   Creating reference template...

G10L 25/24   the extracted parameters be...

G10L 25/90   Pitch determination of spee...
