Method and system for acoustic data selection for training the parameters of an acoustic model

US 10,157,610 B2
Filed: 12/21/2017
Issued: 12/18/2018
Est. Priority Date: 08/07/2012
Status: Active Grant


Abstract

A system and method are presented for acoustic data selection of a particular quality for training the parameters of an acoustic model, such as a Hidden Markov Model and Gaussian Mixture Model, for example, in automatic speech recognition systems in the speech analytics field. A raw acoustic model may be trained using a given speech corpus and maximum likelihood criteria. A series of operations are performed, such as a forced Viterbi-alignment, calculations of likelihood scores, and phoneme recognition, for example, to form a subset corpus of training data. During the process, audio files of a quality that does not meet a criterion, such as poor quality audio files, may be automatically rejected from the corpus. The subset may then be used to train a new acoustic model.
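
The flow described in the abstract (train a raw model, score the corpus with it, reject files that fail a quality criterion, retrain on the rest) can be sketched as a small driver. This is only an illustrative skeleton: `Utterance`, `select_and_retrain`, and the `train` and `keep` callables are hypothetical names, and the actual training and scoring steps are left to whatever speech toolkit is in use.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file and its transcription (hypothetical type)."""
    audio_path: str
    transcript: str


def select_and_retrain(
    corpus: Sequence[Utterance],
    train: Callable[[Sequence[Utterance]], Any],
    keep: Callable[[Any, Utterance], bool],
) -> Tuple[Any, List[Utterance]]:
    """Train a raw model, keep utterances that pass `keep`, then retrain.

    `train` stands in for maximum-likelihood training of the acoustic model.
    `keep` stands in for the quality check (alignment scores, phoneme
    recognition accuracy) evaluated with the raw model.
    """
    raw_model = train(corpus)                      # first (raw) acoustic model
    subset = [u for u in corpus if keep(raw_model, u)]
    new_model = train(subset)                      # second model on the subset
    return new_model, subset
```

Passing the scoring logic in as `keep` keeps the driver independent of how the quality criterion is computed.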

26 Claims

1. A computer-implemented method for training acoustic models in an automatic speech recognition system through a selection of acoustic data comprising:
- training a first acoustic model in the automatic speech recognition system using a training-data corpus comprising a plurality of speech audio files and a respective plurality of transcriptions for the plurality of speech audio files;
  
  performing a forced Viterbi alignment of the plurality of speech audio files using the trained first acoustic model in the automatic speech recognition system;
  
  calculating a global frame likelihood score δ for the plurality of speech audio files, wherein the global frame likelihood score δ comprises an average of frame likelihoods over the training-data corpus;
  
  creating a first subset of the training-data corpus comprising one or more audio files by selecting the one or more audio files from the plurality of speech audio files based on the global frame likelihood score δ;
  
  performing a phoneme recognition of the plurality of speech audio files using the trained first acoustic model and the respective plurality of transcriptions in the automatic speech recognition system;
  
  calculating a global phoneme recognition accuracy ν for the plurality of speech audio files;
  
  creating a second subset of the training-data corpus comprising audio files retained from the one or more audio files of the first subset of the training-data corpus which meet at least one predetermined criterion indicating that an audio file has good audio quality; and
  
  training a second acoustic model in the automatic speech recognition system using the second subset of the training-data corpus.
- - 2. The method of claim 1, wherein the training the first acoustic model further comprises:
    - calculating a maximum likelihood criterion of the training-data corpus; and
      
      estimating parameters of a probability distribution of the first acoustic model that maximize the maximum likelihood criterion.
  - 3. The method of claim 1, wherein the first acoustic model comprises a Hidden Markov Model and a Gaussian Mixture Model.
  - 4. The method of claim 1, wherein the performing the forced Viterbi alignment further comprises:
    - obtaining a total likelihood score $\alpha_r$ for each of the plurality of speech audio files.
  - 5. The method of claim 4, wherein $\alpha_r = p(x_1 \mid q_1) \prod_{i=2}^{N} P(q_i \mid q_{i-1})\, p(x_i \mid q_i)$, where $P(q_i \mid q_{i-1})$ represents a Hidden Markov Model state transition probability between states ‘i−1’ and ‘i’ and $p(x_i \mid q_i)$ represents a state emission likelihood of a feature vector $x_i$ being present in a state $q_i$.
  - 6. The method of claim 4, further comprising using a mathematical equation
  - 7. The method of claim 1, wherein performing the forced Viterbi alignment further comprises:
    - determining an average frame likelihood score $\beta_r$ for each of the plurality of speech audio files.
  - 8. The method of claim 1, further comprises:
    - calculating a phoneme recognition accuracy γ for each of the plurality of speech audio files.
  - 9. The method of claim 1, wherein the at least one predetermined criterion comprising at least one criterion selected from a group comprising:
    - a first criterion based on an average frame likelihood score β of the retained speech audio files and the global frame likelihood score δ; and
      
      a second criterion based on a phoneme recognition accuracy γ of the retained speech audio files and the global phoneme recognition accuracy ν.
  - 10. The method of claim 9, wherein the first criterion comprises determining whether the average frame likelihood score of the retained speech audio files satisfies a criterion $\beta_r \geq \delta + \Delta$, where $\Delta$ is a first predetermined threshold, and wherein the second criterion comprises determining whether the phoneme recognition accuracy $\gamma_g$ of the retained speech audio files satisfies a criterion $\gamma_g \geq \nu + \mu$, where $\mu$ is a second predetermined threshold.
  - 11. The method of claim 10, wherein $\Delta = -0.18$.
  - 12. The method of claim 10, wherein $\mu = -0.2\nu$.
  - 13. The method of claim 1, further comprising using a mathematical equation
  - 14. The method of claim 1, further comprising using a mathematical equation
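
The two criteria of claims 9 through 12 can be illustrated with a short sketch. Several points here are assumptions rather than statements of the claimed method: the per-file score β is treated as an average log-likelihood per frame, δ is computed as a frame-weighted mean of those scores, ν is computed as a plain mean of per-file accuracies, and the two criteria are applied one after the other. `FileScore` and the function names are hypothetical; only the constants Δ = −0.18 and μ = −0.2ν come from claims 11 and 12.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FileScore:
    """Per-file statistics produced with the first acoustic model (hypothetical type)."""
    name: str
    num_frames: int
    avg_frame_loglik: float   # beta for this file (assumed log domain)
    phoneme_accuracy: float   # gamma for this file, in [0, 1]


def global_frame_score(scores: Sequence[FileScore]) -> float:
    """delta: frame-weighted average of the per-file frame scores (assumption)."""
    total_frames = sum(s.num_frames for s in scores)
    return sum(s.avg_frame_loglik * s.num_frames for s in scores) / total_frames


def global_phoneme_accuracy(scores: Sequence[FileScore]) -> float:
    """nu: mean of per-file phoneme recognition accuracies (assumption)."""
    return sum(s.phoneme_accuracy for s in scores) / len(scores)


def select_files(
    scores: Sequence[FileScore],
    delta_offset: float = -0.18,   # Delta from claim 11
    mu_factor: float = -0.2,       # mu = mu_factor * nu, from claim 12
) -> Tuple[List[FileScore], List[FileScore]]:
    """Return (first_subset, second_subset) of the scored files."""
    delta = global_frame_score(scores)
    nu = global_phoneme_accuracy(scores)
    # First criterion: beta >= delta + Delta
    first = [s for s in scores if s.avg_frame_loglik >= delta + delta_offset]
    # Second criterion: gamma >= nu + mu, applied to the first subset
    mu = mu_factor * nu
    second = [s for s in first if s.phoneme_accuracy >= nu + mu]
    return first, second
```

With negative values for both thresholds, a file only needs to score somewhat below the corpus-wide averages to be retained, so the filter removes outliers rather than the lower half of the corpus.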

15. A computer-implemented method for training acoustic models in an automatic speech recognition system comprising:
- training a first acoustic model in the automatic speech recognition system using a speech corpus comprising a plurality of speech audio files and a respective plurality of transcriptions for the plurality of speech audio files by calculating a maximum likelihood criterion of the speech corpus and estimating parameters of a probability distribution of the first acoustic model that maximize the maximum likelihood criterion;
  
  performing a forced Viterbi alignment of the plurality of speech audio files using the trained first acoustic model in the automatic speech recognition system;
  
  calculating a global frame likelihood score δ for the plurality of speech audio files, wherein the global frame likelihood score δ comprises an average of frame likelihoods over the speech corpus;
  
  creating a first subset of the speech corpus comprising one or more audio files by selecting the one or more audio files from the plurality of speech audio files based on the global frame likelihood score δ;
  
  performing a phoneme recognition of the plurality of speech audio files using the trained first acoustic model and the respective plurality of transcriptions in the automatic speech recognition system;
  
  calculating a global phoneme recognition accuracy ν for the plurality of speech audio files;
  
  creating a second subset of the speech corpus comprising audio files retained from the one or more audio files of the first subset of the speech corpus which meet at least one predetermined criterion indicating that an audio file has good audio quality; and
  
  training a second acoustic model in the automatic speech recognition system with said second subset of the speech corpus.
- - 16. The method of claim 15, wherein the at least one predetermined criterion comprises at least one criterion selected from a group comprising:
    - a first criterion based on an average frame likelihood score β of the retained speech audio files and the global frame likelihood score δ; and
      
      a second criterion based on a phoneme recognition accuracy γ of the retained speech audio files and the global phoneme recognition accuracy ν.
  - 17. The method of claim 16, wherein the first criterion comprises determining whether an average frame likelihood score $\beta_r$ of the retained speech audio files satisfies a criterion $\beta_r \geq \delta + \Delta$, where $\Delta$ is a first predetermined threshold, and wherein the second criterion comprises determining whether a phoneme recognition accuracy $\gamma_g$ of the retained speech audio files satisfies a criterion $\gamma_g \geq \nu + \mu$, where $\mu$ is a second predetermined threshold.
  - 18. The method of claim 17, further comprising using a mathematical equation:
  - 19. The method of claim 15, wherein performing the forced Viterbi alignment further comprises:
    - obtaining a total likelihood score $\alpha_r$ for each audio file of the plurality of speech audio files.
  - 20. The method of claim 19, wherein the total likelihood score $\alpha_r$ is obtained using a mathematical equation: $\alpha_r = p(x_1 \mid q_1) \prod_{i=2}^{N} P(q_i \mid q_{i-1})\, p(x_i \mid q_i)$, where $P(q_i \mid q_{i-1})$ represents a Hidden Markov Model state transition probability between states ‘i−1’ and ‘i’ and $p(x_i \mid q_i)$ represents a state emission likelihood of a feature vector $x_i$ being present in a state $q_i$.
  - 21. The method of claim 19, wherein an average frame likelihood score of an audio file is obtained using a mathematical equation:
  - 22. The method of claim 19, wherein the speech corpus contains varying quality audio files.
  - 23. The method of claim 15, wherein performing the forced Viterbi alignment further comprises:
    - determining an average frame likelihood score for each of the plurality of speech audio files.
  - 24. The method of claim 15, further comprises:
    - a phoneme recognition accuracy γ for each of the plurality of speech audio files.
  - 25. The method of claim 15, wherein creating the subset speech corpus comprises automatically rejecting bad quality files and transcriptions from the speech corpus.
  - 26. The method of claim 15, further comprising using a mathematical equation:
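
As a worked illustration of the total likelihood score in claims 5 and 20, the product along a forced-alignment state path can be accumulated in the log domain to avoid numerical underflow on long files. The function names and the table layout (`trans[prev][cur]`, `emit[frame][state]`) are hypothetical. The per-frame average shown here, log α_r divided by the frame count N, is an assumption, since the equation for the average frame likelihood score is not reproduced in the claim text above.

```python
import math
from typing import Sequence


def total_log_likelihood(
    states: Sequence[int],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[float]],
) -> float:
    """Return log(alpha_r) for one file along its aligned state path.

    states[i] is the aligned state q for frame i (0-based here, 1-based in
    the claims); trans[a][b] is P(q=b | previous q=a); emit[i][s] is the
    emission likelihood p(x_i | q_i = s) of frame i in state s.
    """
    if not states or len(states) != len(emit):
        raise ValueError("need one aligned state per frame")
    # First frame contributes only its emission likelihood p(x_1 | q_1).
    total = math.log(emit[0][states[0]])
    # Each later frame contributes a transition and an emission term.
    for i in range(1, len(states)):
        prev, cur = states[i - 1], states[i]
        total += math.log(trans[prev][cur]) + math.log(emit[i][cur])
    return total


def average_frame_log_likelihood(
    states: Sequence[int],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[float]],
) -> float:
    """Assumed per-frame score: log(alpha_r) divided by the frame count N."""
    return total_log_likelihood(states, trans, emit) / len(states)
```

For a three-frame path `[0, 0, 1]` with `trans = [[0.9, 0.1], [0.2, 0.8]]` and `emit = [[0.5, 0.1], [0.4, 0.3], [0.2, 0.6]]`, the product is 0.5 × (0.9 × 0.4) × (0.1 × 0.6) = 0.0108, so `total_log_likelihood` returns the natural log of 0.0108.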


Current Assignee: Genesys Telecommunications Laboratories Incorporated (Genesys Cloud Services Incorporated)
Original Assignee: Interactive Intelligence Group Incorporated (Genesys Cloud Services Incorporated)
Inventors: Tyagi, Vivek; Ganapathiraju, Aravind; Wyss, Felix Immanuel
Primary Examiner(s): Patel, Shreyans A
Application Number: US15/850,106
Publication Number: US 20180114525A1
Time in Patent Office: 362 Days
CPC Class Codes:
G10L 15/063   Training
G10L 15/144   Training of HMMs
G10L 2015/025   Phonemes, fenemes or fenone...
