Speech models generated using competitive training, asymmetric training, and data boosting

US 8,532,991 B2
Filed: 03/10/2010
Issued: 09/10/2013
Est. Priority Date: 06/17/2005
Status: Expired due to Fees

Abstract

Speech models are trained using one or more of three different training systems. They include competitive training which reduces a distance between a recognized result and a true result, data boosting which divides and weights training data, and asymmetric training which trains different model components differently.

20 Claims

1. A method of training a speech model, comprising:
- obtaining model parameters for the speech model;
  
  processing a known speech input using the speech model with the model parameters to generate a process result;
  
  calculating a distance between a true result and the process result, given the model parameters and the known speech input, the true result comprising a true transcription, the true transcription corresponding to only the following waveform states: silence, noise, onset and speech, instead of a phonetic transcription; and
  
  modifying the model parameters to reduce the distance between the true result and the process result, to obtain a modified model, wherein reducing the distance between the true result and the process result comprises maximizing a function comprising a parameter set for an acoustic model and a super utterance, the super utterance comprising a feature vector sequence, the true result and the process result.
  - 2. The method of claim 1 and further comprising:
    - iterating on the steps of processing, calculating and modifying until the model parameters reach a desired convergence.
  - 3. The method of claim 1 wherein the speech model comprises the acoustic model, and wherein processing a known speech input to generate a process result comprises:
    - performing speech recognition on acoustic data indicative of a known acoustic input, using the acoustic model with the model parameters, to generate a speech recognition result.
  - 4. The method of claim 3 wherein the true result comprises a true transcription of the acoustic data and wherein calculating a distance comprises:
    - calculating a true transcription measure indicative of a probability of generating the true transcription, given the acoustic data and the model parameters;
      
      calculating a speech recognition measure indicative of a probability of generating the speech recognition result, given the acoustic data and the model parameters; and
      
      calculating the distance as the difference between the transcription measure and the speech recognition measure.
  - 5. The method of claim 1 wherein the speech model comprises a speech detection model, and wherein processing a known speech input to generate a process result comprises:
    - performing speech detection on acoustic data indicative of an input signal to generate a detection state output indicative of a decision made by the speech detection model as to whether the input signal represents speech or non-speech.
  - 6. The method of claim 5 wherein the true result comprises a true detection state output indicative of whether the input signal represents speech or non-speech and wherein calculating a distance comprises:
    - calculating a true detection state measure indicative of a probability that the speech detection model will make a decision that the input signal represents the true detection state output, given the acoustic data and the model parameters;
      
      calculating a speech detection state measure indicative of a probability that the speech detection model will make a decision that the input signal represents the detection state output, given the acoustic data and the model parameters; and
      
      calculating the difference as a difference between the true detection state measure and the speech detection state measure.
  - 7. The method of claim 5 wherein performing speech detection comprises:
    - generating the detection state output indicative of whether the input signal represents speech, silence, noise or an onset of speech.
  - 8. The method of claim 1 wherein obtaining model parameters for the speech model comprises:
    - performing maximum likelihood training on training data to obtain an initial model with initial model parameters.
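
The process, calculate, modify and iterate steps of claims 1, 2, 4 and 7 can be sketched on a toy problem. The Python below is an illustration under stated assumptions, not the patented procedure. It models each of the four waveform states as a hypothetical one-dimensional, unit-variance Gaussian and scores frames independently. The distance is taken as the recognized score minus the true score, a sign convention chosen here so that it is never negative, and the update is a simple corrective step. The names `score`, `recognize`, `distance`, `total_distance` and `train`, and the sample data, are invented for the example. The claimed function over a super utterance is not reproduced.

```python
STATES = ("silence", "noise", "onset", "speech")

def score(params, state, x):
    """Log-likelihood of frame x (up to a constant) under the state's Gaussian."""
    return -0.5 * (x - params[state]) ** 2

def recognize(params, x):
    """Process result: the best-scoring waveform state for frame x."""
    return max(STATES, key=lambda s: score(params, s, x))

def distance(params, x, true_state):
    """Recognized score minus true score; zero when the two states agree."""
    return score(params, recognize(params, x), x) - score(params, true_state, x)

def total_distance(params, frames):
    return sum(distance(params, x, t) for x, t in frames)

def train(initial, frames, lr=0.5, tol=1e-9, max_iters=100):
    """Iterate processing, distance calculation and parameter updates."""
    params = dict(initial)  # leave the caller's initial model untouched
    for _ in range(max_iters):
        total = 0.0
        for x, true_state in frames:
            recognized = recognize(params, x)
            total += score(params, recognized, x) - score(params, true_state, x)
            if recognized != true_state:
                # Pull the true state's mean toward the frame and push the
                # competing (recognized) state's mean away from it.
                params[true_state] += lr * (x - params[true_state])
                params[recognized] -= lr * (x - params[recognized])
        if total <= tol:  # assumed convergence test: no remaining distance
            break
    return params

# Hypothetical labeled frames (a single energy-like feature per frame).
frames = [(0.0, "silence"), (1.0, "noise"), (2.0, "onset"), (3.0, "speech")]
# Deliberately poor starting means, so that there is something to correct.
initial = {"silence": 0.8, "noise": 0.6, "onset": 2.4, "speech": 2.2}
trained = train(initial, frames)
print(total_distance(initial, frames), total_distance(trained, frames))
```

With these sample frames, the initial means misrecognize every frame. After training, `recognize` returns the labeled state for each frame and the total distance is zero.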

9. A computer-readable storage device with computer-executable instructions stored thereon which, when executed by a computer, perform a method for training a speech model, the method comprising:
- obtaining model parameters for the speech model;
  
  processing a known speech input using the speech model with the model parameters to generate a process result;
  
  calculating a distance between a true result and the process result, given the model parameters and the known speech input, the true result comprising a true transcription, the true transcription corresponding to only the following waveform states: silence, noise, onset and speech, instead of a phonetic transcription;
  
  modifying the model parameters to reduce the distance between the true result and the process result, to obtain a modified model, wherein reducing the distance between the true result and the process result comprises maximizing a function comprising a parameter set for an acoustic model and a super utterance, the super utterance comprising a feature vector sequence, the true result and the process result; and
  
  iterating on the steps of processing, calculating and modifying until the model parameters reach a desired convergence.
  - 10. The computer-readable storage device of claim 9 wherein the speech model comprises the acoustic model, and wherein processing a known speech input to generate a process result comprises:
    - performing speech recognition on acoustic data indicative of a known acoustic input, using the acoustic model with the model parameters, to generate a speech recognition result.
  - 11. The computer-readable storage device of claim 10 wherein the true result comprises a true transcription of the acoustic data and wherein calculating a distance comprises:
    - calculating a true transcription measure indicative of a probability of generating the true transcription, given the acoustic data and the model parameters;
      
      calculating a speech recognition measure indicative of a probability of generating the speech recognition result, given the acoustic data and the model parameters; and
      
      calculating the distance as the difference between the transcription measure and the speech recognition measure.
  - 12. The computer-readable storage device of claim 9 wherein the speech model comprises a speech detection model, and wherein processing a known speech input to generate a process result comprises:
    - performing speech detection on acoustic data indicative of an input signal to generate a detection state output indicative of a decision made by the speech detection model as to whether the input signal represents speech or non-speech.
  - 13. The computer-readable storage device of claim 12 wherein the true result comprises a true detection state output indicative of whether the input signal represents speech or non-speech and wherein calculating a distance comprises:
    - calculating a true detection state measure indicative of a probability that the speech detection model will make a decision that the input signal represents the true detection state output, given the acoustic data and the model parameters;
      
      calculating a speech detection state measure indicative of a probability that the speech detection model will make a decision that the input signal represents the detection state output, given the acoustic data and the model parameters; and
      
      calculating the difference as a difference between the true detection state measure and the speech detection state measure.
  - 14. The computer-readable storage device of claim 12 wherein performing speech detection comprises:
    - generating the detection state output indicative of whether the input signal represents speech, silence, noise or an onset of speech.
  - 15. The computer-readable storage device of claim 9 wherein obtaining model parameters for the speech model comprises:
    - performing maximum likelihood training on training data to obtain an initial model with initial model parameters.
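
Claims 8 and 15 obtain the initial model by maximum likelihood training. As a minimal sketch, assume a hypothetical model in which each waveform state is a one-dimensional Gaussian over a single frame feature. The maximum likelihood estimate of each state's mean is then the sample mean of the frames labeled with that state. The function name `ml_initial_means` and the sample frames are invented for illustration. A real acoustic model would have many more parameters.

```python
from collections import defaultdict

def ml_initial_means(frames):
    """Return the per-state sample mean, the ML estimate of a Gaussian mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, state in frames:
        sums[state] += x
        counts[state] += 1
    return {state: sums[state] / counts[state] for state in sums}

# Hypothetical labeled training frames: (feature value, waveform state).
training_frames = [
    (0.0, "silence"), (0.2, "silence"),
    (1.0, "noise"),
    (2.0, "onset"),
    (3.0, "speech"), (4.0, "speech"),
]
initial_model = ml_initial_means(training_frames)
print(initial_model)
```

The resulting dictionary would serve as the initial model parameters that later training steps modify.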

16. A method of training a speech model, comprising:
- obtaining model parameters for the speech model, wherein the speech model comprises an acoustic model and wherein obtaining model parameters for the speech model comprises performing maximum likelihood training on training data to obtain an initial model with initial model parameters;
  
  processing a known speech input using the speech model with the model parameters to generate a process result, wherein processing a known speech input to generate a process result comprises performing speech recognition on acoustic data indicative of a known acoustic input, using the acoustic model with the model parameters, to generate a speech recognition result;
  
  calculating a distance between a true result and the process result, given the model parameters and the known speech input, the true result comprising a true transcription, the true transcription corresponding to only the following waveform states: silence, noise, onset and speech, instead of a phonetic transcription;
  
  modifying the model parameters to reduce the distance between the true result and the process result, to obtain a modified model wherein reducing the distance between the true result and the process result comprises maximizing a logarithmic function comprising a parameter set for the acoustic model and a super utterance, the super utterance comprising a feature vector sequence, the true result and the process result; and
  
  iterating on the steps of processing, calculating and modifying until the model parameters reach a desired convergence.
  - 17. The method of claim 16 wherein the true result comprises a true transcription of the acoustic data and wherein calculating a distance comprises:
    - calculating a true transcription measure indicative of a probability of generating the true transcription, given the acoustic data and the model parameters;
      
      calculating a speech recognition measure indicative of a probability of generating the speech recognition result, given the acoustic data and the model parameters; and
      
      calculating the distance as the difference between the transcription measure and the speech recognition measure.
  - 18. The method of claim 16 wherein the speech model comprises a speech detection model, and wherein processing a known speech input to generate a process result comprises:
    - performing speech detection on acoustic data indicative of an input signal to generate a detection state output indicative of a decision made by the speech detection model as to whether the input signal represents speech or non-speech.
  - 19. The method of claim 18 wherein the true result comprises a true detection state output indicative of whether the input signal represents speech or non-speech and wherein calculating a distance comprises:
    - calculating a true detection state measure indicative of a probability that the speech detection model will make a decision that the input signal represents the true detection state output, given the acoustic data and the model parameters;
      
      calculating a speech detection state measure indicative of a probability that the speech detection model will make a decision that the input signal represents the detection state output, given the acoustic data and the model parameters; and
      
      calculating the difference as a difference between the true detection state measure and the speech detection state measure.
  - 20. The method of claim 18 wherein performing speech detection comprises:
    - generating the detection state output indicative of whether the input signal represents speech, silence, noise or an onset of speech.
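
Claim 16 recites maximizing a logarithmic function of the acoustic model parameter set and a super utterance, but the claims do not state the function itself. One form that would be consistent with the claim language is given below. It is offered only as an assumption, and every symbol is introduced here rather than taken from the patent: $\Lambda$ is the parameter set, $X$ the feature vector sequence, $T$ the true result and $R$ the process result.

$$
F(\Lambda; X, T, R) = \log p(X, T \mid \Lambda) - \log p(X, R \mid \Lambda)
$$

If $R$ is the highest-scoring hypothesis under $\Lambda$, then $F \le 0$, with equality when $T$ scores as well as $R$. Under this assumed form, maximizing $F$ over $\Lambda$ therefore narrows the gap between the true result and the process result, in line with the difference of measures described in claim 17.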

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: He, Xiaodong; Wu, Jian
Primary Examiner(s): Armstrong, Angela A
Application Number: US12/720,968
Publication Number: US 20100161330A1
Time in Patent Office: 1,280 Days
Field of Search: None
US Class Current: 704/243
CPC Class Codes: G10L 15/063 Training
