Using a discretized, higher order representation of hidden dynamic variables for speech recognition

US 7,680,663 B2
Filed: 08/21/2006
Issued: 03/16/2010
Est. Priority Date: 08/21/2006
Status: Active Grant

Abstract

A hidden dynamics value in speech is represented by a higher order, discretized dynamic model, which predicts the discretized dynamic variable that changes over time. Parameters are trained for the model. A decoder algorithm is developed for estimating the underlying phonological speech units in sequence that correspond to the observed speech signal using the higher order, discretized dynamic model.

12 Claims

1. A method of recognizing speech, comprising:
- training parameters of a generative model based on speech training data indicative of indexed articulatory dynamic values calculated from the speech in the training data having different types of articulatory dynamics, the articulatory dynamic values being of at least second order and being represented by a distribution and the parameters of the generative model including a precision parameter trained based on a precision of the distribution of the articulatory dynamic;
- receiving an observable acoustic value that describes a portion of a speech signal for a current time period under consideration;
- identifying a predicted acoustic value for a hypothesized phonological unit, using the generative model, based on the indexed articulatory dynamics values and depending on indexed articulatory dynamics values calculated for at least two previous time periods; and
- comparing the observed value to the predicted value to determine a likelihood of the hypothesized phonological unit.
- Dependent claims (2, 3, 4, 5):
  - 2. The method of claim 1 wherein the indexed articulatory dynamics value comprises a vocal tract resonance (VTR) value.
  - 3. The method of claim 1 wherein identifying a predicted acoustic value comprises:
    - extracting a feature vector from the observed acoustic value; and
    - applying the generative model, that represents articulatory dynamics hidden in a speech signal, to the feature vector.
  - 4. The method of claim 3 and further comprising:
    - prior to extracting a feature vector, constructing frames from the observed acoustic value, the frames being constructed for individual time periods.
  - 5. The method of claim 4 and further comprising:
    - outputting a selected phonological unit based on a likelihood of the hypothesized phonological unit.
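
The steps recited in claim 1 can be illustrated with a minimal sketch. Nothing in it is taken from the patent's specification: the form of the second-order recursion, the coefficient `r`, the uniform quantization grid, the linear map in `predict_acoustic`, the unit names, and every numeric value are assumptions chosen only to show how a prediction that depends on indexed values from two previous time periods can be compared with an observation under a Gaussian with a precision parameter.

```python
import math

def quantize(x, step=0.1):
    """Map a continuous value to the index of its bin (assumed uniform grid)."""
    return round(x / step)

def predict_hidden(i_prev1, i_prev2, target, r=0.8, step=0.1):
    """Second-order prediction from the indexed values at t-1 and t-2.

    Assumed form: x_t = 2r*x_{t-1} - r^2*x_{t-2} + (1-r)^2 * target.
    """
    x1, x2 = i_prev1 * step, i_prev2 * step
    return 2 * r * x1 - r * r * x2 + (1 - r) ** 2 * target

def predict_acoustic(x, slope=2.0, offset=0.5):
    """Hypothetical linear map from the hidden value to an acoustic feature."""
    return slope * x + offset

def log_likelihood(observed, predicted, precision):
    """Gaussian log-likelihood of the observation; precision = 1 / variance."""
    diff = observed - predicted
    return 0.5 * math.log(precision / (2 * math.pi)) - 0.5 * precision * diff * diff

# Two hypothesized units that differ only in their (assumed) targets.
targets = {"unit_a": 1.0, "unit_b": 2.0}
i1, i2 = quantize(0.9), quantize(0.8)  # indexed values for t-1 and t-2
observed = 2.4
scores = {
    unit: log_likelihood(
        observed, predict_acoustic(predict_hidden(i1, i2, tgt)), precision=4.0
    )
    for unit, tgt in targets.items()
}
best = max(scores, key=scores.get)
```

In this sketch a larger precision penalizes the same prediction error more heavily, so the trained precision controls how sharply the likelihood separates competing hypotheses.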

6. A method of training a model for use in recognizing speech described by an observable input value, comprising:
- receiving observable training data indicative of a plurality of different types of speech; and
- training model parameters for an articulatory dynamics model that represents articulatory dynamics of speech that vary continuously over time and are represented by discrete values calculated from the observable training data for time periods, the model parameters being trained based on the discrete values of the articulatory dynamics calculated for at least two previous time periods;
- wherein training model parameters comprises:
  - training the model parameters using expectation-maximization in which values of each parameter are first estimated using forward-backward recursion based on estimations of the articulatory dynamics from at least two previous time periods by re-estimating the model parameters based on a current estimation of the model parameters and estimates of the model parameters from at least two previous time periods; and
  - training a precision parameter indicative of a precision of the value of the articulatory dynamics calculated.
- Dependent claims (7, 8, 9, 10):
  - 7. The method of claim 6 wherein training model parameters comprises:
    - training a residual value that compensates for inaccuracy or bias of a predicted value of an acoustic feature vector derived from an acoustic value that describes a portion of a speech signal.
  - 8. The method of claim 6 wherein training model parameters comprises:
    - training a hidden target value indicative of a value of the articulatory dynamics targeted by the training data.
  - 9. The method of claim 6 wherein the articulatory dynamic is represented as a distribution and wherein training model parameters comprises:
    - training a mean of the distribution.
  - 10. The method of claim 6 wherein the articulatory dynamic is represented as a distribution and wherein training model parameters comprises:
    - training a precision of the distribution of the articulatory dynamic.
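
Claims 9 and 10 recite training the mean and the precision of a distribution. As a hedged illustration, the sketch below computes a weighted mean and precision (the reciprocal of the variance) for a one-dimensional Gaussian. The function name `weighted_mean_and_precision` is hypothetical, and the weights merely stand in for the posterior probabilities that an expectation step would supply; the patent's own update equations are not reproduced here.

```python
def weighted_mean_and_precision(values, weights):
    """Re-estimate the mean and precision (1 / variance) of a Gaussian.

    `weights` stand in for posteriors from an E-step; here they are plain
    numbers supplied by the caller (an assumption of this sketch).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mean = sum(w * v for w, v in zip(weights, values)) / total
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    if variance == 0:
        raise ValueError("zero variance: precision is undefined")
    return mean, 1.0 / variance

mean, precision = weighted_mean_and_precision([1.0, 2.0, 3.0], [1.0, 2.0, 1.0])
```

Repeating this re-estimation with refreshed weights is the general shape of an expectation-maximization loop, although the forward-backward recursion recited in claim 6 is not shown.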

11. A speech recognition system comprising:
- a generative model modeling articulatory dynamics hidden in an observed speech signal that extends over multiple time periods and mapping the articulatory dynamics to a measurable characteristic of the observed speech signal, the generative model modeling the articulatory dynamics based on discrete values of the articulatory dynamics estimated for at least two previous time periods;
- a decoder, coupled to the generative model, receiving an observed value describing at least a portion of the observed speech signal and selecting one or more hypothesized phonological units based on the measurable characteristic output by the generative model, corresponding to the observed value, and based on the observed value; and
- a training component training parameters of the generative model based on training data indicative of speech having different types of articulatory dynamics, wherein the training component trains the parameters of the generative model based on indexed articulatory dynamic values calculated from the training data and being of at least a second order, and the training component training one of the parameters of the generative model as a precision parameter indicative of a precision of the value of the articulatory dynamics calculated.
- Dependent claims (12):
  - 12. The speech recognition system of claim 11 wherein the decoder is configured to select the one or more hypothesized phonological units by identifying a sequence of articulatory dynamics and a corresponding sequence of measurable characteristics using the generative model and selecting a sequence of hypothesized phonological units based on the sequence of measurable characteristics identified.
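
Claim 12 describes a decoder that selects a sequence of hypothesized phonological units. The sketch below substitutes ordinary Viterbi dynamic programming over per-frame log-likelihoods, a technique the claims do not name. The `switch_penalty` parameter, the unit labels, and the frame scores are assumptions made for illustration, and the scores are taken as given rather than computed from a generative model.

```python
def viterbi(frame_scores, switch_penalty=1.0):
    """Return the best-scoring unit sequence for per-frame log-likelihoods.

    frame_scores: list of dicts mapping unit name -> log-likelihood per frame.
    switch_penalty: assumed cost subtracted whenever the unit changes.
    """
    units = list(frame_scores[0])
    best = {u: frame_scores[0][u] for u in units}
    back = []  # one dict of back-pointers per frame after the first
    for scores in frame_scores[1:]:
        new_best, pointers = {}, {}
        for u in units:
            def arrive(p, u=u):
                # Score of reaching unit u from previous unit p.
                return best[p] - (switch_penalty if p != u else 0.0)
            prev = max(units, key=arrive)
            new_best[u] = arrive(prev) + scores[u]
            pointers[u] = prev
        best = new_best
        back.append(pointers)
    # Trace the best final unit back to the first frame.
    path = [max(units, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

frames = [
    {"a": -0.1, "b": -2.0},
    {"a": -0.2, "b": -1.5},
    {"a": -3.0, "b": -0.1},
]
path = viterbi(frames)
```

The penalty discourages spurious unit changes between adjacent frames; setting it to zero reduces the search to an independent per-frame choice.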

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Deng, Li
Primary Examiner(s): Sked; Matthew J
Application Number: US11/507,169
Publication Number: US 20080046245A1
Time in Patent Office: 1,303 Days
Field of Search: None
US Class Current: 704/255
CPC Class Codes:
G10L 15/02 Feature extraction for spee...
G10L 2015/025 Phonemes, fenemes or fenone...
