Using a discretized, higher order representation of hidden dynamic variables for speech recognition

US 20080046245A1
Filed: 08/21/2006
Published: 02/21/2008
Est. Priority Date: 08/21/2006
Status: Active Grant

Abstract

A hidden dynamics value in speech is represented by a higher order, discretized dynamic model, which predicts the discretized dynamic variable that changes over time. Parameters are trained for the model. A decoder algorithm is developed for estimating the underlying phonological speech units in sequence that correspond to the observed speech signal using the higher order, discretized dynamic model.
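
As a rough illustration of what a "higher order, discretized" hidden dynamic could look like, the sketch below simulates a second-order recursion that moves toward a target and snaps each value to an index on a fixed grid. The form of the recursion, the rate `r`, the target, and the grid are assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np

def simulate_indexed_dynamics(target, r, grid, n_frames, x_init=(0.0, 0.0)):
    """Hypothetical second-order recursion over grid indices.

    Each new value depends on the values at the two previous frames and is
    then quantized to the nearest grid point (its "index").
    """
    # Quantize the two starting values so every frame is stored as an index.
    idx = [int(np.argmin(np.abs(grid - x))) for x in x_init]
    for _ in range(n_frames):
        x1, x2 = grid[idx[-1]], grid[idx[-2]]          # two previous frames
        x_next = 2 * r * x1 - r * r * x2 + (1 - r) ** 2 * target
        idx.append(int(np.argmin(np.abs(grid - x_next))))
    return np.array(idx)

# Illustrative numbers only.
grid = np.linspace(0.0, 1.0, 101)
indices = simulate_indexed_dynamics(target=0.8, r=0.7, grid=grid, n_frames=60)
trajectory = grid[indices]
```

Storing indices rather than continuous values is what would let a decoder enumerate the hidden variable directly, at the cost of quantization error set by the grid spacing.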

17 Claims

1. A method of recognizing speech, comprising:
- receiving an observable acoustic value that describes a portion of a speech signal for a current time period under consideration;
  
  identifying a predicted acoustic value for a hypothesized phonological unit based on an indexed articulatory dynamics value depending on indexed articulatory dynamics values calculated for at least two previous time periods; and
  
  comparing the observed value to the predicted value to determine a likelihood of the hypothesized phonological unit.
- Dependent claims (2, 3, 4, 5):
  - 2. The method of claim 1 wherein the indexed articulatory dynamics value comprises a vocal tract resonance (VTR) value.
  - 3. The method of claim 1 wherein identifying a predicted acoustic value comprises:
    - extracting a feature vector from the observed acoustic value; and
      
      applying an articulatory dynamics model, that represents articulatory dynamics hidden in a speech signal, to the feature vector.
  - 4. The method of claim 3 and further comprising:
    - prior to extracting a feature vector, constructing frames from the observed acoustic value, the frames being constructed for individual time periods.
  - 5. The method of claim 4 and further comprising:
    - outputting a selected phonological unit based on a likelihood of the hypothesized phonological unit.
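
The comparison step in claim 1 can be pictured as a Gaussian score of the observed value against a value predicted from the hidden dynamics at the two preceding frames. This is a minimal sketch under assumptions that are not in the claims: a scalar hidden value, a second-order prediction that moves toward a per-unit target, a linear hidden-to-acoustic mapping (`slope`, `offset`), and a fixed observation variance. All names and numbers are hypothetical.

```python
import math

def predict_hidden(x_prev1, x_prev2, target, r):
    """Assumed second-order prediction from the two previous hidden values."""
    return 2 * r * x_prev1 - r * r * x_prev2 + (1 - r) ** 2 * target

def log_likelihood(observed, x_prev1, x_prev2, unit, obs_var=0.01):
    """Gaussian log-likelihood of the observation under one hypothesized unit."""
    x_pred = predict_hidden(x_prev1, x_prev2, unit["target"], unit["r"])
    predicted = unit["slope"] * x_pred + unit["offset"]   # assumed linear mapping
    diff = observed - predicted
    return -0.5 * (math.log(2 * math.pi * obs_var) + diff * diff / obs_var)

# Two hypothetical phonological units with different targets.
units = {
    "aa": {"target": 0.8, "r": 0.7, "slope": 1.0, "offset": 0.0},
    "iy": {"target": 0.2, "r": 0.7, "slope": 1.0, "offset": 0.0},
}
scores = {
    name: log_likelihood(observed=0.62, x_prev1=0.60, x_prev2=0.55, unit=u)
    for name, u in units.items()
}
best_unit = max(scores, key=scores.get)
```

The unit whose predicted value lies closest to the observation receives the highest score, which mirrors the "determine a likelihood of the hypothesized phonological unit" step.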

6. A method of training a model for use in recognizing speech described by an observable input value, comprising:
- receiving observable training data indicative of a plurality of different types of speech; and
  
  training model parameters for an articulatory dynamics model that represents articulatory dynamics of speech that vary continuously over time and are represented by discrete values calculated from the observable training data for time periods, the model parameters being trained based on the discrete values of the articulatory dynamics calculated for at least two previous time periods.
- Dependent claims (7, 8, 9, 10, 11, 12, 13):
  - 7. The method of claim 6 wherein training model parameters comprises:
    - training the model parameters using expectation-maximization in which values of each parameter are first estimated using forward-backward recursion based on estimations of the articulatory dynamics from at least two previous time periods.
  - 8. The method of claim 7 wherein expectation-maximization further includes re-estimating the model parameters based on a current estimation of the model parameters and estimates of the model parameters from at least two previous time periods.
  - 9. The method of claim 8 wherein training model parameters comprises:
    - training a residual value that compensates for inaccuracy or bias of a predicted value of an acoustic feature vector derived from an acoustic value that describes a portion of a speech signal.
  - 10. The method of claim 8 wherein training model parameters comprises:
    - training a hidden target value indicative of a value of the articulatory dynamics targeted by the training data.
  - 11. The method of claim 8 wherein training model parameters comprises:
    - training a precision parameter indicative of a precision of the value of the articulatory dynamics calculated.
  - 12. The method of claim 8 wherein the articulatory dynamic is represented as a distribution and wherein training model parameters comprises:
    - training a mean of the distribution.
  - 13. The method of claim 8 wherein the articulatory dynamic is represented as a distribution and wherein training model parameters comprises:
    - training a precision of the distribution of the articulatory dynamic.
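
Claims 7 to 13 describe expectation-maximization with forward-backward recursion; the sketch below does not implement that. It shows only a simplified, fully observed case to make the roles of a target value and a precision parameter concrete: if the hidden trajectory were known and followed an assumed second-order recursion with a known rate `r`, both quantities would have closed-form estimates. The recursion, the function name, and the synthetic (continuous-valued) data are all assumptions.

```python
import numpy as np

def estimate_target_and_precision(x, r):
    """Closed-form estimates when the trajectory x is treated as observed.

    Assumes x[t] = 2*r*x[t-1] - r**2*x[t-2] + (1-r)**2*target + noise.
    Requires a noisy trajectory; a noise-free one gives zero variance.
    """
    x = np.asarray(x, dtype=float)
    # What is left after removing the contribution of the two previous frames.
    drive = x[2:] - 2 * r * x[1:-1] + r * r * x[:-2]
    target = drive.mean() / (1 - r) ** 2
    precision = 1.0 / (drive - drive.mean()).var()   # inverse noise variance
    return target, precision

# Synthetic trajectory with known parameters (illustrative values only).
rng = np.random.default_rng(0)
r_true, target_true, noise_std = 0.7, 0.8, 0.01
x = [0.0, 0.0]
for _ in range(2000):
    x.append(2 * r_true * x[-1] - r_true ** 2 * x[-2]
             + (1 - r_true) ** 2 * target_true + rng.normal(0.0, noise_std))

target_hat, precision_hat = estimate_target_and_precision(x, r_true)
```

In an EM setting the hidden trajectory is not observed, so sums like the ones above would be replaced by expectations under the posterior over the discretized values.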

14. A speech recognition system comprising:
- a generative model modeling articulatory dynamics hidden in an observed speech signal that extends over multiple time periods and mapping the articulatory dynamics to a measurable characteristic of the observed speech signal, the generative model modeling the articulatory dynamics based on discrete values of the articulatory dynamics estimated for at least two previous time periods; and
  
  a decoder, coupled to the generative model, configured to receive an observed value describing at least a portion of the observed speech signal and to select one or more hypothesized phonological units based on the measurable characteristic output by the generative model, corresponding to the observed value, and based on the observed value.
- Dependent claims (15, 16, 17):
  - 15. The speech recognition system of claim 14 wherein the decoder is configured to select the one or more hypothesized phonological units by identifying a sequence of articulatory dynamics and a corresponding sequence of measurable characteristics using the generative model and selecting a sequence of hypothesized phonological units based on the sequence of measurable characteristics identified.
  - 16. The speech recognition system of claim 14 and further comprising:
    - a training component training parameters of the generative model based on training data indicative of speech having different types of articulatory dynamics.
  - 17. The speech recognition system of claim 16 wherein the training component is configured to train the parameters of the generative model based on indexed articulatory dynamic values calculated from the training data and being of at least a second order.
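
One way to picture a decoder over second-order, discretized dynamics is a Viterbi search whose state is the pair of indices at the two most recent frames, which turns the second-order dependency into a first-order one. The sketch below scores a short observation sequence under each hypothesized unit and keeps the best. It is a simplification under stated assumptions: one unit per segment rather than a unit sequence, a scalar hidden value, an identity hidden-to-acoustic mapping, unnormalized Gaussian scores, and hypothetical unit names and parameters.

```python
import numpy as np

def viterbi_score(obs, grid, target, r, trans_var=1e-3, obs_var=1e-2):
    """Best-path (unnormalized) log score of obs under one hypothesized unit."""
    # pred[a, b]: predicted next value given x[t-2] = grid[a], x[t-1] = grid[b].
    pred = 2 * r * grid[None, :] - r * r * grid[:, None] + (1 - r) ** 2 * target
    # log_trans[a, b, c]: score of moving to grid[c] from the pair (a, b).
    log_trans = -0.5 * (grid[None, None, :] - pred[:, :, None]) ** 2 / trans_var

    def log_obs(o):
        # Identity mapping from hidden value to observation is assumed.
        return -0.5 * (o - grid) ** 2 / obs_var

    # Initialise on the first two frames with a flat prior over index pairs.
    delta = log_obs(obs[0])[:, None] + log_obs(obs[1])[None, :]
    for o in obs[2:]:
        cand = delta[:, :, None] + log_trans            # indexed [a, b, c]
        delta = cand.max(axis=0) + log_obs(o)[None, :]  # new pair is (b, c)
    return float(delta.max())

# Synthetic observations that rise toward 0.8 (illustrative values only).
x = [0.3, 0.3]
for _ in range(18):
    x.append(2 * 0.7 * x[-1] - 0.49 * x[-2] + 0.09 * 0.8)
obs = np.array(x)

grid = np.linspace(0.0, 1.0, 21)
unit_targets = {"aa": 0.8, "iy": 0.2}   # hypothetical units
unit_scores = {u: viterbi_score(obs, grid, t, r=0.7) for u, t in unit_targets.items()}
decoded = max(unit_scores, key=unit_scores.get)
```

The pair-state trick costs memory and time cubic in the grid size per frame, which is why a coarse grid is used here.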

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Deng, Li
Granted Patent: US 7,680,663 B2
US Class Current: 704/256
CPC Class Codes:
G10L 15/02 Feature extraction for spee...
G10L 2015/025 Phonemes, fenemes or fenone...
