Segment-based apparatus and method for speech recognition by analyzing multiple speech unit frames and modeling both temporal and spatial correlation

US 5,625,749 A
Filed: 08/22/1994
Issued: 04/29/1997
Est. Priority Date: 08/22/1994
Status: Expired due to Fees

Abstract

Phonetic recognition is provided by capturing dynamical behavior and statistical dependencies of the acoustic attributes used to represent a subject speech waveform. A segment based framework is employed. Temporal behavior is modelled explicitly by creating dynamic templates, called tracks, of the acoustic attributes used to represent the speech waveform, and by generating the estimation of the acoustic spatio-temporal correlation structure. An error model represents this estimation as the temporal and spatial correlations between the input speech waveform and track generated speech segment. Models incorporating these two components (track and error estimation) are created for both phonetic units and for phonetic transitions. Phonetic contextual influences are accounted for by merging context-dependent tracks and pooling error statistics over the different contexts. This allows for a large number of contextual models without compromising the robustness of the statistical parameter estimates. The transition models also supply contextual information.
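
The track and error ideas in the abstract can be illustrated with a minimal sketch. It assumes a track is stored as a short list of attribute vectors and is stretched to the length of the observed segment by linear interpolation; the abstract does not specify the generation procedure, so the interpolation and the names `generate_segment` and `error_sequence` are hypothetical.

```python
from typing import List

Frame = List[float]  # one vector of acoustic attributes per frame


def generate_segment(track: List[Frame], n_frames: int) -> List[Frame]:
    """Stretch or compress a track to n_frames frames.

    Linear interpolation is an assumption made for illustration only.
    """
    if n_frames < 1 or not track:
        raise ValueError("need a non-empty track and at least one frame")
    last = len(track) - 1
    segment = []
    for i in range(n_frames):
        # Position of output frame i on the track's own time axis.
        pos = i * last / (n_frames - 1) if n_frames > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, last)
        w = pos - lo
        segment.append([(1 - w) * a + w * b for a, b in zip(track[lo], track[hi])])
    return segment


def error_sequence(observed: List[Frame], track: List[Frame]) -> List[Frame]:
    """Frame-by-frame difference between observed and track-generated frames."""
    synthetic = generate_segment(track, len(observed))
    return [[o - s for o, s in zip(obs, syn)] for obs, syn in zip(observed, synthetic)]


# Hypothetical two-attribute track and a three-frame observed segment.
track = [[0.0, 10.0], [4.0, 10.0]]
observed = [[1.0, 10.0], [2.0, 11.0], [5.0, 10.0]]
errors = error_sequence(observed, track)
```

In this sketch the error sequence has one error frame per observation frame, so its length still depends on the duration of the segment.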

102 Citations

26 Claims

1. In a digital processor, speech recognition apparatus for decoding an input speech signal to a corresponding speech unit, the apparatus comprising:
- a source providing an input speech signal formed of multiple observation frames;
  
  a plurality of unit templates, each unit template for representing acoustic attributes of a respective speech unit and each unit template generating a respective synthetic segment indicative of the respective speech unit;
  
  a plurality of error models associated with the unit templates, each unit template having an error model for explicitly measuring and quantitatively representing temporal and spatial correlations between the synthetic segments and a subject speech signal, the temporal and spatial correlations being between acoustic attributes in the observation frames of the subject speech signal; and
  
  processor means coupled to the unit templates and error models and coupled to the source to receive the input speech signal, the processor means comparing the synthetic segments to different plural observation frames of the input speech signal to define a set of error sequences and based on the error models, the processor means analyzing the error sequences and determining the corresponding speech unit of the input speech signal.
- Dependent claims 2 to 11:
  - 2. Apparatus as claimed in claim 1 wherein the unit templates employ a generation function to generate the synthetic segments.
  - 3. Apparatus as claimed in claim 2 wherein the generation function is used to form each unit template.
  - 4. Apparatus as claimed in claim 1 wherein each error model is formed from a probability density function; and
    - the processor means determines the corresponding speech unit of the input speech signal to be the respective speech unit of the unit template corresponding to the most likely error model.
  - 5. Apparatus as claimed in claim 1 wherein each error model is formed from a distance metric; and
    - the processor means determines the corresponding speech unit of the input speech signal to be the respective speech unit of the unit template corresponding to the best error model.
  - 6. Apparatus as claimed in claim 1 wherein each error sequence is normalized to a single error feature vector of fixed dimension before the processor means generates the error models.
  - 7. Apparatus as claimed in claim 1 wherein the plurality of unit templates includes transition unit templates for representing acoustic transition dynamics between speech units within a speech signal.
  - 8. Apparatus as claimed in claim 7 wherein the transition unit templates provide an indication of one of location of a transition in the input speech signal and speech units involved in the transition.
  - 9. Apparatus as claimed in claim 1 further comprising a multiplicity of merged templates formed by a combination of a plurality of unit templates.
  - 10. Apparatus as claimed in claim 1 wherein certain ones of the unit templates are templates for representing context-dependent acoustic attributes of a respective speech unit.
  - 11. Apparatus as claimed in claim 1 wherein the respective speech unit for each unit template is a phonetic unit or a string of phonetic units.
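
Claims 4 and 5 can be illustrated with a minimal sketch in which each error model is a full-covariance Gaussian over a fixed-dimension error vector, and the recognized unit is the one whose error model assigns the highest log-likelihood. The Gaussian form is an assumption (claim 4 only requires a probability density function), and the names `gaussian_log_pdf` and `most_likely_unit` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def cholesky(a: Matrix) -> Matrix:
    """Lower-triangular factor L with L * L^T == a (a must be positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("covariance must be positive definite")
                l[i][j] = math.sqrt(s)
            else:
                l[i][j] = s / l[j][j]
    return l


def gaussian_log_pdf(x: Vector, mean: Vector, cov: Matrix) -> float:
    """Log density of a multivariate Gaussian with full covariance."""
    n = len(x)
    l = cholesky(cov)
    d = [xi - mi for xi, mi in zip(x, mean)]
    # Forward substitution: solve L z = d, so z.z is the squared Mahalanobis distance.
    z: Vector = []
    for i in range(n):
        z.append((d[i] - sum(l[i][k] * z[k] for k in range(i))) / l[i][i])
    mahalanobis_sq = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(l[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + mahalanobis_sq)


def most_likely_unit(
    error_vectors: Dict[str, Vector],
    error_models: Dict[str, Tuple[Vector, Matrix]],
) -> str:
    """Pick the unit whose own error model best explains its own error vector."""
    return max(
        error_vectors,
        key=lambda unit: gaussian_log_pdf(error_vectors[unit], *error_models[unit]),
    )


# Hypothetical units, error vectors and models, for illustration only.
models = {
    "aa": ([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]]),
    "iy": ([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]]),
}
vectors = {"aa": [0.1, 0.0], "iy": [2.0, -1.5]}
best = most_likely_unit(vectors, models)
```

The off-diagonal covariance entries are what let such a model represent correlations between error components. Using only the squared Mahalanobis distance and choosing the smallest value would be one possible reading of the distance-metric variant in claim 5.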

12. In a digital processor, a method for decoding an input speech signal to a corresponding speech unit comprising the steps of:
- providing an input speech signal formed of multiple observation frames;
  
  providing a plurality of unit templates in stored memory of the digital processor, each unit template for representing acoustic attributes of a respective speech unit and for generating a respective target speech unit;
  
  providing a plurality of error models associated with the unit templates in stored memory, each unit template having an error model for explicitly measuring and quantitatively representing temporal and spatial correlations between the synthetic segments and a subject speech signal, the temporal and spatial correlations being between acoustic attributes in the observation frames of the subject speech signal;
  
  receiving the input speech signal in working memory of the digital processor;
  
  comparing the target speech units with different plural observation frames of the input speech signal in working memory such that the comparison defines a set of error sequences in working memory; and
  
  using the error models, analyzing the error sequences and determining the corresponding speech unit of the input speech signal.
- Dependent claims 13 to 23:
  - 13. A method as claimed in claim 12 wherein the unit templates employ a generation function to generate the target speech units.
  - 14. A method as claimed in claim 13 wherein the generation function is used to form each unit template.
  - 15. A method as claimed in claim 12 wherein:
    - the step of generating the error models includes forming each error model from a probability density function; and
      
      the step of determining the corresponding speech unit includes determining a most likely error model such that the respective speech unit of the unit template corresponding to the most likely error model is the corresponding speech unit of the input speech signal.
  - 16. A method as claimed in claim 12 wherein:
    - the step of generating the error models includes forming each error model from a distance metric; and
      
      the step of determining the corresponding speech unit includes determining a best error model, such that the respective speech unit of the unit template corresponding to the best error model is the corresponding speech unit of the input speech signal.
  - 17. A method as claimed in claim 12 further comprising the step of normalizing each error sequence to a single error feature vector of fixed dimension before generating the error models.
  - 18. A method as claimed in claim 17 wherein the step of normalizing includes averaging across each error sequence.
  - 19. A method as claimed in claim 12 wherein the step of providing a plurality of unit templates includes providing transition unit templates for representing acoustic transition dynamics between speech units within a speech signal.
  - 20. A method as claimed in claim 19 wherein the transition unit templates provide an indication of one of location of a transition in the input speech signal and speech units involved in the transition.
  - 21. A method as claimed in claim 12 wherein the step of providing a plurality of unit templates includes combining a plurality of unit templates to form a multiplicity of merged templates that account for contextual effects on the respective speech units of the unit templates.
  - 22. A method as claimed in claim 12 wherein the step of providing a plurality of unit templates includes providing a multiplicity of templates for representing context dependent acoustic attributes of a respective speech unit.
  - 23. A method as claimed in claim 12 wherein the step of providing a plurality of unit templates includes providing phonetic unit templates for representing one of phonetic units of speech and strings of phonetic units of speech.
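
The normalization step of claims 17 and 18 can be sketched as follows. The sketch assumes the error sequence is split into a fixed number of roughly equal sub-segments, the error frames are averaged within each, and the averages are concatenated. Claim 18 only states that normalizing includes averaging across each error sequence, so the sub-segment split, the parameter `n_parts`, and the function name are hypothetical.

```python
from typing import List

Frame = List[float]


def normalize_error_sequence(errors: List[Frame], n_parts: int = 3) -> List[float]:
    """Map a variable-length error sequence to a vector of n_parts * dim values.

    With n_parts=1 this is a plain average over the whole sequence.
    """
    n = len(errors)
    if n == 0 or n_parts < 1:
        raise ValueError("need at least one error frame and one part")
    dim = len(errors[0])
    feature: List[float] = []
    for p in range(n_parts):
        start = p * n // n_parts
        # Guarantee a non-empty chunk even when the sequence has fewer frames than parts.
        stop = max((p + 1) * n // n_parts, start + 1)
        chunk = errors[start:stop]
        feature.extend(sum(f[d] for f in chunk) / len(chunk) for d in range(dim))
    return feature


# Two hypothetical error sequences of different length map to the same dimension.
short = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0], [7.0, 7.0]]
long_seq = short + [[7.0, 7.0]] * 3
v_short = normalize_error_sequence(short)
v_long = normalize_error_sequence(long_seq)
```

Keeping several sub-segment averages rather than one overall average is a design choice of this sketch: it retains coarse information about where in the segment an error occurred.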

24. In a digital processor, speech recognition apparatus for decoding an input speech signal to a corresponding speech unit, the apparatus comprising:
- a source providing an input speech signal formed of multiple observation frames;
  
  a plurality of unit templates, each unit template for representing acoustic attributes of a respective speech unit and each unit template generating a respective synthetic segment indicative of the respective speech unit;
  
  a plurality of error models associated with the unit templates, each unit template having an error model; and
  
  processor means coupled to the unit templates and error models and coupled to the source to receive the input speech signal, the processor means comparing the synthetic segments to different plural observation frames of the input speech signal to define a set of error sequences, the processor means transforming each error sequence to a fixed dimension error feature vector independent of the number of observation frames, and based on the error models, the processor means computing a score for the error feature vector.
- Dependent claims 25 and 26:
  - 25. The apparatus of claim 24 wherein each error model explicitly measures and quantitatively represents temporal and spatial correlations between the synthetic segments and a subject speech signal, the temporal and spatial correlations being between acoustic attributes in the observation frames of the subject speech signal.
  - 26. The apparatus of claim 25 wherein the temporal and spatial correlations are between different acoustic attributes in different observation frames of the subject speech signal.
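
One way to picture the correlations named in claims 25 and 26 is as the off-diagonal entries of a covariance matrix estimated from fixed-dimension error feature vectors. The sketch below assumes each vector concatenates attribute errors from different portions of a segment, so that an entry `cov[i][j]` with `i != j` relates different attributes, different portions in time, or both. The estimator and the names `estimate_error_model` and `correlation` are hypothetical and are not taken from the patent text.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def estimate_error_model(vectors: List[Vector]) -> Tuple[Vector, Matrix]:
    """Sample mean and unbiased sample covariance of error feature vectors."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two vectors to estimate a covariance")
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    cov = [
        [
            sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / (n - 1)
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return mean, cov


def correlation(cov: Matrix, i: int, j: int) -> float:
    """Correlation coefficient between components i and j of the error vector."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])


# Hypothetical training error vectors: component 1 is always twice component 0.
training = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
mean, cov = estimate_error_model(training)
rho = correlation(cov, 0, 1)
```

Pooling error statistics over contexts, as the abstract describes, could be approximated in this sketch by passing the error vectors from several contexts to a single call of `estimate_error_model`.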

Current Assignee: Massachusetts Institute of Technology
Original Assignee: Massachusetts Institute of Technology
Inventors: Glass, James R.; Goldenthal, William D.
Primary Examiner(s): MacDonald, Allen R.
Assistant Examiner(s): SMITS, TALIVALDIS IVARS
Application Number: US08/293,584
Time in Patent Office: 981 Days
Field of Search: 395/2.55, 395/2.6, 395/2.64, 395/2.62, 395/2.46, 395/2.48, 395/2.49, 395/2.5, 395/2.51
US Class Current: 704/254
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/148   Duration modelling in HMMs,...
G10L 2015/025   Phonemes, fenemes or fenone...
