Segment-based apparatus and method for speech recognition by analyzing multiple speech unit frames and modeling both temporal and spatial correlation
Abstract
Phonetic recognition is provided by capturing dynamical behavior and statistical dependencies of the acoustic attributes used to represent a subject speech waveform. A segment-based framework is employed. Temporal behavior is modelled explicitly by creating dynamic templates, called tracks, of the acoustic attributes used to represent the speech waveform, and by generating the estimation of the acoustic spatio-temporal correlation structure. An error model represents this estimation as the temporal and spatial correlations between the input speech waveform and the track-generated speech segment. Models incorporating these two components (track and error estimation) are created for both phonetic units and for phonetic transitions. Phonetic contextual influences are accounted for by merging context-dependent tracks and pooling error statistics over the different contexts. This allows for a large number of contextual models without compromising the robustness of the statistical parameter estimates. The transition models also supply contextual information.
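As a rough illustration of the track and error-sequence idea in the abstract, the sketch below stretches a stored track to the length of an observed segment and subtracts it frame by frame. The function names `resample_track` and `error_sequence` are hypothetical, and linear interpolation is an assumption made here for illustration; the abstract does not state how a track is mapped onto a segment of a given duration.

```python
def resample_track(track, n_frames):
    """Linearly interpolate a track (list of attribute vectors) to n_frames frames.

    Assumption: linear time-warping; the source does not specify the mapping.
    """
    m = len(track)
    if n_frames == 1:
        return [list(track[0])]
    out = []
    for i in range(n_frames):
        pos = i * (m - 1) / (n_frames - 1)   # fractional index into the track
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(track[lo], track[hi])])
    return out


def error_sequence(frames, track):
    """Return per-frame differences between observed frames and the synthetic segment."""
    synthetic = resample_track(track, len(frames))
    return [[o - s for o, s in zip(obs, syn)]
            for obs, syn in zip(frames, synthetic)]


# Hypothetical one-dimensional example: a 3-point track compared with 5 observed frames.
track = [[0.0], [2.0], [4.0]]
frames = [[1.0], [1.0], [2.0], [3.0], [5.0]]
errors = error_sequence(frames, track)
```

The error sequence has one vector per observation frame, so its length varies with segment duration; an error model would then be applied to these residuals rather than to the raw frames.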
26 Claims
1. In a digital processor, speech recognition apparatus for decoding an input speech signal to a corresponding speech unit, the apparatus comprising:
a source providing an input speech signal formed of multiple observation frames;
a plurality of unit templates, each unit template for representing acoustic attributes of a respective speech unit and each unit template generating a respective synthetic segment indicative of the respective speech unit;
a plurality of error models associated with the unit templates, each unit template having an error model for explicitly measuring and quantitatively representing temporal and spatial correlations between the synthetic segments and a subject speech signal, the temporal and spatial correlations being between acoustic attributes in the observation frames of the subject speech signal; and
processor means coupled to the unit templates and error models and coupled to the source to receive the input speech signal, the processor means comparing the synthetic segments to different plural observation frames of the input speech signal to define a set of error sequences and based on the error models, the processor means analyzing the error sequences and determining the corresponding speech unit of the input speech signal.
Dependent claims: 2-11.
12. In a digital processor, a method for decoding an input speech signal to a corresponding speech unit comprising the steps of:
providing an input speech signal formed of multiple observation frames;
providing a plurality of unit templates in stored memory of the digital processor, each unit template for representing acoustic attributes of a respective speech unit and for generating a respective target speech unit;
providing a plurality of error models associated with the unit templates in stored memory, each unit template having an error model for explicitly measuring and quantitatively representing temporal and spatial correlations between the synthetic segments and a subject speech signal, the temporal and spatial correlations being between acoustic attributes in the observation frames of the subject speech signal;
receiving the input speech signal in working memory of the digital processor;
comparing the target speech units with different plural observation frames of the input speech signal in working memory such that the comparison defines a set of error sequences in working memory; and
using the error models, analyzing the error sequences and determining the corresponding speech unit of the input speech signal.
Dependent claims: 13-23.
24. In a digital processor, speech recognition apparatus for decoding an input speech signal to a corresponding speech unit, the apparatus comprising:
a source providing an input speech signal formed of multiple observation frames;
a plurality of unit templates, each unit template for representing acoustic attributes of a respective speech unit and each unit template generating a respective synthetic segment indicative of the respective speech unit;
a plurality of error models associated with the unit templates, each unit template having an error model; and
processor means coupled to the unit templates and error models and coupled to the source to receive the input speech signal, the processor means comparing the synthetic segments to different plural observation frames of the input speech signal to define a set of error sequences, the processor means transforming each error sequence to a fixed dimension error feature vector independent of the number of observation frames, and based on the error models, the processor means computing a score for the error feature vector.
Dependent claims: 25, 26.
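The last element of claim 24 (a fixed-dimension error feature vector, then a score) can be sketched as follows. This is only one possible reading: averaging the error sequence over a fixed number of equal-length regions and scoring with a full-covariance Gaussian are assumptions made here, and the names `fixed_dim_feature`, `cholesky`, and `log_gaussian` are hypothetical. The claim itself does not specify the transformation or the form of the score.

```python
import math


def fixed_dim_feature(errors, n_regions=3):
    """Average an error sequence over n_regions time regions and concatenate.

    The output length is n_regions * dim regardless of how many frames came in.
    Assumption: regional averaging; the claim only requires a fixed dimension.
    """
    n, dim = len(errors), len(errors[0])
    feature = []
    for r in range(n_regions):
        start = r * n // n_regions
        end = max((r + 1) * n // n_regions, start + 1)  # at least one frame per region
        chunk = errors[start:end]
        feature.extend(sum(e[d] for e in chunk) / len(chunk) for d in range(dim))
    return feature


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def log_gaussian(x, mean, cov):
    """Log-density of x under a Gaussian with full covariance.

    Off-diagonal covariance terms are what let the score reflect correlations
    across regions (time) and across attributes.
    """
    low = cholesky(cov)
    d = [xi - mi for xi, mi in zip(x, mean)]
    z = []  # forward substitution: solve low * z = d
    for i in range(len(d)):
        z.append((d[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(len(d)))
    return -0.5 * (len(d) * math.log(2 * math.pi) + log_det + sum(v * v for v in z))
```

Under this sketch, a decoder would compute one such score per candidate unit template and compare the scores; the mean and covariance would be estimated from training error vectors, which is outside what this example shows.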
Specification