Method and apparatus for continuous word string recognition
Abstract
A speech recognition method and apparatus for recognizing word strings in a continuous audio signal are disclosed. The word strings are made up of a plurality of elements, and each element, generally a word, is represented by an element template defined by a plurality of target patterns. Each target pattern is represented by a plurality of statistics describing the expected behavior of a group of spectra selected from plural short-term spectra generated by processing of the incoming audio. Each target pattern has associated therewith at least one required dwell time position and at least one optional dwell time position. The number of required dwell time positions and the sum of the required and optional dwell time positions define, in effect, the lower and upper limits, respectively, of a time interval during which a given target pattern can be said to match an incoming sequence of frame patterns. The incoming audio spectra are processed to enhance the separation between the spectral pattern classes during later analysis. The processed audio spectra are grouped into multi-frame spectral patterns and are compared, using likelihood statistics, with the target patterns of the element templates. The multi-frame pattern inputs occur at a frame rate that requires each keyword target pattern to correspond to at least two of the multi-frame patterns, and each input is forced to contribute to each of a plurality of pattern scores as represented by the element templates. The contributions of said multi-frame pattern inputs to said pattern scores are controlled, in part, by said required and optional dwell time constraints. A concatenation technique is employed, using dynamic programming techniques, to determine the correct identity of the word string.
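As an illustration of the dwell time limits described in the abstract, the following minimal Python sketch computes the shortest and longest run of frames that a single target pattern, and a whole element template, could match. The names `TargetPattern`, `dwell_limits`, and `template_limits` are hypothetical and do not come from the patent, and the sketch assumes that the target patterns of a template are matched one after another with no gaps between them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TargetPattern:
    """Hypothetical container for one target pattern's dwell time counts."""
    required: int  # number of required dwell time positions
    optional: int  # number of optional dwell time positions

    def dwell_limits(self) -> Tuple[int, int]:
        # Lower limit: the required positions only.
        # Upper limit: required plus optional positions.
        return self.required, self.required + self.optional


def template_limits(patterns: List[TargetPattern]) -> Tuple[int, int]:
    """Frame-count limits for a template, assuming its patterns are
    matched back to back (an assumption of this sketch)."""
    shortest = sum(p.required for p in patterns)
    longest = sum(p.required + p.optional for p in patterns)
    return shortest, longest


# Illustrative values only, not taken from the patent.
word = [TargetPattern(required=2, optional=1),
        TargetPattern(required=3, optional=2)]
print(word[0].dwell_limits())   # (2, 3)
print(template_limits(word))    # (5, 8)
```

Under these assumptions, a template is rejected outright whenever the candidate stretch of audio is shorter than the sum of the required positions or longer than the sum of all positions.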
83 Citations
16 Claims
1. In a speech analysis apparatus for recognizing at least one keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, each target pattern representing at least two short-term power spectra, and each target pattern having associated therewith at least two required dwell time positions and at least one optional dwell time position, the recognition method comprising the steps of:
- forming at a repetitive frame time, a sequence of input frame patterns from and representing said audio signal, each frame pattern being associated with a frame time, successive frame patterns corresponding to successive dwell time positions,
- generating a numerical measure of the similarity of each said frame pattern with each of said target patterns,
- accumulating for each said target pattern required dwell time position and each said target pattern optional dwell time position, and using said numerical measure of the similarity of the just formed frame pattern and said each target pattern, a numerical value representing the alignment of the just formed frame pattern with the respective target pattern dwell time position, and
- generating a recognition decision, based upon said numerical values, when a predetermined sequence occurs in said audio signal.
Dependent claims: 2, 3, 4, 5, 6
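The accumulating step of claim 1 can be pictured as a frame-synchronous dynamic programming update over dwell time positions. The sketch below is one possible reading, not the patented implementation. It assumes a cost-style similarity measure (lower is better), that a keyword may start at any frame, that a target pattern may be left only after all of its required positions have been visited, and that the recognition decision is a simple threshold test. The names `accumulate` and `detect` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def accumulate(costs: Sequence[Sequence[float]],
               dwell: Sequence[Tuple[int, int]]) -> List[float]:
    """costs[t][k]: dissimilarity between frame t and target pattern k.
    dwell[k]: (required, optional) dwell time positions of pattern k,
    with required >= 1.
    Returns, per frame, the best accumulated cost of an alignment of the
    whole template that ends at that frame (inf if none exists)."""
    sizes = [r + o for r, o in dwell]
    # acc[k][d]: best accumulated cost with the newest frame placed at
    # dwell position d (0-based) of pattern k.
    acc = [[math.inf] * n for n in sizes]
    ends: List[float] = []
    for frame in costs:
        new = [[math.inf] * n for n in sizes]
        for k in range(len(dwell)):
            if k == 0:
                entry = 0.0  # assumption: a keyword may begin at any frame
            else:
                prev_required = dwell[k - 1][0]
                # Leave the previous pattern only from its last required
                # position or from any optional position.
                entry = min(acc[k - 1][prev_required - 1:])
            new[k][0] = entry + frame[k]
            for d in range(1, sizes[k]):
                # Later positions are reached from the preceding position
                # of the same pattern, one frame earlier.
                new[k][d] = acc[k][d - 1] + frame[k]
        acc = new
        last_required = dwell[-1][0]
        ends.append(min(acc[-1][last_required - 1:]))
    return ends


def detect(ends: Sequence[float], threshold: float) -> List[int]:
    """Frames at which the accumulated cost is within a threshold
    (a stand-in for the recognition decision; an assumption)."""
    return [t for t, c in enumerate(ends) if c <= threshold]


# Two target patterns, each with 2 required and 1 optional position.
# Frames 0-1 resemble pattern 0 and frames 2-3 resemble pattern 1.
costs = [[0, 1], [0, 1], [1, 0], [1, 0]]
ends = accumulate(costs, [(2, 1), (2, 1)])
print(detect(ends, threshold=0.5))  # [3]
```

Because every dwell position of every target pattern is updated at each frame, each incoming frame contributes to all pattern scores, and the dwell position bookkeeping alone decides which alignments remain admissible.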
7. An apparatus for recognizing at least one keyword in an audio speech signal, each keyword being characterized by a template having at least one target pattern, each pattern representing at least two short term power spectra, and each target pattern having associated therewith at least two required dwell time positions and at least one optional dwell time position, the recognition apparatus comprising,
- means for forming, at a repetitive frame time rate, a sequence of input frame patterns from, and representing, said audio signal, each frame pattern corresponding to a said frame time, and successive frame patterns corresponding to successive dwell time positions,
- means for generating a numerical measure of the similarity of each said frame pattern with each of said target patterns,
- means for accumulating, for each said target pattern required dwell time position and each said target pattern optional dwell time position, and using said numerical measure of the similarity of the just formed frame pattern and said each target pattern, a numerical value representing the alignment of the just formed audio representing frame pattern with the respective target pattern dwell time position, and
- means for generating a recognition decision, based upon the accumulated numerical values, when a predetermined sequence occurs in said audio signal.
15. In a speech analysis apparatus for recognizing at least one keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, each target pattern representing at least two short-term power spectra, and each target pattern having associated therewith at least two required dwell time positions and at least one optional dwell time position, said dwell time positions defining the limits during which a said target pattern can match an incoming sequence of frame patterns, a method for forming said target patterns representing said keywords comprising the steps of:
- dividing an incoming audio signal corresponding to a keyword into a plurality of subintervals,
- forcing each subinterval to correspond to a unique target pattern,
- repeating said dividing and forcing steps upon a plurality of audio input signals representing the same keyword,
- generating statistics describing the target pattern associated with each subinterval, and
- making a second pass through said audio input signals representing said keyword, using said assembled statistics, for providing machine generated subintervals for said keywords.
Dependent claims: 16
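The two-pass training procedure of claim 15 can be sketched as follows. This is a simplified illustration under several assumptions that the claim does not state: each frame is reduced to a single scalar feature, the first pass divides each utterance into equal subintervals, the statistics are plain means, and the second pass re-segments each utterance by minimizing squared distance to those means with dynamic programming. All function names are hypothetical.

```python
from typing import List, Sequence


def uniform_bounds(n_frames: int, n_parts: int) -> List[int]:
    """First pass: boundaries of n_parts near-equal subintervals
    (assumes n_frames >= n_parts so that none is empty)."""
    return [i * n_frames // n_parts for i in range(n_parts + 1)]


def pattern_means(utterances: Sequence[Sequence[float]],
                  bounds_list: Sequence[Sequence[int]],
                  n_parts: int) -> List[float]:
    """Pool the frames assigned to each subinterval over all utterances
    of the keyword and return one mean per target pattern."""
    sums = [0.0] * n_parts
    counts = [0] * n_parts
    for frames, bounds in zip(utterances, bounds_list):
        for k in range(n_parts):
            segment = frames[bounds[k]:bounds[k + 1]]
            sums[k] += sum(segment)
            counts[k] += len(segment)
    return [s / c for s, c in zip(sums, counts)]


def resegment(frames: Sequence[float], means: Sequence[float]) -> List[int]:
    """Second pass: machine-generated boundaries that minimize the total
    squared distance of each frame to the mean of its target pattern."""
    n, p = len(frames), len(means)
    inf = float("inf")
    # best[k][t]: lowest cost of covering frames[:t] with the first k
    # patterns, each pattern receiving at least one frame.
    best = [[inf] * (n + 1) for _ in range(p + 1)]
    back = [[0] * (n + 1) for _ in range(p + 1)]
    best[0][0] = 0.0
    for k in range(1, p + 1):
        for t in range(k, n + 1):
            cost = 0.0
            for s in range(t - 1, k - 2, -1):  # pattern k-1 gets frames[s:t]
                cost += (frames[s] - means[k - 1]) ** 2
                candidate = best[k - 1][s] + cost
                if candidate < best[k][t]:
                    best[k][t] = candidate
                    back[k][t] = s
    bounds = [n]
    t = n
    for k in range(p, 0, -1):  # trace the boundaries back from the end
        t = back[k][t]
        bounds.append(t)
    return bounds[::-1]


# Two illustrative utterances of the same keyword, two target patterns.
utterances = [[0, 0, 0, 0, 5, 5], [0, 0, 5, 5, 5, 5]]
first_pass = [uniform_bounds(len(u), 2) for u in utterances]
means = pattern_means(utterances, first_pass, 2)
second_pass = [resegment(u, means) for u in utterances]
print(first_pass)   # [[0, 3, 6], [0, 3, 6]]
print(second_pass)  # [[0, 4, 6], [0, 2, 6]]
```

In this toy example the uniform first pass places every boundary at frame 3, while the second pass moves each boundary to the point where the feature value actually changes. The statistics could then be recomputed from the machine-generated subintervals.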
Specification