Speech recognition method and apparatus
First Claim
1. In a speech analysis system for recognizing at least one predetermined keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, and each target pattern representing at least one short-term power spectrum, and each target pattern having a minimum dwell time duration and a maximum dwell time duration, the method comprising the steps of:
- forming at a repetitive frame rate, a sequence of frame patterns from and representing said audio signal, each frame pattern being associated with a frame time, said frame rate corresponding to a frame interval less than one-half the minimum dwell time duration,
- generating, for each frame pattern, a numerical measure of the similarity of said each frame pattern with ones of said target patterns,
- accumulating, for each frame time and each keyword, and using said numerical measures and said minimum and maximum dwell times, a numerical word score representing the likelihood that a said keyword ended at a said frame time,
- said accumulating step including the step of accumulating, for each keyword, the numerical measures for each of a continuous sequence of said repetitively formed frame patterns, starting with the numerical measure of the similarity of a present frame pattern and a last target pattern of said keyword, and
- generating at least a preliminary keyword recognition decision whenever the numerical word score for a keyword exceeds a predetermined recognition level.
Abstract
A speech recognition method and apparatus for detecting and recognizing one or more keywords in a continuous audio signal are disclosed. Each keyword is represented by a keyword template which corresponds to a sequence of plural target patterns, and each target pattern comprises statistics representing each of a plurality of spectra selected from plural short-term spectra generated according to a predetermined system for processing the incoming audio. The target patterns also have associated therewith minimum and maximum dwell times. The dwell time is the time interval during which a given target pattern can be said to match incoming frame patterns. The spectra are processed to enhance the separation between the spectral pattern classes during later analysis. The processed audio spectra are grouped into multi-frame spectral patterns and each multi-frame spectral pattern is compared by means of likelihood statistics with the target patterns of keyword templates. Each formed multi-frame pattern is then forced to contribute to the total word score for each keyword as represented by the keyword template. Thus the keyword recognition method requires all input patterns to contribute to the word score of a keyword candidate, using the minimum and maximum dwell times to test whether a target pattern can still match an input pattern; the frame interval of the audio spectra must be less than one-half the minimum dwell time of a target pattern. A concatenation technique employing a loosely set detection threshold makes it very unlikely that a correct pattern will be rejected. A method for forming the target patterns is also described.
64 Citations
16 Claims
1. In a speech analysis system for recognizing at least one predetermined keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, and each target pattern representing at least one short-term power spectrum, and each target pattern having a minimum dwell time duration and a maximum dwell time duration, the method comprising the steps of:
- forming at a repetitive frame rate, a sequence of frame patterns from and representing said audio signal, each frame pattern being associated with a frame time, said frame rate corresponding to a frame interval less than one-half the minimum dwell time duration,
- generating, for each frame pattern, a numerical measure of the similarity of said each frame pattern with ones of said target patterns,
- accumulating, for each frame time and each keyword, and using said numerical measures and said minimum and maximum dwell times, a numerical word score representing the likelihood that a said keyword ended at a said frame time,
- said accumulating step including the step of accumulating, for each keyword, the numerical measures for each of a continuous sequence of said repetitively formed frame patterns, starting with the numerical measure of the similarity of a present frame pattern and a last target pattern of said keyword, and
- generating at least a preliminary keyword recognition decision whenever the numerical word score for a keyword exceeds a predetermined recognition level.
View Dependent Claims (2, 3, 4)
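The accumulation recited in claim 1 can be illustrated with a minimal sketch; it is not the patented implementation. It assumes similarity measures where higher means a better match, dwell times expressed as whole frame counts, and an unnormalized sum as the word score. The names `frame_interval_ok`, `word_score`, and `detect` are hypothetical.

```python
from functools import lru_cache


def frame_interval_ok(frame_interval, min_dwell):
    """Claim 1 constraint: frame interval is less than half the minimum dwell time."""
    return frame_interval < min_dwell / 2


def word_score(sim, dwell, t_end):
    """Best accumulated score for a keyword whose last pattern ends at frame t_end.

    sim[t][p] : similarity of frame t to target pattern p (assumed: higher is better)
    dwell[p]  : (min_frames, max_frames) for target pattern p (assumed frame counts)
    Returns None when no alignment within the dwell limits fits before t_end.
    """
    @lru_cache(maxsize=None)
    def best(p, t):
        # Best score for patterns 0..p with pattern p ending exactly at frame t.
        if t < 0:
            return None
        lo, hi = dwell[p]
        result, acc = None, 0.0
        # Walk backward from the present frame, one more frame per dwell length d.
        for d in range(1, hi + 1):
            start = t - d + 1
            if start < 0:
                break
            acc += sim[start][p]
            if d < lo:
                continue  # minimum dwell time not yet satisfied
            if p == 0:
                cand = acc  # the first pattern may begin at any frame
            else:
                prev = best(p - 1, start - 1)
                if prev is None:
                    continue
                cand = acc + prev
            if result is None or cand > result:
                result = cand
        return result

    return best(len(dwell) - 1, t_end)


def detect(sim, dwell, recognition_level):
    """Frame times at which the word score exceeds the recognition level."""
    hits = []
    for t in range(len(sim)):
        score = word_score(sim, dwell, t)
        if score is not None and score > recognition_level:
            hits.append(t)
    return hits
```

For example, with five frames in which the first three resemble the first target pattern and the last two resemble the second, `detect` reports a preliminary decision only at the final frame when the recognition level is set between the two highest scores.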
5. In a speech analysis system for recognizing at least one predetermined keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, and each target pattern representing at least one short-term power spectrum, and each target pattern having a minimum dwell time duration and a maximum dwell time duration, the improvement comprising
- means for forming at a repetitive frame rate, a sequence of frame patterns from and representing said audio signal, each frame pattern being associated with a frame time, said frame rate corresponding to a frame interval wherein each target pattern has associated therewith at least two frame patterns,
- means for generating, for each frame pattern, a numerical measure of the similarity of said each frame pattern with selected ones of said target patterns,
- means for accumulating, for each frame time and each keyword, and using said numerical measures, a numerical word score representing the likelihood that a said keyword ended at a said frame time,
- said accumulating means including means for accumulating, for each keyword, the numerical measure for each of a continuous sequence of said repetitively formed frame patterns, starting with the numerical measure of the similarity of a present frame pattern and a last target pattern of said keyword, and
- means for generating at least a preliminary keyword recognition decision when the numerical value for a keyword exceeds a predetermined recognition level.
9. In a speech analysis apparatus for recognizing at least one keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, each target pattern representing at least one short-term power spectrum, and each target pattern having associated therewith a plurality of sequential dwell time positions, including at least one required dwell time position and at least one optional dwell time position, the number of said required and optional dwell time positions being a measure of the minimum and maximum time duration of a target pattern, the recognition method comprising the steps of:
- forming at a repetitive frame time, a sequence of frame patterns from and representing said audio signal,
- generating a numerical measure of the similarity of each said frame pattern with each of said target patterns,
- accumulating for any target pattern second and later required dwell time position, and for each target pattern optional dwell time position, the sum of the accumulated score for the previous target pattern dwell time position during the previous frame time and the numerical measure associated with the target pattern during the present frame time,
- accumulating, for each keyword first target pattern, first required dwell time position, the sum of the score of the first dwell time position during the previous frame time, and the present numerical measure associated with the keyword first target pattern,
- accumulating, for each other target pattern first required dwell time position, the sum of the best ending accumulated score for the previous target pattern of the same keyword and the present numerical measure associated with the target pattern, and
- generating a recognition decision, based upon accumulating values of the possible word endings of the last target pattern of each keyword.
View Dependent Claims (10, 11)
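The three accumulation rules of claim 9 amount to a frame-by-frame update over dwell time positions, which the following minimal sketch illustrates for a single keyword. It is an illustration under assumptions, not the patented implementation: similarity measures are assumed to be higher for better matches, a valid ending is assumed to be the last required position or any optional position, the starting score of 0 is an assumption, and start-of-word handling and score normalization are not modelled. The names `step`, `ending_score`, and `run` are hypothetical.

```python
NEG_INF = float("-inf")


def step(scores, frame_sim, n_req, n_opt):
    """Advance all dwell time positions by one frame.

    scores[p][k] : accumulated score of pattern p, position k, at the previous frame
    frame_sim[p] : similarity of the present frame to pattern p
    n_req[p], n_opt[p] : number of required and optional positions of pattern p
    """
    new_scores = []
    for p, row in enumerate(scores):
        size = n_req[p] + n_opt[p]
        out = [NEG_INF] * size
        if p == 0:
            # First pattern, first required position: its own previous score
            # plus the present measure, as the claim recites.
            out[0] = row[0] + frame_sim[0]
        else:
            # Other patterns, first required position: best ending score of
            # the previous pattern plus the present measure.
            prev = scores[p - 1]
            out[0] = max(prev[n_req[p - 1] - 1:]) + frame_sim[p]
        # Second and later required positions, and optional positions: score of
        # the previous position at the previous frame plus the present measure.
        for k in range(1, size):
            out[k] = row[k - 1] + frame_sim[p]
        new_scores.append(out)
    return new_scores


def ending_score(scores, n_req):
    """Best score over the possible word endings of the last target pattern."""
    return max(scores[-1][n_req[-1] - 1:])


def run(sim, n_req, n_opt):
    """Return the word-ending score after each frame of sim."""
    scores = [[NEG_INF] * (r + o) for r, o in zip(n_req, n_opt)]
    scores[0][0] = 0.0  # assumed starting score
    ends = []
    for frame_sim in sim:
        scores = step(scores, frame_sim, n_req, n_opt)
        ends.append(ending_score(scores, n_req))
    return ends
```

A recognition decision could then compare each entry returned by `run` against a threshold. Because every update reads only the previous frame's scores, the new table is built separately from the old one rather than updated in place.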
12. An apparatus for recognizing at least one keyword in an audio speech signal, each keyword being characterized by a template having at least one target pattern, each pattern representing at least one short term power spectrum, and each target pattern having a plurality of sequential dwell time positions including at least one required dwell time position and at least one optional dwell time position, the number of said required and optional dwell time positions being a measure of the minimum and maximum time duration of a target pattern, the recognition apparatus comprising,
- means for forming, at a repetitive frame time rate, a sequence of frame patterns from, and representing, said audio signal,
- means for generating a numerical measure of the similarity of each said frame pattern with each of said target patterns,
- first means for accumulating for any target pattern second and later required dwell time position and each target pattern optional dwell time position, the sum of the accumulated score for the previous target pattern dwell time position during the previous frame time and the numerical measure associated with the target pattern during the present frame time,
- second means for accumulating, for each keyword first target pattern, first required dwell time position, the sum of the score of the first time position during the previous frame time and the numerical measure associated with the keyword first target pattern during the present frame time,
- third means for accumulating, for each other first target pattern, first required dwell time position, the sum of the best ending accumulated score for the previous target pattern of the same keyword and the numerical measure associated with the target pattern during the present frame time,
- means for generating a recognition decision, based upon the accumulated numerical values, when a predetermined sequence occurs in said audio signal.
15. In a speech analysis apparatus for recognizing at least one keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, each target pattern representing at least one short-term power spectrum, and each target pattern having associated therewith at least one required dwell time position and at least one optional dwell time position, the number of said required and optional dwell time positions being the measure of a minimum and maximum time duration of a target pattern, a method for forming reference patterns representing said keywords comprising the steps of:
- dividing an incoming audio signal corresponding to a keyword into a plurality of subintervals,
- matching each subinterval to a unique reference pattern,
- making a second pass through said audio input signals representing said keyword for providing machine generated subintervals for said keywords,
- determining the interval durations for each subinterval,
- repeating said steps upon a plurality of audio input signals representing the same keyword,
- generating statistics describing the reference pattern durations associated with each subinterval, and
- determining the minimum and maximum dwell times for each reference pattern from said assembled statistics.
View Dependent Claims (16)
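The final two steps of claim 15, assembling duration statistics and deriving dwell limits from them, can be sketched as follows. The claim does not say which statistic sets the limits, so this sketch simply assumes the shortest and longest observed durations; the function name `dwell_limits` and the input layout are hypothetical.

```python
from statistics import mean


def dwell_limits(segmentations):
    """Derive per-pattern dwell limits from several utterances of one keyword.

    segmentations : list of utterances; each utterance is a list of subinterval
                    durations in frames, one per reference pattern, in order.
    Returns one dict per reference pattern with the mean duration and the
    assumed limits (observed minimum and maximum).
    """
    n_patterns = len(segmentations[0])
    per_pattern = [[] for _ in range(n_patterns)]
    for durations in segmentations:
        if len(durations) != n_patterns:
            raise ValueError("every utterance needs one duration per pattern")
        for p, d in enumerate(durations):
            per_pattern[p].append(d)
    return [
        {"mean": mean(ds), "min_dwell": min(ds), "max_dwell": max(ds)}
        for ds in per_pattern
    ]
```

With three utterances segmented as `[3, 5]`, `[4, 5]`, and `[2, 7]` frames, the first reference pattern gets limits of 2 and 4 frames and the second gets 5 and 7. A different rule, such as a margin around the mean, could be substituted without changing the surrounding steps.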
Specification