Continuous speech recognition method
Abstract
A speech recognition method for detecting and recognizing one or more keywords in a continuous audio signal is disclosed. Each keyword is represented by a keyword template representing one or more target patterns, and each target pattern comprises statistics of each of at least one spectrum selected from plural short-term spectra generated according to a predetermined system for processing of the incoming audio. The spectra are processed by a frequency equalization and normalizing method to enhance the separation between the spectral pattern classes during later analysis. The processed audio spectra are grouped into spectral patterns, are transformed to reduce dimensionality of the patterns, and are compared by means of likelihood statistics with the target patterns of the keyword templates. A concatenation technique employing a loosely set detection threshold makes it very unlikely that a correct pattern will be rejected.
14 Claims
1. In a speech analysis system for recognizing at least one predetermined keyword in an audio signal, each said keyword being characterized by a template having at least one target pattern, said target patterns having an ordered sequence and each target pattern representing at least one short term power spectrum, an analysis method comprising the steps of
repeatedly evaluating electrical signals representing a set of parameters determining a short-term power spectrum of said audio signal within each of a plurality of equal duration sampling intervals, thereby to generate a continuous time ordered sequence of short-term audio power spectrum frames,
repeatedly generating electrical signals representing a peak spectrum corresponding to said short-term power spectrum frames by a fast attack, slow decay peak detecting function, and
for each short-term power spectrum frame, dividing the amplitude of each frequency band by the corresponding intensity value in the corresponding peak spectrum, thereby to generate a frequency band equalized spectrum frame corresponding to a compensated audio signal having the same maximum short-term energy content in each of the frequency bands comprising the frame, and
identifying electrical signals representing a candidate keyword template when said selected multi-frame patterns correspond respectively to the target patterns of a said keyword template.
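The peak-tracking and band-equalization steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `equalize_frames`, the `decay` constant, the `floor` guard, and the exact peak update rule (jump up at once when a band exceeds its peak, otherwise decay geometrically) are assumptions chosen for illustration.

```python
def equalize_frames(frames, decay=0.99, floor=1e-9):
    """Divide each band of each frame by a running per-band peak.

    frames: list of power spectrum frames, each a list of band amplitudes.
    decay:  hypothetical per-frame decay factor for the peak (slow decay).
    floor:  hypothetical guard against division by zero.
    """
    peak = None
    equalized = []
    for frame in frames:
        if peak is None:
            # The first frame initializes the peak spectrum.
            peak = list(frame)
        else:
            # Fast attack: adopt a larger band value immediately.
            # Slow decay: otherwise let the stored peak shrink by `decay`.
            peak = [max(x, p * decay) for x, p in zip(frame, peak)]
        # Equalize: each band is expressed relative to its own peak.
        equalized.append([x / max(p, floor) for x, p in zip(frame, peak)])
    return equalized


if __name__ == "__main__":
    demo = [[4.0, 1.0], [1.0, 2.0], [1.0, 0.5]]
    for row in equalize_frames(demo, decay=0.5):
        print(row)
```

With this rule every band reaches a maximum equalized value of 1.0 whenever it sets a new peak, which is one way to read "the same maximum short-term energy content in each of the frequency bands".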
3. In a speech analysis system in which an audio signal is spectrum analyzed for recognizing at least one predetermined keyword in a continuous audio signal, each said keyword being characterized by a template having at least one target pattern, said target patterns having an ordered sequence and each target pattern representing a plurality of short-term power spectra spaced apart in real time, an analysis method comprising the steps of
repeatedly evaluating electrical signals representing a set of parameters determining a short-term power spectrum of said audio signal with each of a plurality of equal duration sampling intervals, thereby to generate a continuous time ordered sequence of short-term audio power spectrum frames,
repeatedly generating electrical signals representing a peak spectrum corresponding to said short-term power spectrum frames by a fast attack, slow decay peak detecting function,
for each short-term power spectrum frame, dividing the amplitude of each frequency band by the corresponding intensity value in the corresponding peak spectrum, thereby to generate a frequency band equalized spectrum frame corresponding to a compensated audio signal having the same maximum short-term energy content in each of the frequency bands comprising the spectrum,
repeatedly selecting from said sequence of equalized frames, one first frame and at least one later occurring frame to form a multi-frame pattern,
comparing each thus formed multi-frame pattern with each first target pattern of each keyword template,
deciding whether each said multi-frame pattern corresponds to a said first target pattern of a keyword template,
for each multi-frame pattern which, according to said deciding step, corresponds to a said first target pattern of a potential candidate keyword, selecting later occurring short-term power spectrum equalized frames to form later occurring multi-frame patterns,
deciding whether said later occurring multi-frame patterns correspond respectively to successive target patterns of said potential candidate keyword template, and
identifying electrical signals representing a candidate keyword template when said selected multi-frame patterns correspond respectively to the target patterns of a said keyword template.
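The pattern-forming and sequential deciding steps of claim 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed method itself: the helper names (`make_pattern`, `matches`, `find_keyword`), the fixed frame `spacing`, the search `window`, and the squared-distance `threshold` test are all hypothetical. In particular, the squared-distance test stands in for the likelihood-based decision described in the abstract.

```python
def make_pattern(frames, start, spacing, count):
    """Concatenate `count` frames taken every `spacing` frames from `start`."""
    indices = [start + k * spacing for k in range(count)]
    if indices[-1] >= len(frames):
        return None  # not enough later frames to complete the pattern
    return [value for i in indices for value in frames[i]]


def matches(pattern, target, threshold):
    """Hypothetical decision rule: squared distance within a loose threshold."""
    distance = sum((a - b) ** 2 for a, b in zip(pattern, target))
    return distance <= threshold


def find_keyword(frames, template, spacing=1, count=2, window=3, threshold=0.1):
    """Return start indices where all target patterns match in order."""
    hits = []
    for start in range(len(frames)):
        position, complete = start, True
        for n, target in enumerate(template):
            # The first target must match at `start`; each later target may
            # match at any of the next `window` frame positions.
            if n == 0:
                candidates = [position]
            else:
                candidates = range(position + 1, position + 1 + window)
            found = None
            for c in candidates:
                pattern = make_pattern(frames, c, spacing, count)
                if pattern is not None and matches(pattern, target, threshold):
                    found = c
                    break
            if found is None:
                complete = False
                break
            position = found
        if complete:
            hits.append(start)
    return hits


if __name__ == "__main__":
    frames = [[0.0], [1.0], [2.0], [3.0], [0.0], [0.0]]
    template = [[1.0, 2.0], [2.0, 3.0]]
    print(find_keyword(frames, template))
```

Later targets are searched only after the first target has matched, which mirrors the claim's ordering: a first-pattern decision gates the selection of later multi-frame patterns.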
5. In a pattern recognition system for identifying in a data stream electrical signals representing at least one target pattern characterized by a vector of recognition elements $x_i$, said elements having a statistical distribution, the analysis method comprising the steps of
determining from plural design set pattern samples $x$ of the target pattern a covariance matrix $K$;
determining from said plural design set patterns an expected value vector $\bar{x}$;
calculating from the covariance matrix $K$ a plurality of eigenvectors $e_i$ having eigenvalues $v_i$, where $v_i \geq v_{i+1}$;
selecting electrical signals representing unknown patterns $y$ from said data stream;
transforming the electrical signals representing each pattern $y$ into electrical signals representing a new vector $(W_1, W_2, \ldots, W_p, R)$, where
$$W_i = e_i (y - \bar{x}),$$
$p$ is a positive integer constant less than the number of elements of the pattern $y$, and $R$ is the reconstruction error statistic and equals (equation ##EQU17##, not reproduced in this text); and
deciding, by applying a likelihood statistic function to electrical signals representing said new vector $(W_1, W_2, \ldots, W_p, R)$, whether said pattern $y$ is identified with the target pattern.
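The transformation step of claim 5 can be sketched in Python as a projection onto the leading eigenvectors followed by a residual term. The formula for R is not reproduced in this text, so the form used here, the length of the part of the mean-removed pattern that the first p eigenvectors do not capture, is an assumption and may differ from the patent's equation. The function name `pca_transform` is hypothetical, and the eigenvectors are assumed to be orthonormal and supplied by the caller, already ordered by decreasing eigenvalue.

```python
import math


def pca_transform(y, mean, eigvecs):
    """Map pattern y to the vector [W_1, ..., W_p, R].

    y:       unknown pattern (list of floats).
    mean:    expected value vector from the design set.
    eigvecs: the first p eigenvectors of the covariance matrix K,
             assumed orthonormal.
    """
    # Remove the design-set mean from the pattern.
    d = [a - b for a, b in zip(y, mean)]
    # W_i is the projection of (y - mean) onto eigenvector e_i.
    w = [sum(e_k * d_k for e_k, d_k in zip(e, d)) for e in eigvecs]
    # ASSUMED form of R: energy of (y - mean) left over after the p
    # projections, reported as a length. Clamped at zero against rounding.
    residual = sum(v * v for v in d) - sum(v * v for v in w)
    r = math.sqrt(max(residual, 0.0))
    return w + [r]


if __name__ == "__main__":
    vec = pca_transform(
        [2.0, 3.0, 5.0],
        [1.0, 1.0, 1.0],
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    )
    print(vec)
```

The likelihood-based deciding step is left out of the sketch; it would take the returned vector as its input.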
8. In a speech analysis system for recognizing at least one predetermined keyword in a continuous audio signal, each said keyword being characterized by a template having at least one target pattern, said target patterns having an ordered sequence and each target pattern representing a plurality of short-term power spectra spaced apart in real time, an analysis method comprising the steps of
for each target pattern, determining from electrical signals representing plural design set pattern samples $x$ of a said target pattern having elements $x_i$, a covariance matrix $K$;
determining from said plural design set patterns an expected value vector $\bar{x}$;
calculating from the covariance matrix $K$ a plurality of eigenvectors $e_i$ having eigenvalues $v_i$, where $v_i \geq v_{i+1}$;
repeatedly evaluating electrical signals representing a set of parameters determining a short-term power spectrum of said audio signal within each of a plurality of equal duration sampling intervals, thereby to generate a continuous time ordered sequence of short-term audio power spectrum frames;
repeatedly selecting from said sequence of frames, electrical signals representing one first frame and at least one later occurring frame to form a multi-frame pattern $y$;
transforming the electrical signals representing each multi-frame pattern $y$ into electrical signals representing new vectors $W$, represented as $(W_1, W_2, \ldots, W_p, R)$, where
$$W_i = e_i (y - \bar{x});$$
$p$ is a positive integer constant less than the number of elements of the pattern $y$, and $R$ is the reconstruction error statistic and equals (equation ##EQU20##, not reproduced in this text);
deciding whether each said transformed pattern corresponds to a said first target pattern of a keyword template;
for each pattern which, according to said deciding step, corresponds to a said first target pattern of a potential candidate keyword, selecting later occurring short-term power spectra to form later occurring multi-frame patterns;
deciding whether said later occurring multi-frame patterns correspond respectively to successive target patterns of said potential candidate keyword template; and
identifying electrical signals representing a candidate keyword template when said selected multi-frame patterns correspond respectively to the target patterns of a said keyword template.
13. In a pattern recognition system for identifying in a data stream electrical signals representing at least one target pattern characterized by a vector of recognition elements $x_i$, said elements having a statistical distribution, the analysis method comprising the steps of
selecting electrical signals representing unknown patterns $y$ from said data stream;
transforming the electrical signals representing each pattern $y$ into electrical signals representing a new vector $(W_1, W_2, \ldots, W_p, R)$, where the $W_i$ represent elements of a vector obtained in a principal component analysis, $p$ is a positive integer constant less than the number of elements of the pattern $y$, and $R$ is a reconstruction error statistic representing information at least in part not contained in said elements $W_i$; and
deciding, by applying a likelihood statistic function to electrical signals representing said new vector $(W_1, W_2, \ldots, W_p, R)$, whether said pattern $y$ is identified with the target pattern.
14. In a speech analysis system for recognizing at least one predetermined keyword in a continuous audio signal, each said keyword being characterized by a template having at least one target pattern, said target patterns having an ordered sequence and each target pattern representing a plurality of short-term power spectra spaced apart in real time, an analysis method comprising the steps of
for each target pattern, determining from electrical signals representing plural design set pattern samples $x$ of a said target pattern and said samples $x$ having elements $x_i$;
repeatedly evaluating electrical signals representing a set of parameters determining a short-term power spectrum of said audio signal within each of a plurality of equal duration sampling intervals, thereby to generate a continuous time ordered sequence of short-term audio power spectrum frames;
repeatedly selecting from said sequence of frames, electrical signals representing one first frame and at least one later occurring frame to form a multi-frame pattern $y$;
transforming the electrical signals representing each multi-frame pattern $y$ into electrical signals representing new vectors $W$, represented as $(W_1, W_2, \ldots, W_p, R)$, where $W_i$ represent elements of a vector obtained in a principal component analysis, $p$ is a positive integer constant less than the number of elements of the pattern $y$, and $R$ is a reconstruction error statistic representing at least in part information not contained in said elements $W_i$;
deciding whether each said transformed pattern corresponds to a said first target pattern of a keyword template;
for each pattern which, according to said deciding step, corresponds to a said first target pattern of a potential candidate keyword, selecting later occurring short-term power spectra to form later occurring multi-frame patterns;
deciding whether said later occurring multi-frame patterns correspond respectively to successive target patterns of said potential candidate keyword template; and
identifying electrical signals representing a candidate keyword template when said selected multi-frame patterns correspond respectively to the target patterns of a said keyword template.
Specification