Continuous speech recognition
Abstract
An improved speech recognition method and apparatus for recognizing keywords in a continuous audio signal are disclosed. The keywords, generally either a word or a string of words, are each represented by an element template defined by a plurality of target patterns. Each target pattern is represented by a plurality of statistics describing the expected behavior of a group of spectra selected from plural short-term spectra generated by processing of the incoming audio. The incoming audio spectra are processed to enhance the separation between the spectral pattern classes during later analysis. The processed audio spectra are grouped into multi-frame spectral patterns and are compared, using likelihood statistics, with the target patterns of the element templates. Each multi-frame pattern is forced to contribute to each of a plurality of pattern scores as represented by the element templates. The method and apparatus use speaker independent word models during the training stage to generate, automatically, improved target patterns. The apparatus and method further employ grammatical syntax during the training stage for identifying the beginning and ending boundaries of unknown keywords. Recognition is further improved by use of a plurality of templates representing "silence" or non-speech signals, for example, hum. Also, memory and computation load is reduced by use of modified (collapsed or folded) syntax flow graph logic, implemented by additional (augment) control numbers. A concatenation technique is employed, using dynamic programming techniques, to determine the correct identity of the word string.
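The concatenation step mentioned at the end of the abstract can be illustrated with a small dynamic programming sketch. This is not the patented procedure: the function name `best_word_string`, the `segment_cost` callback, the per-word penalty, and the toy frame labels are all assumptions chosen to show how the lowest-cost word string over a frame sequence might be found.

```python
def best_word_string(n_frames, segment_cost, vocabulary, max_len):
    """Return (words, total_cost) for the cheapest segmentation of the frames.

    segment_cost(word, start, end) is a hypothetical callback giving the cost
    of explaining frames[start:end] with `word` (lower is better).
    """
    inf = float("inf")
    best = [inf] * (n_frames + 1)   # best[t]: cheapest cost to explain frames[:t]
    back = [None] * (n_frames + 1)  # back[t]: (start, word) of the last segment
    best[0] = 0.0
    for end in range(1, n_frames + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == inf:
                continue
            for word in vocabulary:
                cost = best[start] + segment_cost(word, start, end)
                if cost < best[end]:
                    best[end] = cost
                    back[end] = (start, word)
    if best[n_frames] == inf:
        return None
    # Trace back from the final frame to recover the word string.
    words = []
    end = n_frames
    while end > 0:
        start, word = back[end]
        words.append(word)
        end = start
    words.reverse()
    return words, best[n_frames]


# Toy data: each frame carries a label; a word matches frames with its own label.
labels = list("aabbb")


def toy_cost(word, start, end):
    mismatches = sum(1 for label in labels[start:end] if label != word)
    return 0.5 + mismatches  # 0.5 is an assumed per-word penalty


result = best_word_string(len(labels), toy_cost, ["a", "b"], max_len=5)
```

The per-word penalty discourages splitting one word into many short segments; a real system would replace `toy_cost` with accumulated pattern scores.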
19 Claims
1. In a speech analysis apparatus for recognizing at least one keyword in an audio signal, a method for recognizing silence in the incoming audio signal comprising the steps of:
- generating at least first and second target templates, each template representing, as a sequence of frequency spectrum representing parameters, an alternate description of silence in said incoming audio signal,
- comparing said incoming audio signal with each of said first and second target templates,
- generating a first and a second numerical measure representing the result of said comparisons respectively, and
- deciding, based at least upon said numerical measures, whether silence has been detected.
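As an illustrative sketch only, the steps of claim 1 might look like the following. The Euclidean distance, the `threshold` parameter, the function names, and the toy templates are assumptions made for illustration; the abstract describes likelihood statistics rather than a plain distance.

```python
import math


def frame_distance(frame, template_frame):
    """Euclidean distance between two spectral parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame, template_frame)))


def template_score(frames, template):
    """Mean frame-by-frame distance to one template; lower means closer."""
    if len(frames) != len(template):
        raise ValueError("input and template must have the same number of frames")
    total = sum(frame_distance(f, t) for f, t in zip(frames, template))
    return total / len(template)


def detect_silence(frames, templates, threshold):
    """Score the input against every silence template and decide.

    Returns (is_silence, scores). Silence is declared when the best
    (smallest) score falls at or below the assumed threshold.
    """
    scores = [template_score(frames, t) for t in templates]
    return min(scores) <= threshold, scores


# Two alternate descriptions of silence: a quiet channel and a steady hum.
quiet = [[0.0, 0.0], [0.0, 0.0]]
hum = [[0.0, 1.0], [0.0, 1.0]]

hum_like_input = [[0.0, 0.9], [0.1, 1.0]]
speech_like_input = [[3.0, 4.0], [3.0, 4.0]]

hum_result = detect_silence(hum_like_input, [quiet, hum], threshold=0.5)
speech_result = detect_silence(speech_like_input, [quiet, hum], threshold=0.5)
```

With a single quiet template the hum-like input would be rejected; the second template is what lets it be classified as silence.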
2. In a speech analysis apparatus for recognizing a plurality of keywords in an audio signal, each keyword being characterized by a template having at least one target pattern and each sequence of said keywords in said audio signal being described by a grammatical syntax, said syntax being characterized by a plurality of connected decision nodes, the recognition apparatus comprising:
- means for providing a sequence of numerical scores for recognizing keywords in said audio signal employing dynamic programming,
- means for employing said grammatical syntax for determining which scores form acceptable progressions in the recognition process, and
- means for using augments to preserve acceptable progressions whereby otherwise acceptable progressions are discarded according to said syntax.
3. In a speech analysis apparatus for recognizing at least one keyword in an audio signal, a method for recognizing silence in said audio signal comprising the steps of:
- generating a numerical measure of likelihood that the present incoming audio signal portion corresponds to a reference pattern representing silence,
- effectively altering the numerical measure according to a syntax dependent determination, said syntax dependent determination representing the recognition of an immediately preceding portion of the audio signal according to a grammatical syntax, and
- determining from the effectively altered measure whether the present signal portion corresponds to silence.
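A minimal sketch of the idea in claim 3, assuming scores are costs (lower is better) and that the syntax dependent alteration is an additive value, as apparatus claim 12 words it. The state names, bias values, and the comparison against a competing keyword cost are hypothetical choices for illustration.

```python
def silence_decision(silence_cost, best_keyword_cost, previous_state, syntax_bias):
    """Alter the silence cost by a syntax-dependent value, then decide.

    syntax_bias maps the syntax state reached by the immediately preceding
    portion of the signal to a value added to the silence cost. States that
    are not listed leave the cost unchanged.
    Returns (is_silence, adjusted_cost).
    """
    adjusted = silence_cost + syntax_bias.get(previous_state, 0.0)
    return adjusted < best_keyword_cost, adjusted


# Hypothetical biases: silence is penalized in the middle of a phrase and
# favored once the syntax says a phrase may have ended.
bias = {"mid_phrase": 2.0, "end_of_phrase": -1.0}

mid = silence_decision(5.0, 6.0, "mid_phrase", bias)
end = silence_decision(5.0, 6.0, "end_of_phrase", bias)
```

The same raw silence cost leads to different decisions depending on what the syntax says was just recognized.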
4. In a speech analysis apparatus for recognizing at least one spoken keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, a method for forming reference patterns representing said spoken keywords and tailored to a speaker, comprising the steps of:
- providing speaker independent reference patterns representing said spoken keywords,
- determining beginning and ending boundaries of said keywords in audio signals spoken by said speaker using said speaker independent reference patterns, and
- training the speech analysis apparatus to said speaker using the beginning and ending boundaries determined by said apparatus for said keywords spoken by said speaker.
Dependent claims: 5
6. In a speech analysis apparatus for recognizing at least one spoken keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, a method for forming reference patterns representing a previously unknown keyword comprising the steps of:
- providing speaker independent reference patterns representing spoken keywords previously known to the apparatus,
- determining beginning and ending boundaries of said unknown keyword using said speaker independent reference patterns, and
- training the speech analysis apparatus, using the beginning and ending boundaries previously determined by said apparatus for said previously unknown keyword, to generate statistics describing said previously unknown keyword.
Dependent claims: 7, 8
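The final training step, generating statistics from frames inside already-determined boundaries, can be sketched as below. This assumes the boundaries have been found by some earlier pass and arrive as `(keyword, start, end)` tuples; the per-dimension mean and variance, the function name, and the data layout are illustrative assumptions, not the statistics the specification defines.

```python
from statistics import fmean, pvariance


def train_from_boundaries(frames, boundaries):
    """Collect frames inside each keyword's boundaries and summarize them.

    frames: list of spectral parameter vectors, one per frame.
    boundaries: iterable of (keyword, start, end) with end exclusive.
    Returns {keyword: {"mean": [...], "variance": [...]}} per dimension.
    """
    grouped = {}
    for keyword, start, end in boundaries:
        grouped.setdefault(keyword, []).extend(frames[start:end])

    stats = {}
    for keyword, vectors in grouped.items():
        dimensions = list(zip(*vectors))  # regroup values by parameter index
        stats[keyword] = {
            "mean": [fmean(d) for d in dimensions],
            "variance": [pvariance(d) for d in dimensions],
        }
    return stats


# Toy frames and boundaries; "yes" occurs twice in the utterance.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [5.0, 6.0]]
boundaries = [("yes", 0, 2), ("no", 2, 3), ("yes", 3, 4)]

stats = train_from_boundaries(frames, boundaries)
```

Pooling every occurrence of a keyword before computing statistics means poorly placed boundaries directly contaminate the result, which is why the boundary-finding step matters.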
9. In a speech analysis apparatus for recognizing a plurality of keywords in an audio signal, each keyword being characterized by a template having at least one target pattern and each sequence of said keywords in said audio signal being described by a grammatical syntax, said syntax being characterized by a plurality of connected decision nodes, the recognition method comprising the steps of:
- providing a sequence of numerical scores for recognizing keywords in said audio signal employing dynamic programming,
- employing said grammatical syntax for determining which scores form acceptable progressions in the recognition process, and
- reducing the number of decision nodes by collapsing said syntax whereby the computational load for the apparatus is reduced.
10. In a speech analysis apparatus for recognizing at least one keyword in an audio signal, apparatus for recognizing silence in the incoming audio signal comprising:
- means for generating at least first and second target templates, each template representing, as a sequence of frequency spectrum representing parameters, an alternate description of silence in said incoming audio signal,
- means for comparing said incoming audio signal with each of said first and second target templates,
- means for generating a first and a second numerical measure representing the result of said comparisons respectively, and
- means for deciding, based at least upon said numerical measures, whether silence has been detected.
11. In a speech analysis apparatus for recognizing a plurality of keywords in an audio signal, each keyword being characterized by a template having at least one target pattern and each sequence of said keywords in said audio signal being described by a grammatical syntax, said syntax being characterized by a plurality of connected decision nodes, the recognition method comprising the steps of:
- providing a sequence of numerical scores for recognizing keywords in said audio signal employing dynamic programming,
- employing said grammatical syntax for determining which scores form acceptable progressions in the recognition process, and
- using augments to preserve acceptable progressions whereby otherwise acceptable progressions are discarded according to said syntax.
12. In a speech analysis apparatus for recognizing at least one keyword in an audio signal, apparatus for recognizing silence in said audio signal comprising:
- means for generating a numerical measure of likelihood that the present incoming audio signal portion corresponds to a reference pattern representing silence,
- means for adding to the numerical measure a syntax dependent numerical value to form a score, said syntax dependent value representing the recognition of an immediately preceding portion of the audio signal according to a grammatical syntax, and
- means for determining from the score whether the present signal portion corresponds to silence.
13. In a speech analysis apparatus for recognizing at least one spoken keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, apparatus for forming reference patterns representing said spoken keywords and tailored to a speaker comprising:
- means for providing speaker independent reference patterns representing said spoken keywords,
- means for determining beginning and ending boundaries of said keywords in audio signals spoken by said speaker using said speaker independent reference patterns, and
- means for training the speech analysis apparatus to said speaker using the beginning and ending boundaries determined by said apparatus for said keywords spoken by said speaker.
Dependent claims: 14
15. In a speech analysis apparatus for recognizing at least one spoken keyword in an audio signal, each keyword being characterized by a template having at least one target pattern, apparatus for forming reference patterns representing a previously unknown keyword comprising:
- means for providing speaker independent reference patterns representing spoken keywords previously known to the apparatus,
- means for determining beginning and ending boundaries of said unknown keyword using said speaker independent reference patterns, and
- means for training the speech analysis apparatus using the beginning and ending boundaries previously determined by said apparatus for said unknown keyword to generate statistics describing said previously unknown keyword.
Dependent claims: 16, 18
17. means for providing an audio signal representing said unknown keyword spoken by said speaker in isolation.
19. In a speech analysis apparatus for recognizing a plurality of keywords in an audio signal, each keyword being characterized by a template having at least one target pattern and each sequence of said keywords in said audio signal being described by a grammatical syntax, said syntax being characterized by a plurality of connected decision nodes, the recognition apparatus comprising:
- means for providing a sequence of numerical scores for recognizing keywords in said audio signal employing dynamic programming,
- means for employing said grammatical syntax for determining which scores form acceptable progressions in the recognition process, and
- means for reducing the number of decision nodes whereby the computational load for the apparatus is reduced.
Specification