Speech recognition arrangement
Abstract
A speech recognition arrangement where nonspectral features of the input signal, e.g., energy and voicing parameters, are used to effectively remove non-speech events from consideration but only after a time warping procedure based solely on the input and reference pattern spectral parameters has been completed. The time warping procedure is not unduly complex because there is no need to weight spectral and nonspectral parameters in matching input and reference patterns. For each reference pattern, the time warping procedure defines a scan region of the input signal to be used in evaluating the nonspectral input signal characteristics. Energy and voicing parameters are useful in distinguishing non-speech events since speech patterns typically have few very low-energy frames (other than frames that are part of a gap within a vocabulary item) and more than a minimum number of voiced frames, e.g., frames corresponding to vowel sounds.
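As a concrete illustration of the two-stage procedure the abstract describes (a sketch of the general technique, not the patent's actual implementation), the following Python code time-warps a reference template against the input using only spectral frames, takes the input frames covered by the warp path as the scan region, and then adjusts the spectral distance score with energy and voicing checks. All threshold and penalty values here are illustrative assumptions.

```python
def dtw_align(ref, inp):
    """Dynamic time warping over spectral frames only (no nonspectral
    parameters enter the alignment). ref and inp are lists of feature
    vectors; returns (total_distance, path of (ref_idx, inp_idx) pairs)."""
    def frame_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n_r, n_i = len(ref), len(inp)
    INF = float("inf")
    cost = [[INF] * (n_i + 1) for _ in range(n_r + 1)]
    cost[0][0] = 0.0
    for i in range(1, n_r + 1):
        for j in range(1, n_i + 1):
            d = frame_dist(ref[i - 1], inp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # Backtrack to recover the frame pairing (the warp path).
    path, i, j = [], n_r, n_i
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n_r][n_i], path


def adjust_distance(distance, path, energy, voiced,
                    low_energy_thresh=0.1, max_low_energy=1,
                    min_voiced=2, penalty=100.0):
    """Penalize a candidate whose scan region looks like a non-speech
    event: too many very low-energy frames, or too few voiced frames.
    Thresholds and penalty are illustrative, not from the patent."""
    scan = {j for _, j in path}  # input frames covered by the warp path
    n_low = sum(1 for j in scan if energy[j] < low_energy_thresh)
    n_voiced = sum(1 for j in scan if voiced[j])
    adjusted = distance
    if n_low > max_low_energy:
        adjusted += penalty
    if n_voiced < min_voiced:
        adjusted += penalty
    return adjusted
```

Recognition would then select the reference template with the smallest adjusted distance; a smaller distance here plays the role of a larger similarity measure. Because the nonspectral checks run only over each template's own scan region, there is no need to weight spectral against nonspectral parameters during the warp itself.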
11 Claims
1. In a speech recognizer having a plurality of stored reference pattern templates each comprising a time frame sequence of acoustic spectral parameters of a prescribed reference pattern, a method for processing an input signal to recognize a speech pattern comprising
generating a time frame sequence of acoustic spectral parameters from said input signal,
generating a time frame sequence of acoustic nonspectral parameters from said input signal,
time aligning each of said reference pattern templates with said input signal based on reference pattern and input signal spectral parameters but independent of said nonspectral parameters,
determining a set of similarity measures each representative of the similarity between spectral parameters of said input signal and spectral parameters of one of the time aligned reference pattern templates and
selectively identifying said speech pattern in said input signal as one of said reference patterns based both on said similarity measures and on said nonspectral parameters,
wherein said time aligning comprises
for each of said reference patterns, pairing time frames of that reference pattern template with time frames of said input signal to maximize the similarity measure determined for that reference pattern, said pairing defining a scan region of input signal time frames for that reference pattern,
wherein said selectively identifying comprises
for each of said reference patterns, adjusting the determined similarity measure based on said nonspectral parameters over the scan region of input signal time frames for that reference pattern and
selectively identifying said speech pattern in said input signal as one of said reference patterns based on said adjusted similarity measures.
9. In a speech recognizer having a plurality of stored reference pattern templates each comprising a time frame sequence of acoustic spectral parameters of a prescribed reference pattern, a method for processing an input signal to recognize a speech pattern comprising
generating a time frame sequence of acoustic spectral parameters from said input signal,
generating a time frame sequence of voicing parameters from said input signal, each of said voicing parameters defining the presence or absence of a vowel sound,
determining a set of similarity measures each representative of the similarity between spectral parameters of said input signal and spectral parameters of one of the reference pattern templates and
selectively identifying said speech pattern in said input signal as one of said reference patterns based both on said similarity measures and on said voicing parameters.
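The claim leaves open how per-frame voicing parameters are derived from the raw signal. One common heuristic (an illustrative sketch, not the patent's method, with assumed threshold values) classifies a frame as voiced when it combines relatively high energy with a low zero-crossing rate, as vowel sounds do:

```python
import math

def frame_is_voiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Crude voiced/unvoiced decision for one time frame of samples:
    vowel-like (voiced) frames have high energy and few zero crossings.
    Thresholds are illustrative and would be tuned in practice."""
    n = len(frame)
    energy = sum(s * s for s in frame) / n
    zcr = sum(1 for a, b in zip(frame, frame[1:])
              if (a < 0.0) != (b < 0.0)) / (n - 1)
    return energy > energy_thresh and zcr < zcr_thresh

# A 100 Hz tone sampled at 8 kHz looks voiced; rapidly alternating
# low-level noise does not.
tone = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(160)]
noise = [0.05 if t % 2 else -0.05 for t in range(160)]
```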
10. A speech recognizer for processing an input signal to recognize a speech pattern comprising
memory means for storing a plurality of reference pattern templates each comprising a time frame sequence of acoustic spectral parameters of a prescribed reference pattern and
digital signal processor means comprising
means responsive to said input signal for generating a time frame sequence of acoustic spectral parameters,
means responsive to said input signal for generating a time frame sequence of acoustic nonspectral parameters,
means for time aligning each of said reference pattern templates with said input signal based on reference pattern and input signal spectral parameters but independent of said nonspectral parameters,
means for determining a set of similarity measures each representative of the similarity between spectral parameters of said input signal and spectral parameters of one of the time aligned reference pattern templates and
means for selectively identifying said speech pattern in said input signal as one of said reference patterns based both on said similarity measures and on said nonspectral parameters,
wherein said time aligning means comprises
means for pairing, for each of said reference patterns, time frames of that reference pattern template with time frames of said input signal to maximize the similarity measure determined by said determining means for that reference pattern, said pairing defining a scan region of input signal time frames for that reference pattern,
wherein said selectively identifying means comprises
means for adjusting, for each of said reference patterns, the determined similarity measure based on said at least one nonspectral parameter and
means for selectively identifying said speech pattern in said input signal as one of said reference patterns based on said adjusted similarity measures.
11. A speech recognizer for processing an input signal to recognize a speech pattern comprising
memory means for storing a plurality of reference pattern templates each comprising a time frame sequence of acoustic spectral parameters of a prescribed reference pattern and
digital signal processor means comprising
means responsive to said input signal for generating a time frame sequence of acoustic spectral parameters,
means responsive to said input signal for generating a time frame sequence of voicing parameters, each of said voicing parameters defining the presence or absence of a vowel sound,
means for determining a set of similarity measures each representative of the similarity between spectral parameters of said input signal and spectral parameters of one of the reference pattern templates and
means for selectively identifying said speech pattern in said input signal as one of said reference patterns based both on said similarity measures and on said voicing parameters.
Specification