Systems and methods for speech recognition using frequency domain linear prediction polynomials to form temporal and spectral envelopes from frequency domain representations of signals
First Claim
1. An electronic method of extracting speech features from signals for use in performing automatic speech recognition, the method comprising using a computer processor to perform:
receiving a signal;
performing a time-to-frequency domain transformation on at least a portion of the received signal to generate a frequency domain representation;
dividing the frequency domain representation into a plurality of frequency bands;
fitting a FDLP polynomial to each of the plurality of frequency bands;
performing a frequency-to-time domain transformation to extract temporal envelopes from each of the plurality of frequency bands using the fitted FDLP polynomial;
constructing spectral envelopes by taking a plurality of points at each of a plurality of time values in the temporal envelopes, wherein each of the spectral envelopes has points taken at a particular one of the time values;
fitting a smooth envelope to each of the spectral envelopes; and
generating at least one speech feature based at least in part on the temporal and spectral envelopes of each of the plurality of frequency bands.
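The steps recited in the claim can be sketched in code. The following is a minimal illustration under stated assumptions, not the patented implementation: the DCT as the time-to-frequency transformation, uniformly spaced rectangular bands, autocorrelation-method linear prediction, and a truncated cosine-series fit standing in for the "smooth envelope" are all choices made here for illustration. The function names and the parameters `n_bands`, `order`, `n_points`, and `n_coef` are hypothetical.

```python
import numpy as np
from scipy.fft import dct, idct
from scipy.linalg import solve_toeplitz


def all_pole_envelope(seq, order, n_points):
    """Fit an all-pole (linear prediction) model to `seq` and sample its envelope."""
    n = len(seq)
    # Biased autocorrelation of the sequence for lags 0..order.
    r = np.array([np.dot(seq[: n - k], seq[k:]) for k in range(order + 1)]) / n
    r[0] = r[0] * (1.0 + 1e-6) + 1e-12  # small ridge for stability (assumption)
    # Yule-Walker equations: Toeplitz(r[0..order-1]) @ coeffs = -r[1..order].
    coeffs = solve_toeplitz(r[:order], -r[1 : order + 1])
    gain = r[0] + np.dot(coeffs, r[1 : order + 1])  # prediction error power
    a = np.concatenate(([1.0], coeffs))
    # Evaluate A(e^{jw}) on n_points values of w in [0, pi). Because the model
    # was fitted to frequency domain data, this axis corresponds to time.
    A = np.fft.rfft(a, n=2 * n_points)[:n_points]
    return gain / np.abs(A) ** 2


def subband_temporal_envelopes(signal, n_bands=8, order=12, n_points=100):
    """Return an (n_bands, n_points) array of temporal envelopes, one per band."""
    c = dct(signal, type=2, norm="ortho")  # time-to-frequency transformation
    edges = np.linspace(0, len(c), n_bands + 1).astype(int)  # uniform bands (assumption)
    return np.stack(
        [all_pole_envelope(c[lo:hi], order, n_points)
         for lo, hi in zip(edges[:-1], edges[1:])]
    )


def smooth_spectral_envelopes(temporal_env, n_coef=4):
    """Slice the envelopes at each time value and smooth each slice across bands."""
    # Row t holds one point per band, all taken at the same time value.
    log_slices = np.log(temporal_env.T)
    cep = dct(log_slices, type=2, norm="ortho", axis=1)
    kept = cep.copy()
    kept[:, n_coef:] = 0.0  # keep only the low-order cosine terms
    smooth = np.exp(idct(kept, type=2, norm="ortho", axis=1))
    # The retained coefficients serve as a simple per-time feature vector.
    return smooth, cep[:, :n_coef]
```

Calling `subband_temporal_envelopes(x)` on a one-dimensional signal `x` and passing the result to `smooth_spectral_envelopes` yields one smoothed spectral envelope and one feature vector per time value. Truncating an orthonormal cosine series is a least-squares fit of a low-order basis, which is why it is used here as a stand-in smoother; the claim itself does not prescribe a particular smoothing method.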
Abstract
In accordance with the present invention, computer-implemented methods and systems are provided for representing and modeling the temporal structure of audio signals. In response to receiving a signal, a time-to-frequency domain transformation is performed on at least a portion of the received signal, converting it from a time domain representation to a frequency domain representation. Frequency domain linear prediction (FDLP) is then performed on the frequency domain representation to estimate a temporal envelope of the frequency domain representation. Based on the temporal envelope, one or more speech features are generated.
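As a rough illustration of the abstract, the sketch below estimates a single full-band temporal envelope by applying linear prediction to the DCT of a signal. It assumes the DCT as the time-to-frequency transformation and the autocorrelation method for the prediction step; the function names and the parameters `order` and `n_points` are hypothetical and are not taken from the specification.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz


def lpc_autocorr(seq, order):
    """Autocorrelation-method linear prediction; returns (a, gain) with a[0] == 1."""
    n = len(seq)
    # Biased autocorrelation for lags 0..order.
    r = np.array([np.dot(seq[: n - k], seq[k:]) for k in range(order + 1)]) / n
    r[0] = r[0] * (1.0 + 1e-6) + 1e-12  # small ridge for stability (assumption)
    # Solve the Yule-Walker equations for the predictor coefficients.
    coeffs = solve_toeplitz(r[:order], -r[1 : order + 1])
    gain = r[0] + np.dot(coeffs, r[1 : order + 1])  # prediction error power
    return np.concatenate(([1.0], coeffs)), gain


def fdlp_envelope(signal, order=20, n_points=256):
    """Estimate a temporal envelope by linear prediction in the frequency domain."""
    c = dct(signal, type=2, norm="ortho")  # frequency domain representation
    a, gain = lpc_autocorr(c, order)
    # Sampling the all-pole model on [0, pi) traces the envelope over the
    # duration of the signal: index i maps to time i / n_points of the signal.
    A = np.fft.rfft(a, n=2 * n_points)[:n_points]
    return gain / np.abs(A) ** 2
```

Ordinary linear prediction on time samples yields a smooth model of the power spectrum. Applying the same procedure to frequency domain coefficients swaps the roles of the two axes, so the all-pole model follows the signal's energy over time instead.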
9 Claims
1. An electronic method of extracting speech features from signals for use in performing automatic speech recognition, the method comprising using a computer processor to perform:
receiving a signal;
performing a time-to-frequency domain transformation on at least a portion of the received signal to generate a frequency domain representation;
dividing the frequency domain representation into a plurality of frequency bands;
fitting a FDLP polynomial to each of the plurality of frequency bands;
performing a frequency-to-time domain transformation to extract temporal envelopes from each of the plurality of frequency bands using the fitted FDLP polynomial;
constructing spectral envelopes by taking a plurality of points at each of a plurality of time values in the temporal envelopes, wherein each of the spectral envelopes has points taken at a particular one of the time values;
fitting a smooth envelope to each of the spectral envelopes; and
generating at least one speech feature based at least in part on the temporal and spectral envelopes of each of the plurality of frequency bands.
Dependent claims: 2, 3, 4, 5, 6
7. A system for extracting speech features from signals for use in performing automatic speech recognition, the system comprising:
means for receiving a signal;
means for performing a time-to-frequency domain transformation on at least a portion of the received signal to generate a frequency domain representation;
means for dividing the frequency domain representation into a plurality of frequency bands;
means for fitting a FDLP polynomial to each of the plurality of frequency bands;
means for performing a frequency-to-time domain transformation to extract temporal envelopes from each of the plurality of frequency bands using the fitted FDLP polynomial;
means for constructing spectral envelopes by taking a plurality of points at each of a plurality of time values in the temporal envelopes, wherein each of the spectral envelopes has points taken at a particular one of the time values;
means for fitting a smooth envelope to each of the spectral envelopes; and
means for generating at least one speech feature based at least in part on the temporal and spectral envelopes of each of the plurality of frequency bands.
8. A system for extracting speech features from signals for use in performing automatic speech recognition, the system comprising:
a processor at least partially executing a modeling application configured to:
receive a signal;
perform a time-to-frequency domain transformation on at least a portion of the received signal to generate a frequency domain representation;
divide the frequency domain representation into a plurality of frequency bands;
fit a FDLP polynomial to each of the plurality of frequency bands;
perform a frequency-to-time domain transformation to extract temporal envelopes from each of the plurality of frequency bands using the fitted FDLP polynomial;
construct spectral envelopes by taking a plurality of points at each of a plurality of time values in the temporal envelopes, wherein each of the spectral envelopes has points taken at a particular one of the time values;
fit a smooth envelope to each of the spectral envelopes; and
generate at least one speech feature based at least in part on the temporal and spectral envelopes of each of the plurality of frequency bands.
9. A computer readable medium for storing computer executable instructions for extracting speech features from signals, the executable instructions comprising the steps of:
receiving a signal;
performing a time-to-frequency domain transformation on at least a portion of the received signal to generate a frequency domain representation;
dividing the frequency domain representation into a plurality of frequency bands;
fitting a FDLP polynomial to each of the plurality of frequency bands;
performing a frequency-to-time domain transformation to extract temporal envelopes from each of the plurality of frequency bands using the fitted FDLP polynomial;
constructing spectral envelopes by taking a plurality of points at each of a plurality of time values in the temporal envelopes, wherein each of the spectral envelopes has points taken at a particular one of the time values;
fitting a smooth envelope to each of the spectral envelopes; and
generating at least one speech feature based at least in part on the temporal and spectral envelopes of each of the plurality of frequency bands.
Specification