Pitch lag estimation using frequency-domain lowpass filtering of the linear predictive coding (LPC) residual
Abstract
A pitch estimation device and method use a multi-resolution approach to estimate a pitch lag value of input speech. The LPC residual of the speech is determined and sampled. A discrete Fourier transform (DFT) is applied to the residual samples and the resulting amplitude is squared. The squared amplitude is lowpass filtered, and a second DFT is then applied to it to transform the LPC residual samples into another domain. An initial, low-resolution pitch lag is found in that domain. A refinement algorithm, based on minimizing the prediction error in the time domain, is then applied to obtain a higher-resolution pitch lag. The refined pitch lag can then be used directly in the speech coding.
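The low-resolution stage described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the 8 kHz sampling rate, the lag range of 20 to 120 samples, and the naive O(N^2) DFT are all assumptions made here for readability. The 256-point transform size and the 1.6 kHz cutoff are taken from the claims.

```python
import cmath
import math


def dft(x, n):
    """Naive n-point DFT; x is zero-padded or truncated to n samples."""
    padded = list(x[:n]) + [0.0] * max(0, n - len(x))
    return [
        sum(padded[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def initial_pitch_lag(residual, n_fft=256, cutoff_hz=1600.0,
                      sample_rate=8000.0, min_lag=20, max_lag=120):
    """Rough pitch lag from the LPC residual (all defaults are assumptions)."""
    # First DFT of the residual, then the squared amplitude (no logarithm).
    power = [abs(c) ** 2 for c in dft(residual, n_fft)]

    # Frequency-domain lowpass: keep bins up to the cutoff and their mirror
    # images, zero everything else.
    cutoff_bin = int(cutoff_hz * n_fft / sample_rate)
    filtered = [p if min(k, n_fft - k) <= cutoff_bin else 0.0
                for k, p in enumerate(power)]

    # Second DFT directly over the filtered squared amplitude. The input is
    # real and symmetric, so the result is real up to rounding error.
    quasi_time = [c.real for c in dft(filtered, n_fft)]

    # The position of the largest value inside the allowed range is the
    # initial lag estimate.
    return max(range(min_lag, max_lag + 1), key=lambda lag: quasi_time[lag])


# Synthetic "residual": one pulse every 40 samples in a 240-sample window.
pulses = [1.0 if t % 40 == 0 else 0.0 for t in range(240)]
print(initial_pitch_lag(pulses))
```

Because the squared amplitude is transformed without a logarithm, the second DFT yields a smoothed, circular autocorrelation of the windowed residual rather than a cepstrum. A practical coder would use an FFT instead of the naive transform shown here.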
45 Claims
1. A system for estimating pitch lag for speech quantization and compression requiring substantially reduced complexity, the speech having a linear predictive coding (LPC) residual signal defined by a plurality of LPC residual samples, wherein the estimate of a current LPC residual sample is determined in the time domain according to a linear combination of past samples, further wherein the speech represents voiced and unvoiced speech falling within a typical frequency range having a fundamental frequency, the system comprising:
means for applying a first discrete Fourier transform (DFT) to the plurality of LPC residual samples, the first DFT having an associated amplitude;
means for squaring the amplitude of the first DFT, the squared amplitude having high and low frequency components;
a filter for filtering out the high frequency components of the squared amplitude in the frequency domain, thereby providing for substantially reduced system complexity, wherein frequencies between zero and at least two times the typical frequency range of the speech are retained to ensure that at least one harmonic is obtained to prevent confusion in detecting the fundamental frequency;
means for applying a second DFT directly over the squared amplitude without taking the logarithm of the squared amplitude, the second DFT having associated quasi-time domain-transformed samples; and
means for determining an initial pitch lag value according to the time domain-transformed samples.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
9. A system operable with a computer for estimating pitch lag for input speech quantization and compression requiring substantially reduced complexity on the order of three times less complexity than standard pitch detection methods, the speech having a linear predictive coding (LPC) residual signal defined by a plurality of LPC residual samples, wherein the estimated pitch lag falls within a predetermined minimum and maximum pitch lag value range, further wherein the speech represents voiced and unvoiced speech within a typical frequency range having a fundamental frequency, the system comprising:
means for selecting a pitch analysis window among the LPC residual samples, the pitch analysis window being at least twice as large as the maximum pitch lag value;
means for applying a first discrete Fourier transform (DFT) to the windowed plurality of LPC residual samples, the first DFT having an associated amplitude spectrum, the amplitude spectrum having low and high frequency components;
a filter for filtering out the high frequency components of the amplitude spectrum in the frequency domain, thereby providing for substantially reduced system complexity, wherein frequencies between zero and at least two times the typical frequency range of the speech are retained to ensure that at least one harmonic is detected to prevent confusion in detecting the fundamental frequency;
means for applying a second DFT directly over the amplitude spectrum of the first DFT without taking the logarithm of the squared amplitude, the second DFT being a 256-point DFT and having associated quasi-time domain-transformed samples such that the quasi-time domain-transformed samples are real values;
means for applying a weighted average to the time domain-transformed samples, wherein at least two samples are combined to produce a single sample;
means for searching the time-domain transformed speech samples to find at least one sample having a maximum peak value; and
means for estimating an initial pitch lag value according to the sample having the maximum peak value.
- View Dependent Claims (10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
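The weighted-average and peak-search elements of claim 9 might look something like the sketch below. The claim only states that at least two samples are combined into one; the two-tap form, the equal weights, and the function names used here are hypothetical choices for illustration.

```python
def weighted_average_pairs(samples, w0=0.5, w1=0.5):
    """Combine each sample with its right-hand neighbour (weights assumed)."""
    return [w0 * samples[i] + w1 * samples[i + 1]
            for i in range(len(samples) - 1)]


def find_peak_lag(samples, min_lag, max_lag):
    """Index of the largest sample between min_lag and max_lag inclusive."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: samples[lag])


# Toy quasi-time-domain samples, not real speech data.
smoothed = weighted_average_pairs([0.0, 2.0, 6.0, 4.0, 0.0])
print(smoothed, find_peak_lag(smoothed, 0, len(smoothed) - 1))
```

Averaging neighbouring samples before the search makes the peak pick less sensitive to a single noisy value, at the cost of a small shift in where the maximum appears.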
20. A speech coding apparatus for reproducing and coding input speech representing voiced and unvoiced speech within a typical frequency range of zero to 800 Hz having a fundamental frequency, the apparatus requiring substantially reduced complexity on the order of three times less complexity than standard autocorrelation methods, wherein the speech coding apparatus is operable with a linear predictive coding (LPC) excitation signal defining the decoded LPC residual of the input speech, LPC parameters, and an innovation codebook representing a plurality of vectors which are referenced to excite speech reproduction to generate speech, the speech coding apparatus comprising:
a computer for processing the LPC residual, wherein the computer includes;
means for segregating a current coding frame within the LPC residual,
means for dividing the coding frame into plural pitch subframes,
means for defining a pitch analysis window having N LPC residual samples, the pitch analysis window extending across the pitch subframes,
means for estimating an initial pitch lag value for each pitch subframe, including
means for applying a first discrete Fourier transform (DFT) to the N LPC residual samples, the first DFT having an associated amplitude,
means for squaring the amplitude of the first DFT, the squared amplitude having high and low frequency components,
a filter for filtering out the high frequency components of the squared amplitude in the frequency domain, thereby providing for substantially reduced system complexity, wherein frequencies between zero and at least 1.6 kHz, equivalent to two times the typical frequency range of the speech, are retained to ensure that at least one harmonic is obtained to prevent confusion in determining the fundamental frequency,
means for applying a second DFT directly over the squared amplitude without taking the logarithm of the squared amplitude, the second DFT being a 256-point DFT and having associated quasi-time domain-transformed samples such that the quasi-time domain-transformed samples are real values,
means for dividing each pitch subframe into multiple coding subframes, wherein the initial pitch lag estimate for each pitch subframe represents the lag estimate for the last coding subframe of each pitch subframe in the current coding frame,
means for linearly interpolating the estimated pitch lag values between the pitch subframes to determine a pitch lag estimate for each coding subframe, and
means for refining the linearly interpolated lag values of each coding subframe; and
speech output means for outputting speech reproduced according to the refined pitch lag values.
- View Dependent Claims (21, 22)
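The interpolation step in claim 20 can be illustrated with a short sketch. It assumes that each pitch-subframe estimate belongs to the last coding subframe of that pitch subframe, as the claim states, and that the lag of the previous frame's last coding subframe is available as a starting point. That starting value (`prev_lag`) and the function name are assumptions introduced here.

```python
def interpolate_lags(prev_lag, pitch_lags, coding_per_pitch):
    """Linearly interpolate one lag per coding subframe.

    prev_lag         -- lag at the end of the previous frame (assumed known)
    pitch_lags       -- initial lag estimate for each pitch subframe
    coding_per_pitch -- number of coding subframes in each pitch subframe
    """
    lags = []
    start = prev_lag
    for end in pitch_lags:
        # The last coding subframe of this pitch subframe lands exactly
        # on the pitch-subframe estimate.
        for j in range(1, coding_per_pitch + 1):
            lags.append(start + (end - start) * j / coding_per_pitch)
        start = end
    return lags


# Two pitch subframes with three coding subframes each (illustrative numbers).
print(interpolate_lags(40, [46, 40], 3))
```

The interpolated values are only starting points: the claim then refines the lag of each coding subframe separately.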
23. A speech coding apparatus for reproducing and coding input speech representing voiced and unvoiced speech within a typical frequency range of zero to 800 Hz having a fundamental frequency, the apparatus requiring substantially reduced complexity on the order of 1 million instructions per second (MIPS), three times less complexity than standard autocorrelation methods requiring at least 3 MIPS, the input speech being filtered by an inverse linear predictive coding (LPC) filter to obtain the LPC residual of the input speech, the speech coding apparatus comprising:
a computer for processing the LPC residual and estimating an initial pitch lag of the LPC residual, wherein the pitch lag is between a minimum and maximum pitch lag value, the computer including
means for defining a current pitch analysis window having N LPC residual samples, wherein N is at least two times the maximum pitch lag value,
means for applying a 256-point first discrete Fourier transform (DFT) to the LPC residual samples in the current pitch analysis window, the first DFT having an associated amplitude spectrum, the amplitude spectrum having high and low frequency signals,
a filter for filtering out the high frequency signals of the amplitude spectrum in the frequency domain, wherein frequencies between zero and at least 1.6 kHz, equivalent to two times the typical frequency range of the speech, are retained to ensure that at least one harmonic is obtained to prevent confusion in determining the fundamental frequency,
means for applying a 256-point second DFT directly over the amplitude of the first DFT to produce quasi-time domain-transformed samples without taking the logarithm of the squared amplitude,
means for applying a weighted average to the time domain-transformed samples, wherein at least two samples are combined to produce a single sample, and
means for searching the averaged time domain-transformed samples to find at least one peak, wherein the position of the highest peak represents the estimated pitch lag in the current pitch analysis window; and
speech output means for outputting speech reproduced according to the estimated pitch lag value.
- View Dependent Claims (24, 25, 26, 27, 28, 29, 30)
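Claim 23 obtains the LPC residual by passing the input speech through an inverse LPC filter. A minimal sketch of that filtering is shown below, assuming the common convention that the predicted sample is a weighted sum of past samples and the residual is the prediction error. The function name, the coefficient sign convention, and the zero initial state are assumptions, not details taken from the patent.

```python
def lpc_residual(speech, lpc_coeffs):
    """Prediction error e[n] = s[n] - sum_k a[k] * s[n - 1 - k].

    Samples before the start of the buffer are treated as zero (assumption).
    """
    residual = []
    for n, sample in enumerate(speech):
        predicted = sum(
            a * speech[n - 1 - k]
            for k, a in enumerate(lpc_coeffs)
            if n - 1 - k >= 0
        )
        residual.append(sample - predicted)
    return residual


# A first-order decaying signal is predicted perfectly after the first sample.
print(lpc_residual([1.0, 0.5, 0.25, 0.125], [0.5]))
```

For voiced speech the residual keeps the periodic excitation pulses while most of the spectral envelope is removed, which is why the pitch search operates on it rather than on the speech itself.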
31. A method of estimating pitch lag for quantization and compression of speech representing voiced and unvoiced speech within a typical frequency range of zero to 800 Hz having a fundamental frequency, the speech being represented by a linear predictive coding (LPC) residual which is defined by a plurality of LPC residual samples, wherein the estimation of a current LPC residual sample is determined in the time domain according to a linear combination of past samples, the method comprising the steps of:
applying a first discrete Fourier transform (DFT) to the LPC residual samples, the first DFT having an associated amplitude;
squaring the amplitude of the first DFT, the squared amplitude having high and low frequency components;
filtering out the high frequency components of the squared amplitude in the frequency domain, wherein frequencies between zero and at least 1.6 kHz are retained to ensure that at least one harmonic is obtained to accurately determine the fundamental frequency;
applying a second DFT directly over the filtered squared amplitude of the first DFT without taking the logarithm of the squared amplitude, to produce time domain-transformed LPC residual samples;
determining an initial pitch lag value according to the time domain-transformed LPC residual samples, the initial pitch lag value having an associated prediction error;
refining the initial pitch lag value using autocorrelation, wherein the associated prediction error is minimized; and
coding the LPC residual samples according to the refined pitch lag value.
- View Dependent Claims (32, 33, 34, 35, 36)
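The refinement step of claim 31 can be sketched as a small time-domain search around the initial lag. For a single-tap pitch predictor with optimal gain, minimizing the prediction error over a segment is equivalent to maximizing the squared cross-correlation divided by the energy of the delayed segment, and that is the criterion used below. The integer-lag search, the search radius, and the function name are simplifying assumptions; the patent's actual refinement may differ.

```python
def refine_pitch_lag(residual, start, length, initial_lag, search_radius=2):
    """Pick the lag near initial_lag that minimizes the prediction error."""
    best_lag, best_score = initial_lag, -1.0
    low = max(1, initial_lag - search_radius)
    for lag in range(low, initial_lag + search_radius + 1):
        if start - lag < 0:
            continue  # not enough past samples for this lag
        segment = range(start, start + length)
        cross = sum(residual[n] * residual[n - lag] for n in segment)
        energy = sum(residual[n - lag] ** 2 for n in segment)
        if cross <= 0.0 or energy <= 0.0:
            continue  # a negative or undefined gain is not useful here
        score = cross * cross / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Pulse train with period 40; the rough estimate of 38 is pulled to 40.
pulses = [1.0 if t % 40 == 0 else 0.0 for t in range(240)]
print(refine_pitch_lag(pulses, start=120, length=80, initial_lag=38))
```

Restricting the search to a few candidates around the initial estimate is what keeps the overall cost below that of a full autocorrelation search over every allowed lag.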
37. A speech coding method for reproducing and coding input speech operable with a computer system requiring substantially reduced complexity on the order of three times less complexity than standard autocorrelation systems, the speech representing voiced and unvoiced speech within a typical frequency range of zero to 800 Hz having a fundamental frequency, wherein the speech is represented by a linear predictive coding (LPC) excitation signal defining the decoded LPC residual of the input speech, the method comprising the steps of:
processing the LPC residual and estimating an initial pitch lag of the LPC residual, wherein the pitch lag is between a minimum and maximum pitch lag value;
defining a current pitch analysis window having N LPC residual samples, wherein N is at least two times the maximum pitch lag value;
applying a 256-point first discrete Fourier transform (DFT) to the LPC residual samples in the current pitch analysis window, the first DFT having an associated amplitude spectrum having high and low frequency components;
filtering out the high frequency components of the amplitude spectrum of the first DFT in the frequency domain, wherein frequencies between zero and at least 1.6 kHz, equivalent to two times the typical frequency range of the speech, are retained to ensure that at least one harmonic is obtained to prevent confusion in determining the fundamental frequency;
applying a 256-point second DFT directly over the amplitude of the first DFT without taking the logarithm of the squared amplitude to produce time domain-transformed samples such that the time domain-transformed samples are real values and the spectrum phase information is preserved;
applying a weighted average to the time domain-transformed samples, wherein at least two samples are combined to produce a single sample; and
searching the averaged time domain-transformed samples to find at least one peak, wherein the position of the highest peak represents the estimated pitch lag in the current pitch analysis window; and
speech output means for outputting speech reproduced according to the estimated pitch lag value.
- View Dependent Claims (38, 39, 40, 41, 42, 43, 44)
45. A speech coding method for reproducing and coding input speech representing voiced and unvoiced speech within a typical frequency range of zero to 800 Hz having a fundamental frequency, the method requiring substantially reduced complexity on the order of 1 million instructions per second (MIPS), three times less complexity than standard autocorrelation methods requiring at least 3 MIPS, the speech coding apparatus operable with a linear predictive coding (LPC) excitation signal defining the decoded LPC residual of the input speech, LPC parameters, and an innovation codebook representing pseudo-random signals which form a plurality of vectors which are referenced to excite speech reproduction to generate speech, the speech coding method comprising the steps of:
receiving and processing the input speech;
processing the input speech, wherein the step of processing includes;
determining the LPC residual of the input speech,
determining a coding frame within the LPC residual,
subdividing the coding frame into plural pitch subframes,
defining a pitch analysis window having N LPC residual samples, the pitch analysis window extending across the pitch subframes,
roughly estimating an initial pitch lag value for each pitch subframe, by
applying a first discrete Fourier transform (DFT) to the LPC residual samples, the first DFT having an associated amplitude,
squaring the amplitude of the first DFT, the squared amplitude having phase information and being represented by low and high frequency components,
filtering out the high frequency components of the squared amplitude in the frequency domain to retain frequencies between zero and at least 1.6 kHz to ensure that at least one harmonic is found to accurately determine the fundamental frequency,
applying a second DFT directly over the squared amplitude of the first DFT without taking the logarithm of the squared amplitude to produce time domain-transformed LPC residual samples, the second DFT being a 256-point DFT such that the time domain-transformed LPC residual samples are real values,
determining an initial pitch lag value according to the time domain-transformed LPC residual samples,
dividing each pitch subframe into multiple coding subframes, such that the initial pitch lag estimate for each pitch subframe represents the lag estimate for the last coding subframe of each pitch subframe, and
interpolating the estimated pitch lag values between the pitch subframes for determining a pitch lag estimate for each coding subframe, and
refining the linearly interpolated lag values; and
outputting speech reproduced according to the refined pitch lag values.
Specification