Statistical methods and apparatus for pitch extraction in speech recognition, synthesis and regeneration
Abstract
A method and apparatus for extracting pitch value information from speech. The method selects at least the three highest peaks from a normalized autocorrelation function and produces a plurality of frequency candidates for pitch value determination. The frequency candidates are used to identify anchor points in pitch values, and are further used to perform both forward and backward searching when an anchor point cannot be readily identified. A running average of the determined pitch values is maintained and used, together with the identified valid pitch values, in the final pitch estimation, which uses a weighted least squares fit for identified non-valid frames.
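The peak-selection step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the `min_lag` cutoff, the local-maximum test, and the synthetic 200 Hz test frame are all assumptions made for illustration.

```python
import math


def normalized_autocorrelation(frame, max_lag):
    """Return r[0..max_lag], scaled so that r[0] == 1.0 for a non-silent frame."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return [0.0] * (max_lag + 1)
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        for lag in range(max_lag + 1)
    ]


def top_peak_frequencies(acf, sample_rate, min_lag, count=3):
    """Pick the `count` highest local maxima at lag >= min_lag; convert lags to Hz.

    min_lag is a hypothetical guard that excludes the trivial peak near lag 0.
    """
    peaks = [
        (acf[lag], lag)
        for lag in range(max(min_lag, 1), len(acf) - 1)
        if acf[lag - 1] < acf[lag] >= acf[lag + 1]
    ]
    peaks.sort(reverse=True)  # highest autocorrelation value first
    return [sample_rate / lag for _, lag in peaks[:count]]


# Synthetic voiced frame: a pure 200 Hz tone sampled at 8 kHz (assumed values).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
acf = normalized_autocorrelation(frame, max_lag=200)
freqs = top_peak_frequencies(acf, sample_rate, min_lag=20)
```

For this idealized tone the strongest peak sits at a lag of one pitch period, and the next peaks fall at integer multiples of that lag, which is one reason more than a single peak is kept as a candidate.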
11 Claims
1. A method for pitch extraction in speech recognition, synthesis and regeneration comprising the steps of:
performing autocorrelation of a digitized speech input to produce an autocorrelation function;
selecting at least the three highest peaks from the autocorrelation function;
calculating top ranked frequencies for the at least three highest peaks;
determining a plurality of frequency candidates from the calculated frequencies;
identifying valid and non-valid frames of the input speech;
determining pitch values for each frame of the received input speech using the positions of the selected peaks and an energy value representing the instantaneous voice energy;
maintaining a running average of determined pitch values; and
performing a weighted dynamic least squares fit of the identified valid and non-valid frames to estimate the pitch value using a least squares fit to a cubic function.
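The final step of claim 1, a weighted least squares fit to a cubic, can be sketched as follows. This is an illustrative sketch only: the weights (1.0 for valid frames, 0.1 for non-valid frames), the frame indices, and the pitch values are assumed, and the "dynamic" aspect of the claimed fit (how the fitting window moves over the utterance) is not modeled.

```python
def weighted_cubic_fit(times, values, weights):
    """Return [c0, c1, c2, c3] minimizing sum w * (c0 + c1*t + c2*t^2 + c3*t^3 - v)^2."""
    size = 4
    # Normal equations A c = b: A[j][k] = sum w*t^(j+k), b[j] = sum w*v*t^j.
    a = [[0.0] * size for _ in range(size)]
    b = [0.0] * size
    for t, v, w in zip(times, values, weights):
        powers = [t ** k for k in range(2 * size - 1)]
        for j in range(size):
            b[j] += w * v * powers[j]
            for k in range(size):
                a[j][k] += w * powers[j + k]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in reversed(range(size)):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def evaluate_cubic(coeffs, t):
    """Evaluate c0 + c1*t + c2*t^2 + c3*t^3."""
    return sum(c * t ** k for k, c in enumerate(coeffs))


# Hypothetical raw pitch track (Hz); frame 3 is non-valid and gets a low weight.
times = [0, 1, 2, 3, 4, 5, 6]
raw_pitch = [120.0, 122.0, 125.0, 123.0, 128.0, 130.0, 131.0]
is_valid = [True, True, True, False, True, True, True]
weights = [1.0 if ok else 0.1 for ok in is_valid]
coeffs = weighted_cubic_fit(times, raw_pitch, weights)
smoothed = [evaluate_cubic(coeffs, t) for t in times]
```

Giving non-valid frames a small but nonzero weight lets them nudge the curve without dominating it; the actual weighting scheme is not stated in the claim, so the values here are placeholders.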
5. An apparatus for pitch extraction in speech recognition, synthesis and regeneration comprising:
input means for receiving a speech waveform;
processing means connected to said input means for receiving said speech waveform;
means for generating an autocorrelation function of the input speech waveform and extracting raw pitch values from frames of the autocorrelation function of said input speech waveform by using acoustic occurrences that occur both prior to and after a moment of pitch;
maintaining a running average of determined raw pitch values;
means for estimating true pitch values by processing the raw pitch values using a weighted dynamic least squares process using a least squares fit to a cubic function.
8. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for extracting pitch from a speech signal, the method comprising the steps of:
performing autocorrelation of a digitized speech waveform to produce an autocorrelation function;
selecting at least three highest peaks from the autocorrelation function for each frame of the digitized waveform;
selecting a plurality of frequency candidates for each frame, the frequency candidates being three top-ranked frequencies calculated from the at least three highest peaks and at least the first and second harmonics of the at least three calculated top-ranked frequencies;
determining a raw pitch value for each frame using the plurality of frequency candidates and an energy value representing instantaneous voice energy;
maintaining a running average of determined raw pitch values;
identifying valid and non-valid frames of the input speech, wherein the valid frames have a determined raw pitch value and the non-valid frames do not have a determined raw pitch value;
assigning the running average of the determined raw pitch values as the raw pitch value for an identified non-valid frame; and
performing a weighted dynamic least squares fit of the identified valid and non-valid frames to estimate the pitch value using a least squares fit to a cubic function.
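Two steps of claim 8 lend themselves to a short sketch: building the candidate set from the top-ranked frequencies and their harmonics, and assigning the running average to non-valid frames. The sketch below is illustrative only. It assumes that "first and second harmonics" means the multiples 2f and 3f, which the claim does not spell out, and it represents a frame with no determined raw pitch value as `None`; both function names are hypothetical.

```python
def frequency_candidates(top_frequencies, harmonic_multiples=(2, 3)):
    """Return each top-ranked frequency followed by its assumed harmonic multiples."""
    candidates = []
    for f in top_frequencies:
        candidates.append(f)
        candidates.extend(f * m for m in harmonic_multiples)
    return candidates


def fill_with_running_average(raw_pitch):
    """Replace missing raw pitch values (None) with the running average so far.

    Only determined (valid) values feed the average. A non-valid frame that
    occurs before any valid frame stays None, since no average exists yet.
    Returns (filled_values, valid_flags).
    """
    filled, flags = [], []
    total, count = 0.0, 0
    for value in raw_pitch:
        if value is not None:
            total += value
            count += 1
            filled.append(value)
            flags.append(True)
        else:
            filled.append(total / count if count else None)
            flags.append(False)
    return filled, flags


# Hypothetical values for illustration.
candidates = frequency_candidates([200.0, 100.0, 50.0])
filled, valid = fill_with_running_average([100.0, None, 110.0, None])
```

The `valid` flags returned alongside the filled track are what a later weighted fit would need in order to treat filled frames differently from measured ones.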
Specification