Speech analysis/synthesis system with silence suppression
Abstract
Silence suppression in speech synthesis systems is achieved by detecting and processing only segments of voice activity. A segment is classified as "speech" if the energy of the signal is greater than an adaptively adjusted threshold. The adaptively adjusted threshold is preferably defined as the maximum of scaled values of two separate envelope parameters, both of which track the variation in energy over the sequence of frames of speech data. One parameter is a slow-rising, fast-falling value, which is updated only during unvoiced speech frames and therefore tracks a lower envelope of the energy contour. This parameter in effect tracks an ambient noise level. The other parameter is a fast-rising, slow-falling value, which is updated only during voiced speech frames and thus tracks an upper envelope of the energy contour. (This in effect tracks the average speech level.) A nonsilent energy tracker and a silent energy tracker adjust corresponding energy values representing the energy contours.
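The two envelope parameters and the threshold described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the class name `EnergyTrackers`, the first-order smoothing update, and every numeric constant (`slow`, `fast`, `noise_scale`, `speech_scale`) are assumptions made for the example and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class EnergyTrackers:
    """Hypothetical pair of envelope trackers; all constants are assumed."""
    noise: float = 0.0         # lower envelope (ambient noise level)
    speech: float = 0.0        # upper envelope (average speech level)
    slow: float = 0.05         # assumed slow adaptation rate
    fast: float = 0.5          # assumed fast adaptation rate
    noise_scale: float = 2.0   # assumed scale applied to the lower envelope
    speech_scale: float = 0.1  # assumed scale applied to the upper envelope

    def update(self, energy: float, voiced: bool) -> None:
        if voiced:
            # Upper envelope: fast-rising, slow-falling, voiced frames only.
            rate = self.fast if energy > self.speech else self.slow
            self.speech += rate * (energy - self.speech)
        else:
            # Lower envelope: slow-rising, fast-falling, unvoiced frames only.
            rate = self.slow if energy > self.noise else self.fast
            self.noise += rate * (energy - self.noise)

    def threshold(self) -> float:
        # Maximum of the scaled values of the two envelope parameters.
        return max(self.noise_scale * self.noise,
                   self.speech_scale * self.speech)


trackers = EnergyTrackers(noise=1.0, speech=100.0)
trackers.update(3.0, voiced=False)    # noise estimate creeps upward slowly
trackers.update(200.0, voiced=True)   # speech estimate jumps upward quickly
print(trackers.threshold())
```

Taking the maximum means the threshold follows whichever scaled envelope is currently higher, so a loud talker in a quiet room and a quiet talker in a noisy room are both handled by the same rule.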
2 Claims
1. A speech coding system, comprising:
an analyzer connected to receive speech input data and to generate therefrom a sequence of frames of speech parameters, said frames each having plural parameters including an energy value;
a buffer connected to said analyzer for storing up to a predetermined number of said frames;
a nonsilent energy tracker for adjusting a value representing an energy contour for nonsilent frames;
a silent energy tracker for adjusting a value representing an energy contour for silent frames; and
silence suppression means connected to said buffer, and to said silent and nonsilent energy trackers, for identifying each frame as silent or nonsilent, wherein said silence suppression means, once a nonsilent frame has been identified, identifies a silent frame only when a continuous succession of frames having an energy less than a predetermined function of the silent energy contour value is generated, and wherein said silence suppression means, once a silent frame has been identified, identifies a nonsilent frame only when a voiced frame having an energy higher than a predetermined function of the nonsilent energy contour value is generated;
wherein, when a silent frame is identified following a nonsilent frame, all previous frames in said buffer which have an energy less than a predetermined function of the silent energy contour value are retroactively identified as silent;
and wherein, when a nonsilent voiced frame is identified following a silent frame, all previous frames in said buffer which have an energy value greater than a predetermined function of the nonsilent energy contour value, and which are not separated from the nonsilent voiced frame by more than a selected number of frames having an energy level less than the predetermined function of the nonsilent energy contour value, are identified as nonsilent frames.
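The final clause of claim 1, the retroactive relabeling of buffered frames when a voiced onset is found, can be illustrated with a short sketch. This is one possible reading of the claim language, not the patented implementation: the function name `backfill_onset`, its parameters, and the choice to stop scanning once more than `max_gap` low-energy frames have been passed are assumptions made for the example.

```python
def backfill_onset(energies, labels, onset, threshold, max_gap):
    """Return a copy of `labels` (True = nonsilent) after an onset at `onset`.

    Walks backward through the buffered frames. A frame whose energy exceeds
    `threshold` becomes nonsilent as long as no more than `max_gap`
    low-energy frames lie between it and the voiced onset frame.
    """
    labels = list(labels)
    labels[onset] = True              # the voiced frame that triggered onset
    low_seen = 0                      # low-energy frames passed so far
    for i in range(onset - 1, -1, -1):
        if energies[i] > threshold:
            labels[i] = True
        else:
            low_seen += 1
            if low_seen > max_gap:    # too far from the onset: stop scanning
                break
    return labels


energies = [9, 1, 8, 1, 1, 7, 20]     # frame 6 is the voiced onset
labels = [False] * len(energies)      # all frames provisionally silent
print(backfill_onset(energies, labels, onset=6, threshold=5, max_gap=2))
# -> [False, False, True, False, False, True, True]
```

In this reading, frame 2 is recovered because only two low-energy frames separate it from the onset, while frame 0 is left silent because a third low-energy frame intervenes. The low-energy frames themselves keep their silent label, since the claim relabels only frames above the threshold.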
2. A method for identifying frames of speech in a sequence as silent or nonsilent, comprising the steps of:
(a) buffering a selected number of frames for which identification as silent or nonsilent may be changed;
(b) maintaining an updated nonsilent energy value representing the energies of frames identified as nonsilent;
(c) maintaining an updated silent energy value representing the energies of frames identified as silent;
(d) maintaining a threshold value which is selected from a first function of the updated nonsilent energy value and a second function of the updated silent energy value;
(e) once a nonsilent frame has been identified, only identifying a silent frame after a preselected number of consecutive frames have energies less than the threshold value, and retroactively identifying preceding frames having energies less than the threshold value as silent; and
(f) once a silent frame has been identified, only identifying a nonsilent frame after a voiced frame having an energy greater than the threshold is received, and retroactively identifying preceding frames having energies greater than the threshold, and separated from the voiced frame by less than a selected number of frames having energies less than the threshold, as nonsilent.
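Step (e) of claim 2, in which silence is declared only after a run of low-energy frames and buffered frames are then relabeled, might look like the sketch below. It is a simplified illustration under stated assumptions: the threshold is held fixed rather than updated per steps (b) through (d), and the name `run_until_silence` and the parameters `hangover` and `buffer_size` are hypothetical.

```python
from collections import deque


def run_until_silence(energies, threshold, hangover, buffer_size):
    """Consume frames while in the nonsilent state.

    Returns (labels, consumed): labels[i] is True for nonsilent frames, and
    `consumed` is the number of frames read before the state switched to
    silent (or len(energies) if it never switched).
    """
    labels = []
    buffer = deque(maxlen=buffer_size)    # indices that may still be relabeled
    run = 0                               # current run of low-energy frames
    for i, energy in enumerate(energies):
        labels.append(True)               # provisional label: nonsilent
        buffer.append(i)
        run = run + 1 if energy < threshold else 0
        if run >= hangover:
            # Silence declared: relabel every buffered low-energy frame.
            for j in buffer:
                if energies[j] < threshold:
                    labels[j] = False
            return labels, i + 1
    return labels, len(energies)


labels, consumed = run_until_silence(
    [10, 2, 10, 10, 1, 1, 1, 10], threshold=5, hangover=3, buffer_size=4)
print(labels, consumed)
# -> [True, True, True, True, False, False, False] 7
```

The brief dip at frame 1 does not end the speech segment because the run is shorter than `hangover`, and it is not relabeled afterward because it has already left the four-frame buffer. With a larger buffer, this literal reading would also mark that frame as silent.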
Specification