Voice synthesis utilizing multi-level filter excitation

US 4,890,328 A
Filed: 08/28/1985
Issued: 12/26/1989
Est. Priority Date: 08/28/1985
Status: Expired due to Term

Abstract

A speech analysis and synthesis system in which pitch information for excitation is transmitted during voiced segments of speech, and either pulse excitation or noise excitation is transmitted during unvoiced segments, along with linear predictive coding (LPC) parameters. The choice between noise excitation and pulse excitation is made by comparing the variance of the residual to the square of the mean amplitude of the rectified residual for each frame. If the result of this comparison is greater than a threshold value, pulse excitation is used; otherwise, noise excitation is used. The pulse excitation comprises a subset of samples of the LPC residual, determined by the relative amplitudes and spacing of the local maxima in the LPC residual.
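
The per-frame choice between pulse and noise excitation described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the abstract does not say how the two quantities are compared, so a ratio is assumed here, and the function name `choose_unvoiced_excitation` and the default threshold of 1.8 are hypothetical.

```python
def choose_unvoiced_excitation(residual, threshold=1.8):
    """Return "pulse" or "noise" for one unvoiced frame of LPC residual.

    Assumptions (not taken from the patent text): the comparison is a
    ratio of the two quantities, and `threshold` is a placeholder value.
    """
    n = len(residual)
    if n == 0:
        return "noise"
    mean = sum(residual) / n
    # Variance of the residual over the frame.
    variance = sum((x - mean) ** 2 for x in residual) / n
    # Mean amplitude of the rectified (absolute-value) residual.
    mean_rectified = sum(abs(x) for x in residual) / n
    if mean_rectified == 0.0:
        return "noise"  # all-zero frame: there are no pulses to keep
    ratio = variance / mean_rectified ** 2
    return "pulse" if ratio > threshold else "noise"


# A residual with two isolated spikes has a large ratio.
spiky = [0.0] * 160
spiky[40], spiky[120] = 1.0, -1.0
print(choose_unvoiced_excitation(spiky))
```

A residual whose energy is concentrated in a few samples yields a large ratio, while a residual with evenly spread energy yields a ratio near 1. For zero-mean Gaussian noise the expected ratio is about π/2, which is why the placeholder threshold sits above that value.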

24 Claims

1. A processing system for the analysis and synthesis of human speech comprising:
- means for storing a plurality of speech frames each having a predetermined number of evenly spaced samples of instantaneous amplitudes of said speech;
  
  means for calculating a set of speech parameter signals defining a vocal tract for each speech frame;
  
  means for designating a first subset of said plurality of speech frames as voiced and a second subset of said plurality of speech frames as unvoiced;
  
  means for generating pitch type excitation information for each frame of said first subset of said plurality of speech frames;
  
  means for producing a plurality of other types of excitation information for each frame of said second subset of said plurality of speech frames;
  
  means responsive to said designating means designating each frame of said first subset of said plurality of speech frames for combining said pitch type excitation information and said set of said speech parameter signals;
  
  said combining means further comprises means responsive to said designating means designating each frame of said second subset of said plurality of speech frames for selecting one of said other types of excitation information and means for combining the selected one of said other types of excitation information with the set of said speech parameter signals; and
  
  means for communicating said combined excitation information including said pitch type excitation information and the set of said speech parameter signals for each frame of said first subset of said plurality of speech frames and said combined excitation information including the selected one of said other types of excitation information and the set of said speech parameter signals for each frame of said second subset of said plurality of speech frames.
- - 2. The system of claim 1 wherein said producing means comprises means for determining pulses from said speech samples for each frame of said second subset of said plurality of speech frames to provide pulse type excitation.
  - 3. The system of claim 2 wherein said determining means comprises means for calculating residual samples from said speech samples for each frame of said second subset of said plurality of speech frames; and
    - means for locating a subset of pulses of said residual samples having maximum amplitudes for each frame of said second subset of said plurality of speech frames.
  - 4. The system of claim 3 wherein said selecting means comprises means for calculating a variance of the residual samples for each frame of said second subset of said plurality of speech frames;
    - means for rectifying said residual samples;
      
      means for calculating the mean amplitude of the rectified residual samples;
      
      means for calculating a square of the mean amplitude of said rectified residual samples in each frame of said second subset of said plurality of speech frames;
      
      means for comparing the calculated variance of the residual to the calculated square of the mean amplitude of the rectified residual for each frame of said second subset of said plurality of speech frames; and
      
      means for designating said pulse type excitation information to be selected upon the comparison being greater than a predetermined threshold.
  - 5. The system of claim 3 wherein said selecting means comprises means for squaring each residual sample of each of said frames;
    - means for summing together all of the squared residual samples for each of said frames;
      
      means for multiplying said predetermined number of samples in a frame by the sum of said squared residual samples for each of said frames to generate a value;
      
      means for obtaining an absolute value for each of said residual samples in each of said frames;
      
      means for summing all of the absolute residual sample values for each of said frames; and
      
      means for squaring the summed absolute residual sample values for each of said frames to generate another value;
      
      means for comparing said value to said other value for each of said frames; and
      
      means for designating said pulse type excitation information to be selected upon said comparison being greater than a predetermined threshold.
  - 6. The system of claim 5 wherein said means for calculating said set of speech parameter signals comprises means for calculating a set of linear predictive coded parameters for each of said frames.
  - 7. The system of claim 6 wherein said means for generating said pitch type excitation information comprises:
    - a plurality of identical means each utilizing an individual predetermined portion of said speech samples of each of said frames for estimating an individual pitch value for each of said frames; and
      
      means responsive to each of said estimating means estimating each of said estimated individual pitch values for determining a final pitch value for each of said frames.
  - 8. The system of claim 7 wherein said final pitch value determining means comprises:
    - means for calculating said final pitch value from said estimated individual pitch values each received from an individual one of said estimating means for each of said frames; and
      
      means for constraining said final pitch value so that the calculated final pitch value for each of said frames is consistent with the calculated pitch values from previous ones of said frames to said each of said frames.
  - 9. The system of claim 5 further comprises means for receiving said communicated combined excitation information and said set of speech parameter signals for each of said frames;
    - means for synthesizing each frame of speech utilizing said set of speech parameter signals and said pitch excitation information upon said pitch excitation information being communicated; and
      
      said synthesizing means further utilizing said set of speech parameter signals and one of said plurality of other types of excitation information to synthesize each frame of speech utilizing said one of said other types of excitation information upon said other types of excitation information being communicated.
  - 10. The system of claim 9 wherein said synthesizing means further comprises means for generating an unvoiced type signal upon said other types of excitation information being communicated;
    - means for generating a pulse type signal upon said pulse type excitation information being communicated;
      
      means responsive to said unvoiced type signal and the absence of said pulse type signal for generating noise type excitation information; and
      
      means responsive to said pulse type signal for selecting said pulse type excitation information.
  - 12. The system of claim 1 wherein the means for forming said pitch excitation information comprises:
    - means for detecting the presence of said fundamental frequency in the samples of said frames;
      
      means for calculating said pitch in each of said frames; and
      
      means for forming said calculated pitch into said excitation information upon said detecting means determining the presence of said fundamental frequency.
  - 13. The system of claim 12 wherein said means for forming said excitation information from said other excitation source comprises means for determining pulses from said speech samples for each of said frames to provide the excitation information from said other excitation source.
  - 14. The system of claim 13 wherein said determining means comprises means for calculating residual samples from said speech samples for each of said frames; and
    - means for locating a subset of pulses of said residual samples having maximum amplitudes for each of said frames.
  - 15. The system of claim 14 wherein said means for forming said excitation information from said other source further comprises means for calculating a variance of said residual samples for each of said frames;
    - means for rectifying said residual samples;
      
      means for calculating the mean amplitude of the rectified residual samples;
      
      means for calculating a square of the mean amplitude of said rectified residual samples in each frame;
      
      means for comparing the calculated variance of the residual to the calculated square of the mean amplitude of the rectified residual for each of said frames; and
      
      means for selecting said excitation information from said other source to be said pulse type information upon the comparison being greater than a predetermined threshold.
  - 16. The system of claim 15 wherein said means for calculating said pitch in each of said frames comprises:
    - a plurality of identical means each utilizing an individual predetermined portion of said speech samples of each of said frames for estimating an individual pitch value for each of said frames; and
      
      means utilizing each of said estimated individual pitch values from each of said estimating means for determining a final pitch value for each of said frames.
  - 17. The system of claim 16 wherein said means for determining said final pitch value comprises:
    - means for calculating said final pitch value from said estimated individual pitch values each received from an individual one of said estimating means for each of said frames; and
      
      means for constraining said final pitch value so that the calculated pitch value for each of said frames is consistent with the calculated pitch values from previous ones of said frames to said each of said frames.
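
The selection tests recited in claims 4 and 5 can be related to each other with a short worked equation. This is a sketch under two assumptions that the claims do not state: the residual is treated as zero-mean, so its variance reduces to the mean of the squared samples, and the "comparison" is read as a ratio. The symbols $e_n$, $N$, $A$, $B$, $\sigma^2$, and $m$ are introduced here for illustration only.

```latex
% e_n: residual samples of one frame; N: number of samples in the frame.
% A, B: the two values formed in claim 5.
% sigma^2, m: the variance and mean rectified amplitude of claim 4.
\begin{align*}
A &= N \sum_{n=1}^{N} e_n^{2}, &
B &= \Bigl(\sum_{n=1}^{N} \lvert e_n \rvert\Bigr)^{2} \\
\sigma^{2} &= \frac{1}{N}\sum_{n=1}^{N} e_n^{2}
  \quad \text{(zero-mean assumption)}, &
m &= \frac{1}{N}\sum_{n=1}^{N} \lvert e_n \rvert \\
\frac{\sigma^{2}}{m^{2}}
  &= \frac{\tfrac{1}{N}\sum_{n} e_n^{2}}
          {\tfrac{1}{N^{2}}\bigl(\sum_{n} \lvert e_n \rvert\bigr)^{2}}
   = \frac{A}{B}
\end{align*}
```

Under those assumptions the two formulations test the same ratio, and the claim 5 form avoids the per-frame divisions by $N$.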

11. A processing system for the analysis and synthesis of human speech comprising:
- means for storing a plurality of speech frames each having a predetermined number of evenly spaced samples of instantaneous amplitudes of said speech;
  
  means for calculating a set of speech parameter signals defining a vocal tract for each speech frame;
  
  means for detecting speech resulting from a fundamental frequency and a noise-like source for each speech frame;
  
  means for forming pitch excitation information for each frame upon the frame containing said fundamental frequency;
  
  means for forming excitation information to indicate that noise excitation information is to be used to synthesize each of said frames upon speech of the frame resulting from said noise-like source in the human larynx;
  
  means for forming excitation information from another excitation source upon an absence of said fundamental frequency and said noise-like source; and
  
  means for combining the formed excitation information and the set of parameter signals of each frame for communication.
- - 18. The system of claim 11 wherein said means for calculating said set of speech parameter signals comprises means for calculating a set of linear predictive coded parameters for each of said frames.
  - 19. The system of claim 11 further comprises means for receiving the information communicated from said combining means for each of said frames;
    - means for synthesizing each frame of speech utilizing said set of speech parameter signals and said pitch excitation information upon said pitch excitation information being communicated;
      
      said synthesizing means comprising means for generating noise excitation information;
      
      said synthesizing means further utilizing said set of speech parameter signals and said generated noise excitation information to synthesize each frame of speech upon excitation information in said received information indicating the use of said noise excitation information; and
      
      said synthesizing means further utilizing said set of speech parameter signals and one of said plurality of other types of excitation information to synthesize each frame of speech utilizing said one of said other types of excitation information upon said other types of excitation information being communicated.
  - 20. The system of claim 19 wherein said synthesizing means further comprises means for generating an unvoiced type signal upon said other types of excitation information being communicated;
    - means for generating a pulse type signal upon said pulse type excitation information being communicated;
      
      means responsive to said unvoiced type signal and absence of said pulse type signal for generating noise type excitation information; and
      
      means responsive to said pulse type signal for selecting said pulse type excitation information.
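
The receiver-side selection in claims 19 and 20 can be sketched as below: pitch excitation when pitch information arrives, transmitted pulses when a pulse type signal is present, and locally generated noise when the frame is unvoiced and no pulse signal is present. This is an illustrative sketch only. The frame record layout (`pitch_period`, `pulses`), the function names, the unit-impulse pitch train, the Gaussian noise source, and the sign convention of the all-pole filter are all assumptions, not details from the claims.

```python
import random


def select_excitation(frame, frame_len=160, rng=None):
    """Build one frame of excitation from a received frame record.

    `frame` is a hypothetical dict: a voiced frame carries "pitch_period"
    (in samples); an unvoiced frame optionally carries "pulses", a list of
    (position, amplitude) pairs. These field names are illustrative only.
    """
    rng = rng or random.Random(0)
    if frame.get("pitch_period"):
        # Pitch information was communicated: unit impulses, one per period.
        period = frame["pitch_period"]
        return [1.0 if n % period == 0 else 0.0 for n in range(frame_len)]

    # No pitch information arrived, so the frame is treated as unvoiced.
    pulse_signal = bool(frame.get("pulses"))
    if pulse_signal:
        # Pulse type signal present: place the transmitted pulses.
        excitation = [0.0] * frame_len
        for position, amplitude in frame["pulses"]:
            excitation[position] = amplitude
        return excitation

    # Unvoiced and no pulse signal: generate noise locally.
    return [rng.gauss(0.0, 1.0) for _ in range(frame_len)]


def synthesize(excitation, lpc):
    """All-pole filter s[n] = e[n] + sum_k lpc[k-1] * s[n-k], zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        feedback = sum(a * out[n - k]
                       for k, a in enumerate(lpc, start=1) if n - k >= 0)
        out.append(e + feedback)
    return out


# One unvoiced frame carrying a single pulse, shaped by a one-coefficient filter.
speech = synthesize(select_excitation({"pulses": [(0, 1.0)]}, frame_len=4), [0.5])
print(speech)
```

Keeping the excitation choice separate from the filter mirrors the claim structure, where the same set of speech parameter signals is used regardless of which excitation type was communicated.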

21. A method for analyzing and synthesizing human speech with a system comprising a quantizer for converting the speech into frames of digital samples and a digital signal processor responsive to a plurality of program instructions to analyze and synthesize the speech, said method comprising the steps of:
- storing a plurality of speech frames each having a predetermined number of evenly spaced samples of instantaneous amplitudes of said speech;
  
  calculating a set of speech parameter signals defining a vocal tract for each speech frame;
  
  designating a first subset of said plurality of speech frames as voiced and a second subset of said plurality of speech frames as unvoiced;
  
  generating pitch type excitation information for each frame of said first subset of said plurality of speech frames;
  
  producing a plurality of other types of excitation information for each frame of said second subset of said plurality of speech frames;
  
  combining said pitch type excitation information and said set of speech parameter signals for each frame of said first subset of said plurality of speech frames designated as voiced;
  
  selecting one of said other types of excitation for each frame of said second subset of said plurality of speech frames;
  
  combining the selected one of said other types of excitation information with the set of said speech parameters for each frame of said second subset of said plurality of speech frames; and
  
  communicating said combined excitation information including said pitch-type excitation information and the set of said speech parameter signals for each frame of said first subset of said plurality of speech frames and said combined excitation information including the selected one of said other types of excitation information and the set of said speech parameter signals for each frame of said second subset of said plurality of speech frames.
- - 22. The method of claim 21 wherein said producing step comprises the steps of calculating residual samples from said speech samples for each frame of said second subset of said plurality of speech frames; and
    - determining pulses from said residual samples for each frame of said second subset of said plurality of speech frames to provide pulse type excitation.
  - 23. The method of claim 22 wherein said determining step comprises the step of locating a subset of pulses of said residual samples having maximum amplitudes for each frame of said second subset of said plurality of speech frames.
  - 24. The method of claim 23 wherein said selecting step comprises the step of calculating a variance of the residual samples for each frame of said second subset of said plurality of speech frames;
    - rectifying said residual samples;
      
      calculating the mean amplitude of the rectified residual samples;
      
      calculating a square of the mean amplitude of the rectified residual samples in each frame of said second subset of said plurality of speech frames;
      
      comparing the calculated variance and the calculated square of the mean amplitude for each frame of said second subset of said plurality of speech frames; and
      
      designating said pulse type information to be selected upon the comparison being greater than a predetermined threshold.
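
The steps of claims 22 and 23 (calculate residual samples, then locate a subset of pulses having maximum amplitudes) can be sketched as follows. This is a simplified illustration under stated assumptions: the residual is formed with a plain linear predictor whose coefficient sign convention is assumed, samples before the frame are taken as zero, and pulses are ranked by magnitude alone. The abstract also mentions the spacing of local maxima, which this sketch does not model. The function names and the `count` parameter are hypothetical.

```python
def lpc_residual(samples, lpc):
    """Residual e[n] = s[n] - sum_k lpc[k-1] * s[n-k].

    Samples before the start of the frame are treated as zero (an
    assumption made to keep the example self-contained).
    """
    residual = []
    for n, s in enumerate(samples):
        prediction = sum(a * samples[n - k]
                         for k, a in enumerate(lpc, start=1) if n - k >= 0)
        residual.append(s - prediction)
    return residual


def locate_pulses(residual, count=4):
    """Return (position, amplitude) pairs for the `count` residual samples
    of largest magnitude, listed in time order."""
    ranked = sorted(range(len(residual)),
                    key=lambda n: abs(residual[n]), reverse=True)
    return [(n, residual[n]) for n in sorted(ranked[:count])]


# A decaying sequence predicted exactly by one coefficient leaves a single pulse.
frame = [1.0, 0.5, 0.25, 0.125]
print(locate_pulses(lpc_residual(frame, [0.5]), count=1))
```

The (position, amplitude) pairs returned here are one plausible form for the pulse type excitation information that is combined with the speech parameters for an unvoiced frame.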

Current Assignee: American Telephone & Telegraph Company (AT&T, Inc.), Bell Telephone Laboratories, Inc. (Nokia Corporation)
Original Assignee: American Telephone & Telegraph Company (AT&T, Inc.), AT&T, Inc.
Inventors: Thomson, David L.; Prezas, Dimitrios P.
Primary Examiner(s): Harkcom, Gary V.
Assistant Examiner(s): Merecki, John A.
Application Number: US06/770,631
Time in Patent Office: 1,581 Days
Field of Search: 381/36-41, 381/49, 381/29-35, 381/51-53, 369/513.5
US Class Current: 704/223
CPC Class Codes: G10L 19/08 Determination or coding of ...
