Method and apparatus for speech synthesis with efficient spectral smoothing

US 6,253,182 B1
Filed: 11/24/1998
Issued: 06/26/2001
Est. Priority Date: 11/24/1998
Status: Expired due to Term

2 Assignments

0 Petitions

Abstract

The present invention provides a method for synthesizing speech by modifying the prosody of individual components of a training speech signal and then combining the modified speech segments. The method includes selecting an input speech segment and identifying an output prosody. The prosody of the input speech segment is then changed by independently changing the prosody of a voiced component and an unvoiced component of the input speech signal. These changes produce an output voiced component and an output unvoiced component that are combined to produce an output speech segment. The output speech segment is then combined with other speech segments to form synthesized speech.
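The data flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented method: `MixedSegment`, `resample`, `change_prosody`, and `synthesize` are hypothetical names, the two components are assumed to be already separated, and the nearest-neighbour `resample` is a crude stand-in for a duration change (the claims instead modify a frequency-domain representation).

```python
from dataclasses import dataclass
from typing import List

Samples = List[float]


@dataclass
class MixedSegment:
    """Hypothetical container: a mixed portion already split into two components."""
    voiced: Samples
    unvoiced: Samples


def resample(samples: Samples, factor: float) -> Samples:
    """Placeholder duration change: nearest-neighbour resampling to a new length."""
    n_out = max(1, round(len(samples) * factor))
    last = len(samples) - 1
    return [samples[min(last, int(i / factor))] for i in range(n_out)]


def change_prosody(seg: MixedSegment, duration_factor: float) -> Samples:
    # Each component is modified on its own path.
    out_voiced = resample(seg.voiced, duration_factor)
    out_unvoiced = resample(seg.unvoiced, duration_factor)
    # The two output components are combined into the output mixed portion.
    return [v + u for v, u in zip(out_voiced, out_unvoiced)]


def synthesize(segments: List[MixedSegment], factors: List[float]) -> Samples:
    speech: Samples = []
    for seg, factor in zip(segments, factors):
        # Output segments are joined to form the synthesized speech.
        speech.extend(change_prosody(seg, factor))
    return speech


segs = [MixedSegment([1.0, 2.0], [0.5, 0.5]), MixedSegment([3.0], [0.0])]
speech = synthesize(segs, [2.0, 1.0])
```

Here the first segment is stretched to twice its duration and the second is left unchanged before the two are concatenated.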

53 Citations

20 Claims

1. A method for synthesizing speech from input speech segments, at least one input speech segment having an original prosody and having a mixed portion with a voiced component and an unvoiced component, the method comprising:
- selecting an input speech segment;
  
  identifying an output prosody;
  
  changing the original prosody of the selected input speech segment to produce an output speech segment so that the prosody of the output speech segment matches the output prosody through steps comprising:
  
  changing the prosody of a voiced component of a mixed portion of the input speech segment to produce an output voiced component;
  
  changing the prosody of an unvoiced component of the mixed portion of the input speech segment to produce an output unvoiced component by generating a frequency-domain representation of the unvoiced component directly from the input speech segment and changing the frequency-domain representation to change the prosody of the unvoiced component;
  
  combining the output voiced component and the output unvoiced component to produce an output mixed portion for an output speech segment; and
  
  combining the output speech segment with other speech segments to form synthesized speech.
2. The method of claim 1 wherein changing the prosody of the voiced component comprises generating a frequency-domain representation of the voiced component and changing the frequency-domain representation to change the prosody of the voiced component.
3. The method of claim 2 wherein generating the frequency-domain representation comprises generating original sets of spectral values with one set of values for each of a plurality of input time marks, each set of spectral values describing the spectral content of a segment of the voiced component that extends along a period of time that includes the input time mark.
4. The method of claim 3 wherein changing the prosody of the voiced component further comprises:
5. The method of claim 4 wherein changing the prosody of the voiced component further comprises, before creating the set of descriptor functions, time shifting at least one input mark and its associated original set of spectral values such that the duration of a portion of the output voiced component is different than the duration of a corresponding portion of the voiced component of the input speech segment.
6. The method of claim 5 wherein creating the descriptor functions comprises interpolating between a plurality of spectral values.
7. The method of claim 6 wherein interpolating comprises filtering a plurality of spectral values over time such that the amount by which the spectral values can change over time is limited.
8. The method of claim 4 wherein the input speech segment comprises at least two speech units.
9. The method of claim 8 wherein creating the descriptor functions comprises filtering a plurality of spectral values over time such that the amount by which the spectral values can change between speech units is limited.
10. The method of claim 1 wherein generating the frequency-domain representation comprises generating original sets of spectral values with one set of values for each of a plurality of input time marks, each set of spectral values describing the magnitudes of a set of discrete frequencies that contribute to the content of a segment of the unvoiced component that extends along a period of time that includes the input time mark.
11. The method of claim 10 wherein changing the prosody of the unvoiced component further comprises:
- creating a set of descriptor functions based on the original sets of spectral values, each descriptor function describing the magnitude of a respective frequency's contribution to the output speech signal over time;
  
  identifying a plurality of output time marks different than the plurality of input time marks; and
  
  determining output sets of magnitudes based on the output time marks and the descriptor functions.
12. The method of claim 1 wherein changing the prosody of the unvoiced component further comprises adding spectral phases to the output sets of magnitudes to produce the output unvoiced component.
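
The steps recited in claims 10 through 12 might be sketched as below. This is a minimal illustration under assumptions, not the patent's implementation: the descriptor functions are taken to be piecewise-linear interpolants (claim 6 mentions interpolation, but the exact form is not given here), the phases are drawn at random (the claims only say that spectral phases are added), and all function names and the toy time marks are hypothetical.

```python
import cmath
import math
import random
from typing import List, Sequence


def descriptor_value(times: Sequence[float], values: Sequence[float], t: float) -> float:
    """Evaluate a piecewise-linear descriptor function at time t, clamped at both ends."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    for k in range(1, len(times)):
        if t <= times[k]:
            w = (t - times[k - 1]) / (times[k] - times[k - 1])
            return (1 - w) * values[k - 1] + w * values[k]
    return values[-1]


def output_magnitudes(in_marks: Sequence[float],
                      in_mags: Sequence[Sequence[float]],
                      out_marks: Sequence[float]) -> List[List[float]]:
    n_bins = len(in_mags[0])
    # One descriptor function per discrete frequency: that frequency's magnitude over time.
    tracks = [[frame[b] for frame in in_mags] for b in range(n_bins)]
    # Sample every descriptor function at each output time mark.
    return [[descriptor_value(in_marks, tracks[b], t) for b in range(n_bins)]
            for t in out_marks]


def add_phases(mags: Sequence[Sequence[float]], rng: random.Random) -> List[List[complex]]:
    # Assumption: uniformly random phases; the claims do not specify how phases are chosen.
    return [[cmath.rect(m, rng.uniform(-math.pi, math.pi)) for m in frame] for frame in mags]


in_marks = [0.0, 10.0, 20.0]                      # input time marks (toy values)
in_mags = [[1.0, 0.0], [3.0, 2.0], [5.0, 2.0]]    # one set of magnitudes per input mark
out_marks = [0.0, 5.0, 15.0, 20.0]                # output time marks, different from the input
out_mags = output_magnitudes(in_marks, in_mags, out_marks)
spectra = add_phases(out_mags, random.Random(0))
```

Because the output magnitudes are read off continuous functions of time, the number and spacing of output time marks can differ freely from the input marks, which is what allows duration and pitch-period spacing to change.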

13. A method for synthesizing speech based on an input text comprising:
- converting a time-domain training speech signal into a set of frequency-domain values;
  
  quantizing the frequency-domain values into a set of codewords;
  
  storing the codewords in a component database;
  
  retrieving codewords from the component database based on the input text;
  
  filtering the codewords directly to produce a descriptor function, the filtering such that the rate of change of the descriptor function is limited;
  
  identifying an output set of frequency-domain values based on the descriptor function; and
  
  converting the frequency-domain values to time-domain values representing portions of the synthesized speech.
14. The method of claim 13 wherein a single codeword represents multiple frequency-domain values and wherein quantizing the frequency-domain values comprises selecting a codeword from a set of codewords based on which codeword best approximates the multiple frequency-domain values.
15. The method of claim 13 wherein filtering the codewords comprises filtering across two speech units in the synthesized speech.
16. The method of claim 13 wherein filtering the codewords and identifying an output set of frequency-domain values based on the descriptor function reduces errors created by quantizing the frequency-domain values into a set of codewords.
17. The method of claim 13 wherein identifying an output set of frequency-domain values comprises identifying an output prosody for the synthesized speech and determining the value of the descriptor function at time marks associated with the output prosody.
18. The method of claim 17 wherein identifying an output prosody comprises identifying a prosody that is different than a prosody of the training speech signal.
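
The quantizing and filtering steps of claims 13 through 16 can be illustrated with a small sketch. It rests on assumptions that go beyond the claim text: nearest-codeword selection uses squared Euclidean distance, the rate-limiting filter is a three-frame moving average, and the codebook and frame values are toy numbers. Sampling the descriptor function at output time marks and converting back to the time domain are omitted.

```python
from typing import List, Sequence

Vector = List[float]


def quantize(frame: Sequence[float], codebook: Sequence[Vector]) -> int:
    """Return the index of the codeword closest to the frame (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((f - c) ** 2 for f, c in zip(frame, codebook[i])))


def smooth(frames: Sequence[Vector], half_width: int = 1) -> List[Vector]:
    """Moving-average filter over time, applied to each frequency-domain value separately."""
    out = []
    for t in range(len(frames)):
        lo, hi = max(0, t - half_width), min(len(frames), t + half_width + 1)
        window = frames[lo:hi]
        out.append([sum(f[b] for f in window) / len(window)
                    for b in range(len(frames[0]))])
    return out


codebook = [[0.0, 0.0], [4.0, 2.0], [8.0, 2.0]]                   # toy codebook
training = [[0.3, -0.1], [0.2, 0.1], [7.6, 2.2], [8.1, 1.9]]      # toy frequency-domain frames

indices = [quantize(f, codebook) for f in training]   # what would be stored in the database
retrieved = [codebook[i] for i in indices]            # codewords retrieved for synthesis
smoothed = smooth(retrieved)                          # filtered directly, without re-analysis


def max_step(frames: Sequence[Vector]) -> float:
    """Largest frame-to-frame change in the first frequency-domain value."""
    return max(abs(a[0] - b[0]) for a, b in zip(frames, frames[1:]))
```

In this toy data the retrieved codewords jump abruptly between the second and third frames, as they might at a boundary between two speech units; after filtering, the same transition is spread over several frames, so the largest frame-to-frame step is smaller.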

19. A computer-readable medium having computer executable instructions for synthesizing speech from input speech segments, at least one input speech segment having an original prosody and having a mixed portion with a voiced component and an unvoiced component, the method comprising:
- selecting an input speech segment;
  
  identifying an output prosody;
  
  changing the original prosody of the selected input speech segment to produce an output speech segment so that the prosody of the output speech segment matches the output prosody through steps comprising:
  
  changing the prosody of a voiced component of a mixed portion of the input speech segment to produce an output voiced component;
  
  changing the prosody of an unvoiced component of the mixed portion of the input speech segment to produce an output unvoiced component by generating a frequency-domain representation of the unvoiced component directly from the input speech segment and changing the frequency-domain representation to change the prosody of the unvoiced component;
  
  combining the output voiced component and the output unvoiced component to produce an output mixed portion for an output speech segment; and
  
  combining the output speech segment with other speech segments to form synthesized speech.

20. A computer-readable medium having computer-executable instructions for synthesizing speech based on an input text according to a method comprising:
- retrieving codewords from a component database based on the input text, the codewords representing frequency-domain values indicative of a training speech signal;
  
  filtering the codewords directly to produce a descriptor function, the filtering such that the rate of change of the descriptor function is limited;
  
  identifying an output set of frequency-domain values based on the descriptor function; and
  
  converting the frequency-domain values to time-domain values representing portions of the synthesized speech.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Acero, Alejandro
Primary Examiner(s): Tsang, Fan
Assistant Examiner(s): SAX, ROBERT L
Application Number: US09/198,661
Time in Patent Office: 945 Days
Field of Search: 704/268, 704/258
US Class Current: 704/268
CPC Class Codes: G10L 13/10 Prosody rules derived from ...
