Method and apparatus for speech synthesis with efficient spectral smoothing
Abstract
The present invention provides a method for synthesizing speech by modifying the prosody of individual components of a training speech signal and then combining the modified speech segments. The method includes selecting an input speech segment and identifying an output prosody. The prosody of the input speech segment is then changed by independently changing the prosody of a voiced component and an unvoiced component of the input speech signal. These changes produce an output voiced component and an output unvoiced component that are combined to produce an output speech segment. The output speech segment is then combined with other speech segments to form synthesized speech.
20 Claims
1. A method for synthesizing speech from input speech segments, at least one input speech segment having an original prosody and having a mixed portion with a voiced component and an unvoiced component, the method comprising:
selecting an input speech segment;
identifying an output prosody;
changing the original prosody of the selected input speech segment to produce an output speech segment so that the prosody of the output speech segment matches the output prosody through steps comprising:
changing the prosody of a voiced component of a mixed portion of the input speech segment to produce an output voiced component;
changing the prosody of an unvoiced component of the mixed portion of the input speech segment to produce an output unvoiced component by generating a frequency-domain representation of the unvoiced component directly from the input speech segment and changing the frequency-domain representation to change the prosody of the unvoiced component;
combining the output voiced component and the output unvoiced component to produce an output mixed portion for an output speech segment; and
combining the output speech segment with other speech segments to form synthesized speech.
creating a set of descriptor functions based on the original sets of spectral values, each descriptor function describing a respective frequency's contribution to the output speech signal over time;
identifying a plurality of output time marks different than the plurality of input time marks; and
determining output sets of spectral values based on the output time marks and the descriptor functions.
5. The method of claim 4 wherein changing the prosody of the voiced component further comprises, before creating the set of descriptor functions, time shifting at least one input mark and its associated original set of spectral values such that the duration of a portion of the output voiced component is different than the duration of a corresponding portion of the voiced component of the input speech segment.
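The time shifting in claim 5 can be pictured with a minimal sketch, assuming the simplest possible case: every gap between neighbouring input marks is scaled by one factor. The function name `stretch_marks` and the uniform `factor` are hypothetical; the claim itself only requires that at least one mark (with its set of spectral values) be shifted so that a portion changes duration.

```python
import numpy as np

def stretch_marks(marks, factor):
    """Scale the spacing between time marks by `factor` (assumed uniform).

    marks:  increasing time marks, shape (T,).
    factor: hypothetical duration scale; > 1 lengthens, < 1 shortens.
    The spectral set attached to each mark moves with it unchanged.
    """
    marks = np.asarray(marks, dtype=float)
    gaps = np.diff(marks) * factor
    # Keep the first mark fixed and rebuild the rest from the scaled gaps.
    return np.concatenate(([marks[0]], marks[0] + np.cumsum(gaps)))
```

Because only the marks move, the descriptor functions built afterwards are stretched or compressed in time without altering the spectral values themselves.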
6. The method of claim 5 wherein creating the set of descriptor functions comprises interpolating between a plurality of spectral values.
7. The method of claim 6 wherein interpolating comprises filtering a plurality of spectral values over time such that the amount by which the spectral values can change over time is limited.
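One way to read the filtering in claim 7 is as a low-pass filter applied along the time axis of each frequency's values. The sketch below assumes a one-pole (exponential) filter; the name `smooth_over_time` and the coefficient `alpha` are illustrative choices, not taken from the patent, which only requires that the change over time be limited.

```python
import numpy as np

def smooth_over_time(spectra, alpha=0.5):
    """One-pole low-pass filter along time (axis 0) for every frequency.

    spectra: array of shape (num_time_marks, num_frequencies).
    alpha:   hypothetical smoothing coefficient in (0, 1]; smaller = smoother.
    """
    spectra = np.asarray(spectra, dtype=float)
    out = np.empty_like(spectra)
    out[0] = spectra[0]
    for t in range(1, len(spectra)):
        # Move only a fraction `alpha` of the way toward the new frame,
        # which bounds how fast each spectral value can change.
        out[t] = out[t - 1] + alpha * (spectra[t] - out[t - 1])
    return out
```

With `alpha=0.5`, an abrupt jump between two frames is spread over several frames instead of appearing in a single step.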
8. The method of claim 4 wherein the input speech segment comprises at least two speech units.
9. The method of claim 8 wherein creating the set of descriptor functions comprises filtering a plurality of spectral values over time such that the amount by which the spectral values can change between speech units is limited.
10. The method of claim 1 wherein generating the frequency-domain representation comprises generating original sets of spectral values with one set of values for each of a plurality of input time marks, each set of spectral values describing the magnitudes of a set of discrete frequencies that contribute to the content of a segment of the unvoiced component that extends along a period of time that includes the input time mark.
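The representation described in claim 10 resembles a short-time Fourier analysis evaluated at chosen time marks. The following sketch assumes a Hann window and a fixed frame length centred on each mark; `magnitude_sets` and `frame_len` are hypothetical names, and the patent does not specify the window or frame size.

```python
import numpy as np

def magnitude_sets(signal, time_marks, frame_len=256):
    """Return one set of spectral magnitudes per input time mark.

    signal:     1-D array of samples (e.g. the unvoiced component).
    time_marks: sample indices; each analysis frame is centred on its mark.
    frame_len:  hypothetical analysis length in samples (assumed even).
    """
    signal = np.asarray(signal, dtype=float)
    half = frame_len // 2
    # Zero-pad so frames near the edges are still full length.
    padded = np.pad(signal, (half, half))
    window = np.hanning(frame_len)
    sets = []
    for mark in time_marks:
        # After padding, this slice covers samples mark-half .. mark+half-1.
        frame = padded[mark:mark + frame_len] * window
        sets.append(np.abs(np.fft.rfft(frame)))
    return np.array(sets)
```

Each row of the result holds the magnitudes of `frame_len // 2 + 1` discrete frequencies for the stretch of signal surrounding one time mark.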
11. The method of claim 10 wherein changing the prosody of the unvoiced component further comprises:
creating a set of descriptor functions based on the original sets of spectral values, each descriptor function describing the magnitude of a respective frequency's contribution to the output speech signal over time;
identifying a plurality of output time marks different than the plurality of input time marks; and
determining output sets of magnitudes based on the output time marks and the descriptor functions.
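The three steps of claim 11 can be sketched as follows, assuming each descriptor function is simply a piecewise-linear curve through one frequency's magnitudes at the input time marks. The function name `resample_magnitudes` is hypothetical, and linear interpolation is an assumption made for brevity.

```python
import numpy as np

def resample_magnitudes(input_marks, magnitudes, output_marks):
    """Evaluate per-frequency descriptor functions at new time marks.

    input_marks:  increasing times, shape (T_in,).
    magnitudes:   shape (T_in, F), one set of magnitudes per input mark.
    output_marks: times at which output sets are wanted, shape (T_out,).
    """
    magnitudes = np.asarray(magnitudes, dtype=float)
    out = np.empty((len(output_marks), magnitudes.shape[1]))
    for k in range(magnitudes.shape[1]):
        # Descriptor function for frequency k: its magnitude as a function
        # of time, here piecewise linear between input marks (an assumption).
        out[:, k] = np.interp(output_marks, input_marks, magnitudes[:, k])
    return out
```

For example, with input marks at 0, 10 and 20 and output marks at 5 and 15, each output set falls halfway between its two neighbouring input sets.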
12. The method of claim 1 wherein changing the prosody of the unvoiced component further comprises adding spectral phases to the output sets of magnitudes to produce the output unvoiced component.
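Claim 12 does not say here where the spectral phases come from. As a hedged illustration, the sketch below assumes random phases, a common choice for noise-like signals, and converts one set of magnitudes into a time-domain frame. The name `add_random_phases` and the `seed` parameter are hypothetical.

```python
import numpy as np

def add_random_phases(magnitudes, seed=None):
    """Turn one set of magnitudes (rfft layout) into a time-domain frame.

    magnitudes: shape (N // 2 + 1,) for an even frame length N.
    seed:       optional seed so the random phases are reproducible.
    """
    magnitudes = np.asarray(magnitudes, dtype=float)
    rng = np.random.default_rng(seed)
    phases = rng.uniform(-np.pi, np.pi, size=magnitudes.shape)
    # The DC and Nyquist bins of a real signal must be real-valued.
    phases[0] = 0.0
    phases[-1] = 0.0
    spectrum = magnitudes * np.exp(1j * phases)
    return np.fft.irfft(spectrum)
```

The resulting frame has exactly the requested magnitude spectrum; only the phases, and therefore the waveform shape, vary with the seed.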
13. A method for synthesizing speech based on an input text comprising:
converting a time-domain training speech signal into a set of frequency-domain values;
quantizing the frequency-domain values into a set of codewords;
storing the codewords in a component database;
retrieving codewords from the component database based on the input text;
filtering the codewords directly to produce a descriptor function, the filtering such that the rate of change of the descriptor function is limited;
identifying an output set of frequency-domain values based on the descriptor function; and
converting the frequency-domain values to time-domain values representing portions of the synthesized speech.
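The quantizing and filtering steps of claim 13 can be sketched under two assumptions: quantization picks the nearest codeword by Euclidean distance, and the filter is a short moving average over time. The names `quantize`, `smooth_codewords`, and `width` are hypothetical, and the text-driven retrieval and the final conversion to the time domain are left out.

```python
import numpy as np

def quantize(frames, codebook):
    """Return, for each frame, the index of the nearest codeword."""
    frames = np.asarray(frames, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every frame to every codeword.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def smooth_codewords(indices, codebook, width=3):
    """Look up codewords and moving-average them over time.

    width: hypothetical odd filter length in frames.
    """
    values = np.asarray(codebook, dtype=float)[indices]  # shape (T, F)
    kernel = np.ones(width) / width
    # Repeat the edge frames so the output keeps the same number of frames.
    pad = width // 2
    padded = np.pad(values, ((pad, pad), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid")
         for k in range(values.shape[1])],
        axis=1,
    )
```

Filtering the retrieved codewords directly means an abrupt switch from one codeword to another becomes a gradual transition across neighbouring frames.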
19. A computer-readable medium having computer executable instructions for synthesizing speech from input speech segments, at least one input speech segment having an original prosody and having a mixed portion with a voiced component and an unvoiced component, the method comprising:
selecting an input speech segment;
identifying an output prosody;
changing the original prosody of the selected input speech segment to produce an output speech segment so that the prosody of the output speech segment matches the output prosody through steps comprising:
changing the prosody of a voiced component of a mixed portion of the input speech segment to produce an output voiced component;
changing the prosody of an unvoiced component of the mixed portion of the input speech segment to produce an output unvoiced component by generating a frequency-domain representation of the unvoiced component directly from the input speech segment and changing the frequency-domain representation to change the prosody of the unvoiced component;
combining the output voiced component and the output unvoiced component to produce an output mixed portion for an output speech segment; and
combining the output speech segment with other speech segments to form synthesized speech.
20. A computer-readable medium having computer-executable instructions for synthesizing speech based on an input text according to a method comprising:
retrieving codewords from a component database based on the input text, the codewords representing frequency-domain values indicative of a training speech signal;
filtering the codewords directly to produce a descriptor function, the filtering such that the rate of change of the descriptor function is limited;
identifying an output set of frequency-domain values based on the descriptor function; and
converting the frequency-domain values to time-domain values representing portions of the synthesized speech.
Specification