Converting text-to-speech and adjusting corpus
Abstract
The present invention provides a method and apparatus for text-to-speech (TTS) conversion, and a method and apparatus for adjusting a corpus. The text-to-speech method comprises: a text analysis step for parsing the text to obtain descriptive prosody annotations of the text based on a TTS model generated from a first corpus; a prosody parameter prediction step for predicting the prosody parameters of the text according to the result of the text analysis step; and a speech synthesis step for synthesizing speech of said text based on said prosody parameters of the text. The descriptive prosody annotations include a prosody structure for the text, and this prosody structure is adjusted according to a target speech speed for the synthesized speech. Because the prosody structure is adjusted to the target speech speed, the synthesized speech will have improved quality.
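The order of the steps in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `Annotated`, `analyze_text`, `adjust_structure`, `predict_parameters`, `synthesize` and `text_to_speech` are hypothetical, commas stand in for the TTS model's phrase boundaries, and the "synthesized speech" is only a readable string. The point is solely that the prosody structure is adjusted after text analysis and before prosody parameters are predicted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotated:
    phrases: List[List[str]]  # prosody phrases, each a list of prosody words
    speed: float              # speech speed the structure is associated with


def analyze_text(text: str, initial_speed: float = 1.0) -> Annotated:
    # Placeholder for model-based parsing: commas mark phrase boundaries.
    phrases = [chunk.split() for chunk in text.split(",") if chunk.strip()]
    return Annotated(phrases, initial_speed)


def adjust_structure(a: Annotated, target_speed: float) -> Annotated:
    # Placeholder rule (assumption): for a faster target speed, drop all
    # internal phrase boundaries; otherwise keep the boundaries as they are.
    if target_speed > a.speed:
        merged = [word for phrase in a.phrases for word in phrase]
        return Annotated([merged], target_speed)
    return Annotated(a.phrases, target_speed)


def predict_parameters(a: Annotated) -> List[dict]:
    # Placeholder prosody parameter: one pause per phrase, shorter when faster.
    return [{"words": p, "pause_ms": round(200 / a.speed)} for p in a.phrases]


def synthesize(params: List[dict]) -> str:
    # Placeholder "synthesis": render words and pauses as text.
    return " ".join(
        f"{' '.join(p['words'])} <pause {p['pause_ms']}ms>" for p in params
    )


def text_to_speech(text: str, target_speed: float) -> str:
    annotated = analyze_text(text)                       # text analysis
    adjusted = adjust_structure(annotated, target_speed)  # structure adjustment
    return synthesize(predict_parameters(adjusted))      # prediction, synthesis
```

With these placeholder rules, `text_to_speech("the quick fox, jumps over", 2.0)` returns `"the quick fox jumps over <pause 100ms>"`, while a target speed of 0.5 keeps both phrases and lengthens each pause to 400 ms.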
34 Claims
1. A method for text to speech conversion, comprising:
- parsing, with at least one processor, input text to obtain descriptive prosody annotations of the text based, at least in part, on a text-to-speech model generated from a first corpus, wherein the descriptive prosody annotations include a prosody structure of the text, wherein the prosody structure of the text is associated with an initial speech speed, and wherein said prosody structure includes information selected from the group consisting of prosody word information, prosody phrase information, and intonation phrase information;
- adjusting the prosody structure of the text based, at least in part, on a target speech speed for speech to be synthesized corresponding to the input text, wherein the target speech speed is different than the initial speech speed;
- determining at least one prosody parameter of the text based, at least in part, on the adjusted prosody structure of the text; and
- synthesizing speech corresponding to said input text based, at least in part, on said at least one prosody parameter of the text.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
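Claim 1 requires that the prosody structure be adjusted for a target speech speed but does not, in the text above, say how. As a hedged illustration only, the sketch below assumes one plausible rule: the longest acceptable prosody phrase scales linearly with the speed ratio, so faster speech merges neighbouring phrases and slower speech splits long ones. The function `adjust_phrases` and the parameter `base_max_words` are hypothetical and are not taken from the patent.

```python
from typing import List


def adjust_phrases(
    phrases: List[List[str]],
    initial_speed: float,
    target_speed: float,
    base_max_words: int = 4,
) -> List[List[str]]:
    """Re-segment prosody phrases for a new speech speed (illustrative rule)."""
    # Assumption: the phrase-length limit grows linearly with the speed ratio.
    limit = max(1, round(base_max_words * target_speed / initial_speed))
    faster = target_speed > initial_speed
    adjusted: List[List[str]] = []
    for phrase in phrases:
        if faster and adjusted and len(adjusted[-1]) + len(phrase) <= limit:
            # Faster speech: remove the boundary between two short phrases.
            adjusted[-1] = adjusted[-1] + phrase
        else:
            # Otherwise keep the phrase, splitting it if it exceeds the limit.
            for start in range(0, len(phrase), limit):
                adjusted.append(phrase[start:start + limit])
    return adjusted
```

For the phrases `[["a", "b"], ["c", "d"], ["e", "f", "g"]]` at an initial speed of 1.0, a target speed of 1.5 gives `[["a", "b", "c", "d"], ["e", "f", "g"]]`, and a target speed of 0.5 gives `[["a", "b"], ["c", "d"], ["e", "f"], ["g"]]`. A real system would presumably decide boundaries from the model's predictions rather than from word counts alone.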
15. An apparatus for text to speech conversion, comprising:
- text analysis means for parsing input text to obtain descriptive prosody annotations of the text based on a text-to-speech model generated from a first corpus, wherein said descriptive prosody annotations of the text include a prosody structure of the text, wherein the prosody structure of the text is associated with an initial speech speed, and wherein said prosody structure includes information selected from the group consisting of prosody word information, prosody phrase information, and intonation phrase information;
- prosody parameter prediction means for predicting at least one prosody parameter of the text based, at least in part, on the parsed text;
- speech synthesis means for synthesizing speech corresponding to said input text based, at least in part, on said at least one prosody parameter of the text; and
- prosody structure adjusting means for adjusting the prosody structure of the text based, at least in part, on a target speech speed for the synthesized speech, wherein the target speech speed is different than the initial speech speed.
Dependent claims: 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26
27. A method for adjusting a first corpus used for text-to-speech conversion, said method comprising:
- building a decision tree for prosody structure prediction based on the first corpus, wherein the first corpus is associated with an initial speech speed;
- setting a target speech speed for an adjusted corpus, wherein the target speech speed is different than the initial speech speed;
- building a relationship between a distribution for prosody phrase length and the initial speech speed based, at least in part, on said decision tree; and
- generating, with at least one processor, the adjusted corpus by adjusting said distribution for prosody phrase length of the first corpus according to the target speech speed based, at least in part, on said decision tree and said relationship.
Dependent claims: 28
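The corpus adjustment of claim 27 can be pictured with a small Python sketch, under assumptions that the claim itself does not state. Here each word carries a boundary probability of the kind a decision tree for prosody structure prediction might output (the tree itself is not built), the relationship between phrase length and speed is assumed to be linear in the mean phrase length, and the corpus is adjusted by choosing the boundary threshold whose resulting distribution best matches the target mean. The names `segment`, `length_distribution`, `mean_length` and `adjust_corpus` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# One sentence: (word, probability of a prosody phrase boundary after it).
# The probabilities stand in for the output of a decision tree.
Sentence = List[Tuple[str, float]]


def segment(sentence: Sentence, threshold: float) -> List[List[str]]:
    """Cut a sentence into prosody phrases at boundaries above the threshold."""
    phrases, current = [], []
    for word, p_boundary in sentence:
        current.append(word)
        if p_boundary >= threshold:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases


def length_distribution(corpus: List[Sentence], threshold: float) -> Counter:
    """Histogram of prosody phrase lengths (in words) over the corpus."""
    return Counter(len(p) for s in corpus for p in segment(s, threshold))


def mean_length(dist: Counter) -> float:
    return sum(n * count for n, count in dist.items()) / sum(dist.values())


def adjust_corpus(corpus, initial_speed, target_speed, base_threshold=0.5):
    # Assumed relationship: mean phrase length is proportional to speed.
    base_mean = mean_length(length_distribution(corpus, base_threshold))
    target_mean = base_mean * target_speed / initial_speed
    # Pick the threshold whose length distribution is closest to the target.
    candidates = [i / 20 for i in range(1, 20)]
    best = min(
        candidates,
        key=lambda t: abs(mean_length(length_distribution(corpus, t)) - target_mean),
    )
    return [segment(s, best) for s in corpus], best
```

The design choice in this sketch is to leave the tree's predictions untouched and move only the decision threshold, which shifts the whole phrase-length distribution at once; other readings of the claim, such as retraining the tree on a re-annotated corpus, are equally possible.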
29. An apparatus for adjusting a first corpus used for text-to-speech conversion, said apparatus comprising:
- means for building a decision tree for prosody structure prediction based on the first corpus, wherein the first corpus is associated with an initial speech speed;
- means for setting a target speech speed for an adjusted corpus, wherein the target speech speed is different than the initial speech speed;
- means for building a relationship between a distribution for prosody phrase length and the initial speech speed based, at least in part, on said decision tree; and
- means for generating the adjusted corpus by adjusting said distribution of prosody phrase length of the first corpus based, at least in part, on the target speech speed based on said decision tree and said relationship.
Dependent claims: 30
31. A non-transitory computer-readable medium encoded with a plurality of instructions that, when executed by a computer, perform a method, the method comprising:
- parsing input text to obtain descriptive prosody annotations of the text based, at least in part, on a text-to-speech model generated from a first corpus, wherein the descriptive prosody annotations include a prosody structure of the text, wherein the first corpus is associated with an initial speech speed;
- adjusting the prosody structure of the text based, at least in part, on a target speech speed, wherein the target speech speed is different than the initial speech speed, and wherein said prosody structure includes information selected from the group consisting of prosody word information, prosody phrase information, and intonation phrase information;
- determining at least one prosody parameter of the text based, at least in part, on the adjusted prosody structure of the text; and
- synthesizing speech corresponding to said input text based, at least in part, on said at least one prosody parameter of the text.
32. A non-transitory computer readable medium encoded with a plurality of instructions that, when executed by a computer, perform a method for adjusting a first corpus used for text-to-speech conversion, said method comprising:
- building a decision tree for prosody structure prediction based on the first corpus, wherein the first corpus is associated with an initial speech speed;
- setting a target speech speed for an adjusted corpus, wherein the target speech speed is different than the initial speech speed;
- building a relationship between a distribution for prosody phrase length and the initial speech speed based, at least in part, on said decision tree; and
- generating the adjusted corpus by adjusting said distribution for prosody phrase length of the first corpus according to the target speech speed based, at least in part, on said decision tree and said relationship.
33. An apparatus for text to speech conversion, comprising:
at least one processor programmed to:
- parse input text to obtain descriptive prosody annotations of the text based on a text-to-speech model generated from a first corpus, wherein said descriptive prosody annotations of the text include a prosody structure of the text, wherein the first corpus is associated with an initial speech speed, and wherein said prosody structure includes information selected from the group consisting of prosody word information, prosody phrase information, and intonation phrase information;
- determine at least one prosody parameter of the text based, at least in part, on the parsed input text;
- synthesize speech corresponding to said input text based, at least in part, on said at least one prosody parameter of the text; and
- adjust the prosody structure of the text based, at least in part, on a target speech speed for the synthesized speech, wherein the target speech speed is different than the initial speech speed.
34. An apparatus for adjusting a first corpus used for text-to-speech conversion, said apparatus comprising:
at least one processor programmed to:
- build a decision tree for prosody structure prediction based on the first corpus, wherein the first corpus is associated with an initial speech speed;
- set a target speech speed for an adjusted corpus, wherein the target speech speed is different than the initial speech speed;
- build a relationship between a distribution for prosody phrase length and the initial speech speed based, at least in part, on said decision tree; and
- generate the adjusted corpus by adjusting said distribution of prosody phrase length of the first corpus based, at least in part, on the target speech speed based on said decision tree and said relationship.
Specification