Converting text-to-speech and adjusting corpus

US 8,595,011 B2
Filed: 07/03/2008
Issued: 11/26/2013
Est. Priority Date: 05/31/2004
Status: Active Grant

8 Assignments

0 Petitions

Abstract

The present invention provides a method and apparatus for text-to-speech conversion, and a method and apparatus for adjusting a corpus. The text-to-speech method comprises: a text analysis step that parses the text to obtain descriptive prosody annotations of the text based on a TTS model generated from a first corpus; a prosody parameter prediction step that predicts the prosody parameter of the text according to the result of the text analysis step; and a speech synthesis step that synthesizes speech of the text based on the prosody parameter of the text. The descriptive prosody annotations include a prosody structure for the text, and this prosody structure is adjusted according to a target speech speed for the synthesized speech. Because the prosody structure is adjusted to the target speech speed, the synthesized speech is stated to have improved quality.
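The flow described in the abstract can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the patented implementation: the boundary probabilities, the rule that a slower target speed lowers the break threshold, the 200 ms base duration, and every function name below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Annotated:
    words: list
    boundary_probs: list  # assumed probability of a prosody phrase break after each word


def analyze(text):
    """Stand-in for text analysis; a real system would use a model trained on a corpus."""
    words = text.split()
    # Assumption: trailing punctuation marks a likely break, other words an unlikely one.
    probs = [0.9 if w[-1] in ",.;" else 0.3 for w in words]
    return Annotated(words, probs)


def adjust_structure(annotated, target_speed, initial_speed=1.0, base_threshold=0.5):
    """Regroup words into prosody phrases for the target speed.

    Assumption: slower speech lowers the threshold, producing more breaks.
    """
    threshold = base_threshold * target_speed / initial_speed
    phrases, current = [], []
    for word, prob in zip(annotated.words, annotated.boundary_probs):
        current.append(word)
        if prob >= threshold:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases


def predict_durations(phrases, target_speed, base_ms=200.0):
    """Toy prosody parameter prediction: one duration per word, shorter when faster."""
    return [[base_ms / target_speed for _ in phrase] for phrase in phrases]


def synthesis_plan(phrases, durations):
    """Placeholder for synthesis: (word, duration_ms) pairs with a pause after each phrase."""
    plan = []
    for phrase, durs in zip(phrases, durations):
        plan.extend(zip(phrase, durs))
        plan.append(("<pause>", None))
    return plan


annotated = analyze("the quick brown fox, jumps over the lazy dog.")
for speed in (0.5, 1.0, 2.0):
    phrases = adjust_structure(annotated, speed)
    plan = synthesis_plan(phrases, predict_durations(phrases, speed))
    print(speed, len(phrases), len(plan))
```

With these toy probabilities the same sentence is split into nine, two, and one phrase at speeds 0.5, 1.0, and 2.0, which shows how the phrase structure, and not only the per-word duration, can depend on the target speed.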

11 Citations

34 Claims

1. A method for text to speech conversion, comprising:
- parsing, with at least one processor, input text to obtain descriptive prosody annotations of the text based, at least in part, on a text-to-speech model generated from a first corpus, wherein the descriptive prosody annotations include a prosody structure of the text, wherein the prosody structure of the text is associated with an initial speech speed, and wherein said prosody structure includes information selected from the group consisting of prosody word information, prosody phrase information, and intonation phrase information;
  
  adjusting the prosody structure of the text based, at least in part, on a target speech speed for speech to be synthesized corresponding to the input text, wherein the target speech speed is different than the initial speech speed;
  
  determining at least one prosody parameter of the text based, at least in part, on the adjusted prosody structure of the text; and
  
  synthesizing speech corresponding to said input text based, at least in part, on said at least one prosody parameter of the text.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
  - 2. The method for text to speech conversion according to claim 1, wherein said descriptive prosody annotations of the text further include pronunciation and accent annotation.
  - 3. The method for text to speech conversion according to claim 1, further comprising:
    - acoustically evaluating the synthesized speech of the text; and
      
      adjusting the prosody structure of the text according to the acoustic evaluation result.
  - 4. The method for text to speech conversion according to claim 1, wherein said target speech speed corresponds to a speech speed of a second corpus.
  - 5. The method for text to speech conversion according to claim 1, further comprising:
    - adjusting the prosody parameter based, at least in part, on the target speech speed.
  - 6. The method for text to speech conversion according to claim 1, wherein adjusting the prosody structure of the text further comprises adjusting the intonation phrase of the text.
  - 7. The method for text to speech conversion according to claim 1, wherein said at least one prosody parameter of the text includes a value for pitch, duration and/or energy associated with the at least one prosody parameter.
  - 8. The method for text to speech conversion according to claim 7, wherein the at least one prosody parameter includes a value for duration of the at least one prosody parameter, and wherein adjusting the at least one prosody parameter comprises adjusting the value for the duration of the at least one prosody parameter based, at least in part, on the target speech speed.
  - 9. The method for text to speech conversion according to claim 1, wherein adjusting said prosody structure of the text comprises adjusting a distribution of prosody phrase length of the text.
  - 10. The method for text to speech conversion according to claim 9, wherein said first corpus has a first distribution of prosody phrase length corresponding to a first threshold for prosody boundary probability under a first speech speed;
    - wherein adjusting the distribution of the prosody phrase length of the text comprises adjusting the distribution of the prosody phrase length of the first corpus to produce an adjusted first corpus by adjusting the first threshold for prosody boundary probability; and
      
      wherein parsing the text comprises parsing the text based, at least in part, on the adjusted first corpus.
  - 11. The method for text to speech conversion according to claim 9, wherein adjusting the prosody phrase length distribution of the text comprises adjusting the distribution of prosody phrase with maximum length or maximum phrase number.
  - 12. The method for text to speech conversion according to claim 1, wherein said prosody structure includes information associated with prosody phrase, and wherein adjusting the prosody structure of the text comprises adjusting a distribution of prosody phrase length of the text to a target distribution.
  - 13. The method for text to speech conversion according to claim 4, wherein said first corpus has a first distribution for prosody phrase length corresponding to a first threshold for prosody boundary probability under a first speech speed, said second corpus has a second distribution for prosody phrase length corresponding to a second threshold for prosody boundary probability under said second speech speed, and wherein adjusting the prosody structure of the text comprises:
    - generating an adjusted first corpus by adjusting the first threshold for prosody boundary probability according to the target speech speed, such that the distribution for prosody phrase length of the first corpus matches the distribution for prosody phrase length of the second corpus; and
      
      wherein parsing the text comprises parsing the text based, at least in part, on the adjusted first corpus.
  - 14. The method for text to speech conversion according to claim 12, wherein adjusting the prosody phrase length distribution of the text comprises adjusting the prosody phrase length distribution of the text using a curve fitting method.
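Claims 10 and 13 tie the distribution of prosody phrase length to a threshold on prosody boundary probability. A minimal Python sketch of that idea follows. It assumes that each word already carries a boundary probability and that the mean phrase length is an adequate summary of the distribution; both are simplifications, and the function names are hypothetical.

```python
def phrase_lengths(boundary_probs, threshold):
    """Phrase lengths (in words) when a break is placed wherever prob >= threshold."""
    lengths, run = [], 0
    for prob in boundary_probs:
        run += 1
        if prob >= threshold:
            lengths.append(run)
            run = 0
    if run:  # trailing words with no final break still form a phrase
        lengths.append(run)
    return lengths


def mean(values):
    return sum(values) / len(values)


def pick_threshold(boundary_probs, target_mean_length, candidates):
    """Choose the candidate threshold whose mean phrase length is closest to the target.

    The target would come from a second corpus recorded at the target speech speed.
    """
    return min(
        candidates,
        key=lambda t: abs(mean(phrase_lengths(boundary_probs, t)) - target_mean_length),
    )


probs = [0.2, 0.6, 0.4, 0.9, 0.3, 0.7, 0.5, 0.95]
print(phrase_lengths(probs, 0.5))  # [2, 2, 2, 1, 1]
print(phrase_lengths(probs, 0.8))  # [4, 4]
print(pick_threshold(probs, 4.0, [0.5, 0.65, 0.8]))  # 0.8
```

Raising the threshold removes weaker boundaries and lengthens the phrases; lowering it does the opposite. Matching a full distribution rather than its mean would need a distance between histograms, which this sketch leaves out.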

15. An apparatus for text to speech conversion, comprising:
- text analysis means for parsing input text to obtain descriptive prosody annotations of the text based on a text-to-speech model generated from a first corpus, wherein said descriptive prosody annotations of the text include a prosody structure of the text, wherein the prosody structure of the text is associated with an initial speech speed, and wherein said prosody structure includes information selected from the group consisting of prosody word information, prosody phrase information, and intonation phrase information;
  
  prosody parameter prediction means for predicting at least one prosody parameter of the text based, at least in part, on the parsed text;
  
  speech synthesis means for synthesizing speech corresponding to said input text based, at least in part, on said at least one prosody parameter of the text; and
  
  prosody structure adjusting means for adjusting the prosody structure of the text based, at least in part, on a target speech speed for the synthesized speech, wherein the target speech speed is different than the initial speech speed.
- Dependent claims (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
  - 16. The apparatus for text to speech conversion according to claim 15, wherein said prosody structure adjusting means is further configured to adjust the intonation phrase of the text according to the target speech speed.
  - 17. The apparatus for text to speech conversion according to claim 15, wherein said prosody structure adjusting means is further configured to adjust a distribution of prosody phrase length of the text according to the target speech speed.
  - 18. The apparatus for text to speech conversion according to claim 15, wherein said at least one prosody parameter of the text includes a value for pitch, duration, and/or energy associated with the at least one prosody parameter.
  - 19. The apparatus for text to speech conversion according to claim 17, wherein said first corpus has a first distribution of prosody phrase length corresponding to a first threshold for prosody boundary probability under a first speech speed, wherein said prosody structure adjusting means is further configured to generate an adjusted first corpus by adjusting the distribution of the prosody phrase length of the first corpus by adjusting the first threshold for prosody boundary probability; and
    - wherein said text analysis means is further configured to parse the text according to the adjusted first corpus.
  - 20. The apparatus for text to speech conversion according to claim 17, wherein said prosody structure adjusting means is further configured to adjust the prosody phrase length distribution of the text by adjusting the distribution of prosody phrase with maximum length or maximum phrase number.
  - 21. The apparatus for text to speech conversion according to claim 15, wherein said target speech speed corresponds to a speech speed of a second corpus.
  - 22. The apparatus for text to speech conversion according to claim 21, wherein said first corpus has a first distribution for prosody phrase length corresponding to a first threshold for prosody boundary probability under a first speech speed, said second corpus has a second distribution for prosody phrase length corresponding to a second threshold for prosody boundary probability under said second speech speed, and wherein said prosody structure adjusting means is further configured to generate an adjusted first corpus by adjusting the first threshold for prosody boundary probability according to the target speech speed, such that the distribution for prosody phrase length of the first corpus matches that of the second corpus; and
    - wherein said text analysis means is further configured to parse the text according to the adjusted first corpus.
  - 23. The apparatus for text to speech conversion according to claim 15, wherein said prosody structure includes information associated with prosody phrase, and wherein said prosody structure adjusting means is further configured to adjust a distribution of prosody phrase length of the text to a target distribution.
  - 24. The apparatus for text to speech conversion according to claim 23, wherein said speech synthesis means is further configured to adjust the prosody phrase length distribution of the text using a curve fitting method.
  - 25. The apparatus for text to speech conversion according to claim 15, wherein said speech synthesis means is further configured to adjust the at least one prosody parameter according to the target speech speed.
  - 26. The apparatus for text to speech conversion according to claim 25, wherein the at least one prosody parameter includes a value for duration of the at least one prosody parameter, and wherein said speech synthesis means is further configured to adjust the value of the duration of the at least one prosody parameter based, at least in part, on the target speech speed.
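Claims 8 and 26 adjust the duration value of a prosody parameter according to the target speech speed. The claims do not say how the adjustment is computed, so the Python sketch below assumes the simplest option: speeds are expressed in the same unit (for example syllables per second) and durations scale by the ratio of initial to target speed. The function name is hypothetical.

```python
def scale_durations(durations_ms, initial_speed, target_speed):
    """Scale predicted durations so the overall tempo moves from initial to target speed.

    Assumption: uniform scaling; a real system might stretch vowels and pauses
    differently from consonants.
    """
    if initial_speed <= 0 or target_speed <= 0:
        raise ValueError("speech speeds must be positive")
    factor = initial_speed / target_speed
    return [d * factor for d in durations_ms]


print(scale_durations([200.0, 100.0], initial_speed=4.0, target_speed=8.0))  # [100.0, 50.0]
print(scale_durations([200.0, 100.0], initial_speed=4.0, target_speed=2.0))  # [400.0, 200.0]
```

Uniform scaling changes tempo only. In the independent claims, the adjustment of the prosody structure is a separate step.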

27. A method for adjusting a first corpus used for text-to-speech conversion, said method comprising:
- building a decision tree for prosody structure prediction based on the first corpus, wherein the first corpus is associated with an initial speech speed;
  
  setting a target speech speed for an adjusted corpus, wherein the target speech speed is different than the initial speech speed;
  
  building a relationship between a distribution for prosody phrase length and the initial speech speed based, at least in part, on said decision tree; and
  
  generating, with at least one processor, the adjusted corpus by adjusting said distribution for prosody phrase length of the first corpus according to the target speech speed based, at least in part, on said decision tree and said relationship.
- Dependent claims (28)
  - 28. The method for adjusting a first corpus according to claim 27, wherein building the decision tree further comprises:
    - extracting prosody boundary context information for at least one word in the first corpus; and
      
      building said decision tree for prosody boundary prediction based, at least in part, on the prosody boundary context information.
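The corpus adjustment of claims 27 and 28 can be sketched in Python under heavy simplification. The decision tree here is a depth-one stand-in with one leaf per context tag. The assumption that the mean phrase length should scale with the ratio of target to initial speed is hypothetical and is not taken from the claims. All names are illustrative.

```python
from collections import defaultdict

# A token is (word, context_tag, is_boundary). The context tag is a hypothetical
# stand-in for the prosody boundary context information of claim 28.


def fit_stump(corpus):
    """Depth-one decision tree: each context tag is a leaf holding P(boundary)."""
    counts = defaultdict(lambda: [0, 0])  # tag -> [boundary count, total count]
    for sentence in corpus:
        for _, tag, is_boundary in sentence:
            counts[tag][0] += int(is_boundary)
            counts[tag][1] += 1
    return {tag: b / n for tag, (b, n) in counts.items()}


def relabel(corpus, leaf_probs, threshold):
    """Re-mark boundaries wherever the leaf probability reaches the threshold."""
    return [[(w, tag, leaf_probs[tag] >= threshold) for w, tag, _ in s] for s in corpus]


def mean_phrase_length(corpus):
    lengths = []
    for sentence in corpus:
        run = 0
        for _, _, is_boundary in sentence:
            run += 1
            if is_boundary:
                lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return sum(lengths) / len(lengths)


def adjust_corpus(corpus, initial_speed, target_speed, thresholds):
    leaf_probs = fit_stump(corpus)
    # Relationship between threshold and phrase length, read off the tree.
    relationship = {t: mean_phrase_length(relabel(corpus, leaf_probs, t)) for t in thresholds}
    # Assumption: faster speech should have proportionally longer phrases.
    target = mean_phrase_length(corpus) * target_speed / initial_speed
    best = min(thresholds, key=lambda t: abs(relationship[t] - target))
    return relabel(corpus, leaf_probs, best), best


corpus = [
    [("a", "fn", False), ("b", "noun", True), ("c", "verb", False), ("d", "noun", True)],
    [("e", "fn", False), ("f", "verb", True), ("g", "fn", False), ("h", "noun", True)],
]
adjusted, threshold = adjust_corpus(corpus, initial_speed=4.0, target_speed=6.0, thresholds=[0.4, 0.8])
print(threshold, mean_phrase_length(adjusted))
```

On this toy corpus the faster target selects the higher threshold, which drops the weak boundary after "f" and lengthens the average phrase. A production system would train a full decision tree on richer context features.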

29. An apparatus for adjusting a first corpus used for text-to-speech conversion, said apparatus comprising:
- means for building a decision tree for prosody structure prediction based on the first corpus, wherein the first corpus is associated with an initial speech speed;
  
  means for setting a target speech speed for an adjusted corpus, wherein the target speech speed is different than the initial speech speed;
  
  means for building a relationship between a distribution for prosody phrase length and the initial speech speed based, at least in part, on said decision tree; and
  
  means for generating the adjusted corpus by adjusting said distribution of prosody phrase length of the first corpus based, at least in part, on the target speech speed based on said decision tree and said relationship.
- Dependent claims (30)
  - 30. The apparatus for adjusting a text to speech corpus according to claim 29, wherein the means for building the decision tree is further configured to:
    - extract prosody boundary context information for at least one word in the first corpus; and
      
      build said decision tree for prosody boundary prediction based, at least in part, on the prosody boundary context information.

31. A non-transitory computer-readable medium encoded with a plurality of instructions that, when executed by a computer, perform a method, the method comprising:
- parsing input text to obtain descriptive prosody annotations of the text based, at least in part, on a text-to-speech model generated from a first corpus, wherein the descriptive prosody annotations include a prosody structure of the text, wherein the first corpus is associated with an initial speech speed;
  
  adjusting the prosody structure of the text based, at least in part, on a target speech speed, wherein the target speech speed is different than the initial speech speed, and wherein said prosody structure includes information selected from the group consisting of prosody word information, prosody phrase information, and intonation phrase information;
  
  determining at least one prosody parameter of the text based, at least in part, on the adjusted prosody structure of the text; and
  
  synthesizing speech corresponding to said input text based, at least in part, on said at least one prosody parameter of the text.

32. A non-transitory computer readable medium encoded with a plurality of instructions that, when executed by a computer, perform a method for adjusting a first corpus used for text-to-speech conversion, said method comprising:
- building a decision tree for prosody structure prediction based on the first corpus, wherein the first corpus is associated with an initial speech speed;
  
  setting a target speech speed for an adjusted corpus, wherein the target speech speed is different than the initial speech speed;
  
  building a relationship between a distribution for prosody phrase length and the initial speech speed based, at least in part, on said decision tree; and
  
  generating the adjusted corpus by adjusting said distribution for prosody phrase length of the first corpus according to the target speech speed based, at least in part, on said decision tree and said relationship.

33. An apparatus for text to speech conversion, comprising:
- at least one processor programmed to:
  
  parse input text to obtain descriptive prosody annotations of the text based on a text-to-speech model generated from a first corpus, wherein said descriptive prosody annotations of the text include a prosody structure of the text, wherein the first corpus is associated with an initial speech speed, and wherein said prosody structure includes information selected from the group consisting of prosody word information, prosody phrase information, and intonation phrase information;
  
  determine at least one prosody parameter of the text based, at least in part, on the parsed input text;
  
  synthesize speech corresponding to said input text based, at least in part, on said at least one prosody parameter of the text; and
  
  adjust the prosody structure of the text based, at least in part, on a target speech speed for the synthesized speech, wherein the target speech speed is different than the initial speech speed.

34. An apparatus for adjusting a first corpus used for text-to-speech conversion, said apparatus comprising:
- at least one processor programmed to:
  
  build a decision tree for prosody structure prediction based on the first corpus, wherein the first corpus is associated with an initial speech speed;
  
  set a target speech speed for an adjusted corpus, wherein the target speech speed is different than the initial speech speed;
  
  build a relationship between a distribution for prosody phrase length and the initial speech speed based, at least in part, on said decision tree; and
  
  generate the adjusted corpus by adjusting said distribution of prosody phrase length of the first corpus based, at least in part, on the target speech speed based on said decision tree and said relationship.

Current Assignee
Cerence Operating Company (Cerence Inc.)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
Shi, Qin; Zhang, Wei; Zhu, Wei Bin; Chai, Hai Xin
Primary Examiner(s)
Opsasnick, Michael N

Application Number
US12/167,707
Publication Number
US 20080270139A1
Time in Patent Office
1,972 Days
Field of Search
704/260
US Class Current
704/260
CPC Class Codes
G10L 13/10 Prosody rules derived from ...
G10L 21/04 Time compression or expansion
