Converting text-to-speech and adjusting corpus

US 20080270139A1
Filed: 07/03/2008
Published: 10/30/2008
Est. Priority Date: 05/31/2004
Status: Active Grant

First Claim

1. A method for text to speech conversion, comprising:

a text analysis step for parsing the text to obtain descriptive prosody annotations of the text based on a text to speech model generated from a first corpus;

a prosody parameter prediction step for predicting the prosody parameter of the text according to the result of text analysis step; and

a speech synthesis step for synthesizing speech of said text based on said predicted prosody parameter of the text;

wherein descriptive prosody annotations of the text include prosody structure of the text, and the prosody structure of the text is adjusted according to a target speech speed for the synthesized speech.

8 Assignments

0 Petitions

Abstract

The present invention provides a method and apparatus for text to speech conversion, and a method and apparatus for adjusting a corpus. The method for text to speech conversion comprises: a text analysis step for parsing the text to obtain descriptive prosody annotations of the text based on a TTS model generated from a first corpus; a prosody parameter prediction step for predicting the prosody parameter of the text according to the result of the text analysis step; and a speech synthesis step for synthesizing speech of said text based on said predicted prosody parameter of the text; wherein the descriptive prosody annotations of the text include a prosody structure of the text, and the prosody structure of the text is adjusted according to a target speech speed for the synthesized speech. The present invention adjusts the prosody structure of the text according to the target speech speed, so that the synthesized speech has improved quality.
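
As a rough illustration of the three steps named in the abstract, the following Python sketch chains a text analysis stage, a prosody parameter prediction stage, and a synthesis stage. Every name in it (`analyze_text`, `predict_prosody`, `synthesize`, `speed_factor`), the boundary probabilities, and the numeric defaults are hypothetical and are not taken from the patent. The "model" is a stub that receives invented boundary probabilities, and the "synthesizer" only reports a total duration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProsodyParams:
    pitch_hz: float
    duration_s: float
    energy: float


def analyze_text(words: List[str], boundary_probs: List[float],
                 threshold: float) -> List[List[str]]:
    """Group words into prosody phrases.

    boundary_probs[i] is an assumed probability of a phrase boundary after
    words[i]; it has len(words) - 1 entries. A boundary is placed wherever
    the probability reaches the threshold.
    """
    phrases, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i < len(boundary_probs) and boundary_probs[i] >= threshold:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases


def predict_prosody(phrases: List[List[str]],
                    speed_factor: float = 1.0) -> List[ProsodyParams]:
    """Assign placeholder pitch, duration and energy to every word.

    Duration shrinks as the (hypothetical) speed factor grows.
    """
    return [ProsodyParams(pitch_hz=120.0, duration_s=0.3 / speed_factor, energy=1.0)
            for phrase in phrases for _ in phrase]


def synthesize(params: List[ProsodyParams]) -> float:
    """Stand-in for a synthesizer: returns the total duration in seconds."""
    return sum(p.duration_s for p in params)


words = ["the", "quick", "brown", "fox", "jumps"]
probs = [0.2, 0.3, 0.8, 0.4]  # invented values for illustration only
many_phrases = analyze_text(words, probs, threshold=0.25)
few_phrases = analyze_text(words, probs, threshold=0.5)
total_s = synthesize(predict_prosody(few_phrases, speed_factor=2.0))
```

With these invented probabilities, the higher threshold yields two phrases and the lower threshold yields four, which is the kind of change in prosody structure the abstract ties to the target speech speed.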

242 Citations

36 Claims

1. A method for text to speech conversion, comprising:
- a text analysis step for parsing the text to obtain descriptive prosody annotations of the text based on a text to speech model generated from a first corpus;
  
  a prosody parameter prediction step for predicting the prosody parameter of the text according to the result of text analysis step; and
  
  a speech synthesis step for synthesizing speech of said text based on said predicted prosody parameter of the text;
  
  wherein descriptive prosody annotations of the text include prosody structure of the text, and the prosody structure of the text is adjusted according to a target speech speed for the synthesized speech.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 33)
- - 2. The method for text to speech conversion according to claim 1, wherein said descriptive prosody annotations of the text further include pronunciation and accent annotation.
  - 3. The method for text to speech conversion according to claim 1, wherein said prosody parameters of the text include the value of pitch, duration and energy.
  - 4. The method for text to speech conversion according to claim 1, wherein said prosody structure includes prosody word, prosody phrase and intonation phrase.
  - 5. The method for text to speech conversion according to claim 4, wherein said prosody structure of the text is adjusted by adjusting the distribution of the prosody phrase length of the text.
  - 6. The method for text to speech conversion according to claim 5, wherein said first corpus has a first distribution of prosody phrase length corresponding to a first threshold for prosody boundary probability under a first speech speed, the distribution of the prosody phrase length of the text is adjusted by the following steps:
    - adjusting the distribution of the prosody phrase length of the first corpus by adjusting the first threshold for prosody boundary probability; and
      
      carrying out said text analysis step by parsing the text according to the adjusted first corpus.
  - 7. The method for text to speech conversion according to claim 1, further comprising the following steps:
    - acoustically evaluating the synthesized speech of the text; and
      
      adjusting the prosody structure of the text according to the acoustic evaluation result.
  - 8. The method for text to speech conversion according to claim 1, wherein said target speech speed corresponds to a second speech speed of a second corpus.
  - 9. The method for text to speech conversion according to claim 1, wherein said prosody structure includes prosody phrase, said prosody structure of the text is adjusted by adjusting the distribution of the prosody phrase length of the text to a target distribution.
  - 10. The method for text to speech conversion according to claim 8, wherein said first corpus having a first distribution for prosody phrase length corresponding to a first threshold for prosody boundary probability under a first speech speed, said second corpus having a second distribution for prosody phrase length corresponding to a second threshold for prosody boundary probability under said second speech speed, the prosody structure of the text is adjusted by the following steps:
    - adjusting the first threshold for prosody boundary probability according to the target speech speed, such that the distribution for prosody phrase length of the first corpus matches that of the second corpus; and
      
      carrying out the text analysis step by parsing the text according to the adjusted first corpus.
  - 11. The method for text to speech conversion according to claim 1, wherein the prosody parameter is adjusted according to the target speech speed.
  - 12. The method for text to speech conversion according to claim 3, wherein the duration of the prosody parameter is adjusted according to the target speech speed.
  - 13. The method for text to speech conversion according to claim 9, wherein the prosody phrase length distribution of the text is adjusted with a curve fitting method.
  - 14. The method for text to speech conversion according to claim 5, wherein the prosody phrase length distribution of the text is adjusted by adjusting the distribution of prosody phrase with maximum length or maximum phrase number.
  - 15. The method for text to speech conversion according to claim 4, wherein adjusting the prosody structure of the text further comprises adjusting the intonation phrase of the text.
  - 33. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for causing text to speech conversion, the computer readable program code means in said article of manufacture comprising computer readable program code means for causing a computer to effect the steps of claim 1.
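
Claims 5 and 6 tie the distribution of prosody phrase length to a threshold for prosody boundary probability. The sketch below shows one way that link could be computed, assuming each sentence is represented by a list of per-juncture boundary probabilities. The corpus values and the helper names (`phrase_lengths`, `length_distribution`, `mean_length`) are assumptions made for illustration, not part of the claims.

```python
from collections import Counter
from typing import Dict, List


def phrase_lengths(boundary_probs: List[float], threshold: float) -> List[int]:
    """Phrase lengths (in words) for one sentence.

    A sentence with n words has n - 1 junctures; a juncture becomes a
    phrase boundary when its probability reaches the threshold.
    """
    lengths, run = [], 1
    for p in boundary_probs:
        if p >= threshold:
            lengths.append(run)
            run = 1
        else:
            run += 1
    lengths.append(run)
    return lengths


def length_distribution(corpus: List[List[float]], threshold: float) -> Dict[int, float]:
    """Relative frequency of each phrase length over the whole corpus."""
    counts = Counter()
    for sentence in corpus:
        counts.update(phrase_lengths(sentence, threshold))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: n / total for length, n in sorted(counts.items())}


def mean_length(dist: Dict[int, float]) -> float:
    return sum(length * share for length, share in dist.items())


# Invented boundary probabilities for a two-sentence toy corpus.
corpus = [[0.2, 0.3, 0.8, 0.4], [0.6, 0.1, 0.7]]
dist_low = length_distribution(corpus, threshold=0.5)
dist_high = length_distribution(corpus, threshold=0.75)
```

On this toy corpus, raising the threshold from 0.5 to 0.75 removes boundaries, so the distribution shifts toward longer phrases.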

16. An apparatus for text to speech conversion, comprising:
- text analysis means for parsing the text to obtain descriptive prosody annotations of the text based on a text to speech model generated from a first corpus, said descriptive prosody annotations of the text include prosody structure of the text;
  
  prosody parameter prediction means for predicting the prosody parameter of the text according to the result of text analysis step;
  
  speech synthesis means for synthesizing speech of said text based on said predicted prosody parameter of the text; and
  
  prosody structure adjusting means for adjusting the prosody structure of the text according to a target speech speed for the synthesized speech.
- Dependent claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 35)
- - 17. The apparatus for text to speech conversion according to claim 16, wherein said prosody structure includes prosody word, prosody phrase and intonation phrase.
  - 18. The apparatus for text to speech conversion according to claim 17, wherein said prosody structure adjusting means is further configured to adjust the distribution of the prosody phrase length of the text according to the target speech speed.
  - 19. The apparatus for text to speech conversion according to claim 17, wherein said prosody structure adjusting means is further configured to adjust the intonation phrase of the text according to the target speech speed.
  - 20. The apparatus for text to speech conversion according to claim 18, wherein said first corpus has a first distribution of prosody phrase length corresponding to a first threshold for prosody boundary probability under a first speech speed, wherein said prosody structure adjusting means is further configured to adjust the distribution of the prosody phrase length of the first corpus by adjusting the first threshold for prosody boundary probability;
    - said text analysis means is further configured to parse the text according to the adjusted first corpus.
  - 21. The apparatus for text to speech conversion according to claim 16, wherein said prosody parameters of the text include the value of pitch, duration and energy.
  - 22. The apparatus for text to speech conversion according to claim 16, wherein said target speech speed corresponds to a second speech speed of a second corpus.
  - 23. The apparatus for text to speech conversion according to claim 16, wherein said prosody structure includes prosody phrase, said prosody structure adjusting means is further configured to adjust the distribution of the prosody phrase length of the text to a target distribution.
  - 24. The apparatus for text to speech conversion according to claim 22, wherein said first corpus having a first distribution for prosody phrase length corresponding to a first threshold for prosody boundary probability under a first speech speed, said second corpus having a second distribution for prosody phrase length corresponding to a second threshold for prosody boundary probability under said second speech speed, wherein said prosody structure adjusting means is further configured to adjust the first threshold for prosody boundary probability according to the target speech speed, such that the distribution for prosody phrase length of the first corpus matches that of the second corpus;
    - and wherein said text analysis means is further configured to parse the text according to the adjusted first corpus.
  - 25. The apparatus for text to speech conversion according to claim 16, wherein said speech synthesis means is further configured to adjust the prosody parameter according to the target speech speed.
  - 26. The apparatus for text to speech conversion according to claim 25, wherein the prosody parameter includes duration, said speech synthesis means is further configured to adjust the duration according to the target speech speed.
  - 27. The apparatus for text to speech conversion according to claim 23, wherein said speech synthesis means is further configured to adjust the prosody phrase length distribution of the text with a curve fitting method.
  - 28. The apparatus for text to speech conversion according to claim 18, wherein said prosody structure adjusting means is further configured to adjust the prosody phrase length distribution of the text by adjusting the distribution of prosody phrase with maximum length or maximum phrase number.
  - 35. A computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing functions of an apparatus for text to speech conversion, the computer readable program code means in said computer program product comprising computer readable program code means for causing a computer to effect the functions of claim 16.
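
Claims 10 and 24 describe adjusting the first threshold until the prosody phrase length distribution of the first corpus matches that of the second corpus. The claims do not state how the match is measured or searched for, so the following is only a sketch under assumptions: it uses an L1 distance between distributions and a search over a small list of candidate thresholds. The corpus data and all helper names are invented.

```python
from collections import Counter
from typing import Dict, Iterable, List


def distribution(lengths: Iterable[int]) -> Dict[int, float]:
    """Relative frequency of each phrase length."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return {k: n / total for k, n in sorted(counts.items())} if total else {}


def phrase_lengths(boundary_probs: List[float], threshold: float) -> List[int]:
    """Phrase lengths for one sentence given per-juncture boundary probabilities."""
    lengths, run = [], 1
    for p in boundary_probs:
        if p >= threshold:
            lengths.append(run)
            run = 1
        else:
            run += 1
    lengths.append(run)
    return lengths


def corpus_distribution(corpus: List[List[float]], threshold: float) -> Dict[int, float]:
    return distribution(n for sent in corpus for n in phrase_lengths(sent, threshold))


def l1_distance(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Sum of absolute differences over all phrase lengths seen in either distribution."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


def match_threshold(first_corpus: List[List[float]], target: Dict[int, float],
                    candidates: List[float]) -> float:
    """Candidate threshold whose distribution is closest to the target (assumed criterion)."""
    return min(candidates,
               key=lambda t: l1_distance(corpus_distribution(first_corpus, t), target))


first_corpus = [[0.2, 0.3, 0.8, 0.4], [0.6, 0.1, 0.7]]  # invented probabilities
second_corpus_lengths = [2, 3, 4]  # invented annotated phrase lengths of a second corpus
target = distribution(second_corpus_lengths)
best = match_threshold(first_corpus, target, [0.25, 0.5, 0.75])
```

A finer candidate grid or a different distance measure would work the same way; the choice here is made only to keep the example short.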

29. A method for adjusting a text to speech corpus, said corpus is a first corpus, said method comprising:
- building a decision tree for prosody structure prediction based on the first corpus;
  
  setting a target speech speed for the corpus;
  
  building the relationship between the distribution for prosody phrase length and the speech speed for the first corpus based on said decision tree; and
  
  adjusting said distribution for prosody phrase length of the first corpus according to the target speech speed based on said decision tree and said relationship.
- Dependent claims (30, 34)
- - 30. The method for adjusting a text to speech corpus according to claim 29, further comprising at least one limitation taken from a group of limitations consisting of:
    - wherein the step for building the decision tree further comprising steps;
      
      extracting the prosody boundaries' context information for every word in the first corpus, building said decision tree for prosody boundary prediction based on the prosody boundaries' context information;
      
      wherein the step for adjusting said distribution for prosody phrase length further comprising adjusting the distribution of the prosody phrase length of the first corpus according to said target speech speed to match a target distribution;
      
      wherein said target speech speed corresponding to a second speech speed of a second corpus;
      
      wherein said first corpus has a first distribution of prosody phrase length corresponding to a first threshold for prosody boundary probability under a first speech speed, said second corpus has a second distribution of prosody phrase length corresponding to a second threshold for prosody boundary probability under a second speech speed;
      
      wherein said step of adjusting said distribution being performed by adjusting the distribution of the prosody phrase length of the first corpus according to the distribution of the prosody phrase length of the second corpus;
      
      wherein the step for building the relationship between the distribution for prosody phrase length and the speech speed further comprising;
      
      building the relationship between the threshold for prosody boundary probability, the distribution for prosody phrase length and the speech speed for the first corpus;
      
      wherein the step for adjusting said distribution for prosody phrase length of the first corpus being carried out by adjusting the threshold for prosody boundary probability;
      
      wherein the prosody phrase length distribution of the text is adjusted with a curve fitting method; and
      
      wherein the prosody phrase length distribution is adjusted by adjusting the distribution of prosody phrase with maximum length or maximum phrase number.
  - 34. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for adjusting a text to speech corpus, said corpus is a first corpus, said method steps comprising the steps of claim 29.
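
Claims 29 and 30 refer to a decision tree for prosody boundary prediction built from the context information of every word. The claims do not specify the features or the training procedure, so the sketch below is only an assumed, minimal stand-in: a one-level tree (a decision stump) chosen by Gini impurity, whose leaves store the fraction of training junctures that were boundaries. The feature names `pos` and `punct` and all training examples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (context features of a word juncture, whether a phrase boundary follows)
Example = Tuple[Dict[str, str], bool]


def gini(labels: List[bool]) -> float:
    """Gini impurity of a set of boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_by(examples: List[Example], feature: str) -> Dict[str, List[bool]]:
    """Group labels by the value each example has for the given feature."""
    groups = defaultdict(list)
    for feats, label in examples:
        groups[feats[feature]].append(label)
    return groups


def train_stump(examples: List[Example], features: List[str]):
    """Pick the feature with the lowest weighted Gini impurity and
    store the boundary probability observed in each leaf."""
    def weighted_gini(feature: str) -> float:
        groups = split_by(examples, feature)
        return sum(len(g) * gini(g) for g in groups.values()) / len(examples)

    feature = min(features, key=weighted_gini)
    leaf_probs = {value: sum(labels) / len(labels)
                  for value, labels in split_by(examples, feature).items()}
    return feature, leaf_probs


def boundary_probability(stump, feats: Dict[str, str], default: float = 0.0) -> float:
    feature, leaf_probs = stump
    return leaf_probs.get(feats[feature], default)


# Invented training data: part of speech and whether punctuation follows the word.
examples = [
    ({"pos": "noun", "punct": "yes"}, True),
    ({"pos": "noun", "punct": "no"}, False),
    ({"pos": "verb", "punct": "yes"}, True),
    ({"pos": "verb", "punct": "no"}, False),
    ({"pos": "noun", "punct": "no"}, True),
    ({"pos": "det", "punct": "no"}, False),
]
stump = train_stump(examples, ["pos", "punct"])
p = boundary_probability(stump, {"pos": "det", "punct": "no"})
```

The leaf probability is the quantity that a threshold for prosody boundary probability would be compared against: the juncture above becomes a boundary at a threshold of 0.2 but not at 0.5.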

31. An apparatus for adjusting a text to speech corpus, said corpus is a first corpus, said apparatus comprising:
- means for building a decision tree for prosody structure prediction based on the first corpus;
  
  means for setting a target speech speed for the corpus;
  
  means for building the relationship between the distribution for prosody phrase length and the speech speed for the first corpus based on said decision tree; and
  
  means for adjusting said distribution of prosody phrase length of the first corpus according to the target speech speed based on said decision tree and said relationship.
- Dependent claims (32, 36)
- - 32. The apparatus for adjusting a text to speech corpus according to claim 31, further comprising at least one limitation taken from a group of limitations consisting of:
    - wherein the means for building the decision tree is further configured to;
      
      extract the prosody boundaries' context information for every word in the first corpus, and build said decision tree for prosody boundary prediction based on the prosody boundaries' context information;
      
      wherein the means for adjusting said distribution of prosody phrase length is further configured to adjust the distribution of the prosody phrase length of the first corpus according to said target speech speed to match a target distribution;
      
      wherein said target speech speed corresponding to a second speech speed of a second corpus;
      
      wherein said first corpus has a first distribution of prosody phrase length corresponding to a first threshold for prosody boundary probability under a first speech speed, said second corpus has a second distribution of prosody phrase length corresponding to a second threshold for prosody boundary probability under a second speech speed, wherein said means for adjusting the distribution is further configured to adjust the distribution of the prosody phrase length of the first corpus according to the distribution of the prosody phrase length of the second corpus;
      
      wherein said means for building the relationship between the distribution for prosody phrase length and the speech speed is further configured to build the relationship between the threshold for prosody boundary probability, the distribution for prosody phrase length and the speech speed for the first corpus;
      
      wherein said means for adjusting said distribution is further configured to adjust the distribution for prosody phrase length of the first corpus by adjusting the threshold for prosody boundary probability;
      
      wherein said means for adjusting is further configured to adjust the prosody phrase length distribution of the text with a curve fitting method;
      
      wherein said means for adjusting is further configured to adjust the prosody phrase length distribution by adjusting the distribution of prosody phrase with maximum length or maximum phrase number.
  - 36. A computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing functions of an apparatus for adjusting a text to speech corpus, the computer readable program code means in said computer program product comprising computer readable program code means for causing a computer to effect the functions of claim 31.
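
Claims 30 and 32 mention a relationship between the threshold for prosody boundary probability, the prosody phrase length distribution, and the speech speed, and mention a curve fitting method. The claims do not say which curve or which summary of the distribution is used. The sketch below therefore makes loud assumptions: it summarizes each distribution by its mean phrase length, fits a straight line by least squares, and picks the threshold from a precomputed table. All numbers are invented, including the direction of the trend.

```python
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pick_threshold(threshold_to_mean: Dict[float, float], target_mean: float) -> float:
    """Threshold whose mean phrase length is closest to the target mean."""
    return min(threshold_to_mean, key=lambda t: abs(threshold_to_mean[t] - target_mean))


# Invented observations: speech speed (syllables per second) versus mean phrase length.
speeds = [3.0, 4.0, 5.0]
mean_lengths = [4.0, 5.0, 6.0]
slope, intercept = fit_line(speeds, mean_lengths)

# Mean phrase length predicted for a hypothetical target speed of 4.5.
target_mean = slope * 4.5 + intercept

# Invented table: mean phrase length of the first corpus at each threshold,
# assumed to come from applying the boundary decision tree to that corpus.
threshold_to_mean = {0.4: 4.2, 0.5: 5.1, 0.6: 5.6, 0.7: 6.3}
chosen = pick_threshold(threshold_to_mean, target_mean)
```

The decision tree itself is not modeled here. A higher-order polynomial or a fit over the full histogram could replace the straight line without changing the overall flow.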

Current Assignee
Cerence Operating Company (Cerence Inc.)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
Shi, Qin; Zhang, Wei; Chai, Hai Xin; Zhu, Wei Bin

Granted Patent

US 8,595,011 B2
US Class Current

704/260
CPC Class Codes

G10L 13/10 Prosody rules derived from ...

G10L 21/04 Time compression or expansion
