Prosody template matching for text-to-speech systems

US 20020128841A1
Filed: 01/05/2001
Published: 09/12/2002
Est. Priority Date: 01/05/2001
Status: Active Grant

Abstract

A prosody matching template in the form of a tree structure stores indices which point to lookup table and template information prescribing pitch and duration values that are used to add inflection to the output of a text-to-speech synthesizer. The lookup module employs a search algorithm that explores each branch of the tree, assigning penalty scores based on whether the syllable represented by a node of the tree does or does not match the corresponding syllable of the target word. The path with the lowest penalty score is selected as the index into the prosody template table. The system will add nodes by cloning existing nodes in cases where it is not possible to find a one-to-one match between the number of syllables in the target word and the number of nodes in the tree.
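
The branch-by-branch search described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Node` class, the integer stress levels (0 = unstressed, 1 = primary, 2 = secondary), the flat mismatch penalty of 1, and the stored patterns are all assumptions made for the example. Only paths with exactly as many nodes as the target has syllables are considered here; the node-cloning case is not covered.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class Node:
    """One syllable position in the tree (hypothetical layout)."""
    stress: Optional[int] = None                 # None only for the root
    children: Dict[int, "Node"] = field(default_factory=dict)
    template_index: Optional[int] = None         # set where a stored pattern ends


def insert(root: Node, pattern: Sequence[int], template_index: int) -> None:
    """Store a stress pattern and the table index it points to."""
    node = root
    for stress in pattern:
        node = node.children.setdefault(stress, Node(stress))
    node.template_index = template_index


def best_match(root: Node, target: Sequence[int],
               mismatch_penalty: int = 1) -> Optional[Tuple[int, int, Tuple[int, ...]]]:
    """Explore every branch; return (penalty, template index, pattern) of the cheapest path."""
    best: Optional[Tuple[int, int, Tuple[int, ...]]] = None

    def walk(node: Node, depth: int, penalty: int, path: List[int]) -> None:
        nonlocal best
        if depth == len(target):
            if node.template_index is not None and (best is None or penalty < best[0]):
                best = (penalty, node.template_index, tuple(path))
            return
        for stress, child in node.children.items():
            # Penalize a node whose stress differs from the target syllable.
            cost = 0 if stress == target[depth] else mismatch_penalty
            walk(child, depth + 1, penalty + cost, path + [stress])

    walk(root, 0, 0, [])
    return best


# Illustrative patterns only; the indices would point into a template table.
root = Node()
for index, pattern in enumerate([(1, 0), (0, 1), (1, 0, 0), (0, 1, 0)]):
    insert(root, pattern, index)

print(best_match(root, (1, 0, 2)))   # (1, 2, (1, 0, 0))
```

The returned index would then select the pitch and duration values from the prosody template table, which is omitted here.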

22 Claims

1. A text-to-speech synthesizer system, comprising:
- a text input module receptive of target synthesis text;
  
  a prosody module connected to the text input module for associating prosody information with the target synthesis text, the prosody module employing an n-way tree structure to identify the prosody information for the target synthesis text; and
  
  a sound generation module connected to the prosody module for converting the target synthesis text to audible speech using the prosody information.
- View Dependent Claims (2, 3, 4, 5)
- - 2. The text-to-speech synthesizer system of claim 1 wherein the prosody module employs a tree structure that is based on stress patterns, such that each node of the tree structure corresponds to a stress level that may be associated with a syllabic portion of a text string.
  - 3. The text-to-speech synthesizer system of claim 2 wherein the text input module is operative to segment the target synthesis text into syllabic portions and to determine a stress level for each syllabic portion, thereby forming a stress pattern for the target synthesis text.
  - 4. The text-to-speech synthesizer system of claim 3 wherein the prosody module is operative to traverse the tree structure in order to identify a matching stress pattern that corresponds to the stress pattern for the target synthesis text and to retrieve the prosody information for the target synthesis text using the matching stress pattern.
  - 5. The text-to-speech synthesizer system of claim 1 wherein the prosody information is further defined as pitch modification information and duration modification information.
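
As a rough illustration of claims 2 through 4, the sketch below forms a stress pattern for a word and walks a stress-based tree to find an exactly matching pattern. The toy `LEXICON`, its syllable splits, the integer stress levels, and the nested-dictionary tree layout are assumptions made for this example; the claims do not prescribe any of them.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy lexicon: word -> [(syllable, stress level)].
# Stress levels are assumed to be 0 = unstressed, 1 = primary stress.
LEXICON = {
    "table": [("ta", 1), ("ble", 0)],
    "computer": [("com", 0), ("pu", 1), ("ter", 0)],
}

TEMPLATE = "template"   # reserved key holding the index into a prosody table


def stress_pattern(word: str) -> Tuple[int, ...]:
    """Segment a word into syllables and return the stress level of each."""
    return tuple(stress for _syllable, stress in LEXICON[word.lower()])


def build_tree(patterns: Dict[Tuple[int, ...], int]) -> dict:
    """Build an n-way tree in which each edge is labeled with a stress level."""
    root: dict = {}
    for pattern, index in patterns.items():
        node = root
        for stress in pattern:
            node = node.setdefault(stress, {})
        node[TEMPLATE] = index
    return root


def lookup(tree: dict, pattern: Tuple[int, ...]) -> Optional[int]:
    """Traverse the tree; return the template index for an exact match, else None."""
    node = tree
    for stress in pattern:
        if stress not in node:
            return None
        node = node[stress]
    return node.get(TEMPLATE)


tree = build_tree({(1, 0): 0, (0, 1, 0): 1})
print(lookup(tree, stress_pattern("computer")))   # 1
```

An exact lookup like this returns `None` when no stored pattern matches; the scored and cloning-based variants in the method claims address that case.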

6. A method for generating synthesized speech, comprising the steps of:
- receiving an input text string;
  
  employing an n-way tree structure to identify prosody information for the input text string, where the tree structure is based on stress patterns such that each node of the tree structure provides a stress level that may be associated with a syllabic portion of a text string; and
  
  converting the input text string into audible speech using the prosody information.
- View Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
- - 7. The method of claim 6 further comprising the steps of:
    - segmenting the input text string into syllabic portions;
      
      determining a stress level for each syllabic portion of the input text string, thereby forming a stress pattern for the input text string;
      
      traversing the tree structure in order to identify a matching stress pattern that matches the stress pattern for the input text string; and
      
      using the matching stress pattern to retrieve the prosody information for the input text string.
  - 8. The method of claim 7 wherein the step of traversing the tree structure further comprises the steps of:
    - comparing a stress level for a syllabic portion of the input text string with a stress level for the corresponding syllabic portion in the tree structure;
      
      determining a matching score indicative of the correlation between the stress level for the syllabic portion of the input text string and the stress level for the corresponding syllabic portion in the tree structure; and
      
      using the matching score to identify a matching stress pattern that correlates to the stress pattern for the input text string.
  - 9. The method of claim 8 wherein the step of determining a matching score further comprises modifying the matching score based on at least one of the context of the syllabic portion within a transcription derived from the input text string and the context of a word within the input text string, where the word incorporates the syllabic portion used to determine the matching score.
  - 10. The method of claim 8 further comprising the steps of:
    - accumulating a matching score for each path that is traversed in the tree structure;
      
      storing a stress pattern having the lowest matching score;
      
      updating the stress pattern having the lowest matching score when the matching score for a given path is less than or equal to the lowest matching score; and
      
      ceasing to traverse a path in the tree structure when the matching score for the given path exceeds the lowest matching score.
  - 11. The method of claim 7 wherein the step of traversing the tree structure further comprises the step of constructing a stress pattern that correlates to the stress pattern of the input text string, when a matching stress pattern is not identified in the tree structure.
  - 12. The method of claim 11 wherein the step of constructing a stress pattern further comprises the steps of:
    - identifying one or more target nodes having stress patterns that correlate to the stress pattern of the input text string;
      
      cloning a stress level from an adjacent syllabic portion in the target node, when the number of syllabic portions in the target node is less than the number of syllabic portions in the input text string; and
      
      concatenating the stress level onto the stress pattern of the target node, thereby constructing a stress pattern that correlates to the stress pattern of the input text string.
  - 13. The method of claim 12 wherein the step of constructing a stress pattern further comprises the steps of:
    - determining a matching score indicative of the correlation between the stress patterns for each of the target nodes and the stress pattern of the input text string; and
      
      using the matching score to identify a target node that most closely correlates to the stress pattern of the input text string.
  - 14. The method of claim 13 further comprising the steps of:
    - retrieving the prosody information for the identified target node;
      
      cloning a portion of the prosody information that corresponds to the cloned adjacent syllabic portion of the target node; and
      
      concatenating the portion of the prosody information onto the remainder of the prosody information, thereby constructing the prosody information that corresponds to the identified target node.
  - 15. The method of claim 14 further comprising the step of converting the input text string to audible speech using the prosody information that corresponds to the identified target node.
  - 16. The method of claim 6 wherein the prosody information is further defined as pitch modification information and duration modification information.
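
The score accumulation and path abandonment recited in claims 8 and 10 can be sketched as a pruned depth-first search. This is an illustrative reading under stated assumptions: the dictionary-based tree, the flat mismatch cost of 1, and the `touched` counter are hypothetical, and the context-based score adjustment of claim 9 is left out.

```python
from typing import Iterable, List, Optional, Tuple

Pattern = Tuple[int, ...]


def build_tree(patterns: Iterable[Pattern]) -> dict:
    """Build a stress tree; a 'terminal' node marks the end of a stored pattern."""
    root = {"children": {}, "terminal": False}
    for pattern in patterns:
        node = root
        for stress in pattern:
            node = node["children"].setdefault(
                stress, {"children": {}, "terminal": False})
        node["terminal"] = True
    return root


def pruned_search(root: dict, target: Pattern,
                  mismatch: int = 1) -> Tuple[Optional[int], Optional[Pattern], int]:
    """Return (lowest score, its stress pattern, number of nodes touched)."""
    best_score: Optional[int] = None
    best_pattern: Optional[Pattern] = None
    touched = 0

    def walk(node: dict, depth: int, score: int, path: List[int]) -> None:
        nonlocal best_score, best_pattern, touched
        touched += 1
        # Cease traversing once the accumulated score exceeds the lowest so far.
        if best_score is not None and score > best_score:
            return
        if depth == len(target):
            if node["terminal"]:
                # Reached only when score <= best_score, so ties replace the stored pattern.
                best_score, best_pattern = score, tuple(path)
            return
        for stress, child in node["children"].items():
            cost = 0 if stress == target[depth] else mismatch
            walk(child, depth + 1, score + cost, path + [stress])

    walk(root, 0, 0, [])
    return best_score, best_pattern, touched


tree = build_tree([(1, 0, 0), (0, 1, 0), (0, 0, 1)])
print(pruned_search(tree, (1, 0, 0)))   # (0, (1, 0, 0), 5)
```

In this small example the exact match is found first, so the entire subtree beginning with an unstressed syllable is abandoned at its first node: 5 nodes are touched instead of the 9 in the full tree.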

17. A method for generating prosody information for use in a text-to-speech synthesizer system, comprising the steps of:
- receiving an input text string;
  
  determining a pattern of prosodic features associated with the input text string;
  
  identifying a first prosody template from a plurality of prosody templates, where each prosody template represents a pattern of prosodic features that may be associated with a text string and the first prosody template having a pattern of prosodic features that correlate to the input text string;
  
  replicating a portion of the first prosody template, when the pattern for the first prosody template is shorter than the pattern for the input text string; and
  
  concatenating the replicated portion of the first prosody template onto the pattern of the first prosody template, thereby constructing a generated prosody template that more closely correlates to the input text string.
- View Dependent Claims (18, 19, 20, 21, 22)
- - 18. The method of claim 17 further comprising the steps of using the generated prosody template to retrieve prosody information for the input text string, and converting the input text string into audible speech using the prosody information.
  - 19. The method of claim 17 wherein each prosody template is further defined as a pattern of stress levels for each syllabic portion of a text string.
  - 20. The method of claim 19 wherein the step of determining a pattern of prosodic features further comprises the steps of:
    - segmenting the input text string into syllabic portions; and
      
      determining a stress level for each syllabic portion of the input text string, thereby forming a stress pattern for the input text string.
  - 21. The method of claim 20 wherein the step of identifying a first prosody template further comprises the step of traversing an n-way tree structure in order to identify a matching pattern of prosodic features, where the tree structures are based on stress patterns such that each node of the tree structure provides a stress level that may be associated with a syllabic portion of a text string.
  - 22. The method of claim 21 wherein the step of replicating a portion of the first prosody template further comprises the steps of cloning a stress level from an adjacent syllabic portion of the matching pattern, when the number of syllabic portions in the first prosody template is less than the number of syllabic portions of the stress pattern for the input text string, and concatenating the stress level onto the matching pattern of the first prosody template.
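
The replicate-and-concatenate step of claim 17 (and the cloning of claims 12 through 14) might look like the sketch below. The claims do not say which adjacent syllable supplies the clone; this example assumes the final entry is repeated until the lengths agree. The per-syllable (pitch, duration) pairs are placeholder values, not data from the patent.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def extend_by_cloning(items: Sequence[T], target_len: int) -> List[T]:
    """Clone the last entry and concatenate it until `items` reaches `target_len`.

    Sequences that are already long enough are returned unchanged (as a list).
    """
    if not items:
        raise ValueError("cannot clone from an empty template")
    extended = list(items)
    while len(extended) < target_len:
        extended.append(extended[-1])   # replicate the adjacent (last) entry
    return extended


# A two-syllable stored pattern stretched to cover a four-syllable input.
stored_pattern = [1, 0]
print(extend_by_cloning(stored_pattern, 4))   # [1, 0, 0, 0]

# The same operation applied to per-syllable prosody values
# (hypothetical pitch and duration scale factors).
stored_prosody = [(1.1, 0.9), (0.8, 1.2)]
print(extend_by_cloning(stored_prosody, 3))   # [(1.1, 0.9), (0.8, 1.2), (0.8, 1.2)]
```

Using one helper for both the stress pattern and the prosody values keeps the two sequences aligned syllable for syllable, which is the point of cloning the corresponding prosody portion in claim 14.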

Current Assignee: Panasonic Intellectual Property Corporation of America (Panasonic Holdings Corporation)
Original Assignee: Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Inventors: Applebaum, Ted H.; Kibre, Nicholas
Granted Patent: US 6,845,358 B2
US Class Current: 704/260
CPC Class Codes: G10L 13/10 Prosody rules derived from ...
