Linguistic prosodic model-based text to speech

US 6,961,704 B1
Filed: 01/31/2003
Issued: 11/01/2005
Est. Priority Date: 01/31/2003
Status: Expired due to Term

Abstract

An arrangement is provided for text to speech processing based on linguistic prosodic models. Linguistic prosodic models are established to characterize different linguistic prosodic characteristics. When an input text is received, a target unit sequence is generated with a linguistic target that annotates target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that speech synthesized in accordance with the target unit sequence and the linguistic target has certain desired prosodic properties. A unit sequence is selected in accordance with the target unit sequence and the linguistic target based on joint cost information evaluated using established linguistic prosodic models. The selected unit sequence is used to produce synthesized speech corresponding to the input text.

47 Claims

1. A method, comprising:
- generating at least one linguistic prosodic model, each of the at least one linguistic prosodic model characterizing a corresponding linguistic prosody and being used to facilitate unit selection during text to speech processing, wherein the at least one linguistic prosodic model is generated from the recorded speech of a target speaker;
  
  receiving an input text for text to speech processing;
  
  generating, according to the input text, a target unit sequence and a linguistic target which annotates the target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that the speech synthesized in accordance with the target unit sequence and the linguistic target has certain desired prosodic properties; and
  
  producing synthesized speech using a selected unit sequence determined in accordance with the target unit sequence and the linguistic target based on an estimated joint cost;
  
  wherein estimating the joint cost comprises computing a linguistic prosody cost based on the at least one linguistic prosodic model;
  
  computing a context cost based on at least one context cost function;
  
  computing a mismatch cost based on a syllable position mismatch matrix with elements defining costs associated with different types of syllable position mismatch, a phrase position mismatch matrix with elements defining costs associated with different types of phrase position mismatch, and a stress/pitch accent mismatch matrix with elements defining costs associated with different types of stress/pitch accent mismatch;
  
  computing a concatenation cost; and
  
  combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost.
- - 2. The method according to claim 1, wherein the at least one model includes at least one of:
    - a distribution in a feature space;
      
      a function represented by one or more parameters; and
      
      a decision tree.
  - 3. The method according to claim 2, wherein the function includes a statistical function.
  - 4. The method according to claim 3, wherein the statistical function includes a Gaussian function.
  - 5. The method according to claim 1, wherein a unit includes any combination of any sequence of contiguous or non-contiguous half-phone units.
  - 6. The method according to claim 1, wherein said generating at least one linguistic prosodic model comprises:
    - generating labeled training data, wherein each training sample in the labeled training data is labeled with at least one linguistic prosody;
      
      identifying a portion of the labeled training data with at least one training sample that has a label corresponding to a distinct linguistic prosody to be modeled;
      
      extracting at least one acoustic feature from each training sample within the portion of the labeled training data;
      
      determining one or more parameters of a linguistic prosodic model based on the at least one acoustic feature, wherein the one or more parameters represent the linguistic prosodic model that characterizes the distinct linguistic prosody.
  - 7. The method according to claim 6, wherein said identifying comprises:
    - training a decision tree using the labeled training data, wherein leaf nodes of the decision tree correspond to different portions of the labeled training data;
      
      selecting one leaf node in the decision tree that corresponds to the distinct linguistic prosody to be modeled.
  - 8. The method according to claim 6, wherein said identifying comprises determining the portion of the labeled training data based on a label representing the distinct linguistic prosody to be modeled.
  - 9. The method according to claim 1, wherein said producing synthesized speech comprises:
    - receiving the target unit sequence with the linguistic target;
      
      identifying one or more candidate unit sequences, each of which comprises a plurality of units selected in accordance with the target unit sequence and the linguistic target;
      
      selecting one of the candidate unit sequences as the selected unit sequence that has a minimum joint cost; and
      
      synthesizing the speech using the selected unit sequence.
  - 10. The method according to claim 1, wherein the linguistic prosody cost includes at least one of:
    - a pitch cost;
      
      an energy cost; and
      
      a duration cost.
  - 11. The method according to claim 1, wherein the joint cost is computed as a linear combination of the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost.
  - 12. The method according to claim 11, wherein the linear combination includes any one of:
    - a summation; and
      
      a weighted sum.
  - 13. The method according to claim 1, wherein the linguistic prosodic model includes at least one of:
    - a distribution in a feature space;
      
      a function represented by one or more parameters; and
      
      a decision tree.
  - 14. The method according to claim 13, wherein the function includes a statistical function.
  - 15. The method according to claim 14, wherein the statistical function includes a Gaussian function.
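
The joint cost recited in claims 1, 11, and 12 can be illustrated with a minimal Python sketch. It assumes the three mismatch matrices are plain lookup tables and that the joint cost is a weighted sum. The category labels, penalty values, and the names `Annotation`, `make_matrix`, `mismatch_cost`, and `joint_cost` are hypothetical; the claims name the matrices but give no categories or values.

```python
from dataclasses import dataclass

# Hypothetical category labels; the claims do not enumerate them.
SYLLABLE_POS = ["initial", "medial", "final"]
PHRASE_POS = ["initial", "medial", "final"]
STRESS = ["unstressed", "stressed", "accented"]


def make_matrix(labels, penalty):
    """Build a square mismatch matrix: 0 on the diagonal, `penalty` elsewhere."""
    return {(a, b): (0.0 if a == b else penalty) for a in labels for b in labels}


# Assumed penalty values, chosen only to make the example concrete.
SYLLABLE_MISMATCH = make_matrix(SYLLABLE_POS, 1.0)
PHRASE_MISMATCH = make_matrix(PHRASE_POS, 2.0)
STRESS_MISMATCH = make_matrix(STRESS, 3.0)


@dataclass(frozen=True)
class Annotation:
    """Linguistic annotation of one unit (target or database candidate)."""
    syllable_pos: str
    phrase_pos: str
    stress: str


def mismatch_cost(target: Annotation, candidate: Annotation) -> float:
    """Sum the three matrix entries indexed by (target, candidate) categories."""
    return (SYLLABLE_MISMATCH[(target.syllable_pos, candidate.syllable_pos)]
            + PHRASE_MISMATCH[(target.phrase_pos, candidate.phrase_pos)]
            + STRESS_MISMATCH[(target.stress, candidate.stress)])


def joint_cost(prosody, context, mismatch, concatenation,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four costs; unit weights give a plain summation."""
    costs = (prosody, context, mismatch, concatenation)
    return sum(w * c for w, c in zip(weights, costs))
```

With the default weights the function reduces to the plain summation of claim 12; passing other weights gives the weighted-sum variant.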

16. A method for unit selection using at least one linguistic prosodic model, comprising:
- receiving a target unit sequence with a linguistic target, wherein the linguistic target annotates the target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that the speech synthesized in accordance with the target unit sequence and the linguistic target has certain desired prosodic properties;
  
  identifying one or more candidate unit sequences, each of which comprises a plurality of units selected in accordance with the target unit sequence and the linguistic target;
  
  estimating a joint cost associated with each of the candidate unit sequences, wherein said estimating the joint cost comprises computing a linguistic prosody cost based on the at least one linguistic prosodic model, computing a context cost based on at least one context cost function, computing a mismatch cost based on a syllable mismatch matrix with elements defining costs associated with different types of syllable mismatch, a phrase position mismatch matrix with elements defining costs associated with different types of phrase position mismatch, and a stress/pitch accent mismatch matrix with elements defining costs associated with the different types of stress/pitch accent mismatch;
  
  computing a concatenation cost;
  
  combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost; and
  
  selecting one of the candidate unit sequences to be a selected unit sequence that has a minimum joint cost.
- - 17. The method according to claim 16, wherein the linguistic prosody cost includes at least one of:
    - a pitch cost;
      
      an energy cost; and
      
      a duration cost.
  - 18. The method according to claim 16, wherein the joint cost is computed as a linear combination of the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost.
  - 19. The method according to claim 18, wherein the linear combination includes any one of:
    - a summation; and
      
      a weighted sum.
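
The selection step of claim 16 (score each candidate unit sequence, keep the one with minimum joint cost) might be sketched as follows. This is an exhaustive comparison over an explicit candidate list, which is an assumption made for brevity; the claims do not specify a search strategy. The unit layout `(phone, pitch_hz, db_index)` and the toy cost functions `pitch_distance` and `join_penalty` are hypothetical stand-ins for the claimed cost terms.

```python
from typing import Tuple

# Hypothetical unit representation for illustration only.
Unit = Tuple[str, float, int]   # (phone, pitch in Hz, index in the recorded database)
Target = Tuple[str, float]      # (phone, desired pitch in Hz)


def sequence_cost(candidate, targets, unit_cost, concat_cost) -> float:
    """Per-unit costs against the targets plus join costs between neighbours."""
    per_unit = sum(unit_cost(t, u) for t, u in zip(targets, candidate))
    joins = sum(concat_cost(a, b) for a, b in zip(candidate, candidate[1:]))
    return per_unit + joins


def select_unit_sequence(candidates, targets, unit_cost, concat_cost):
    """Return the candidate sequence with the minimum total cost."""
    return min(candidates,
               key=lambda seq: sequence_cost(seq, targets, unit_cost, concat_cost))


def pitch_distance(target: Target, unit: Unit) -> float:
    """Toy per-unit cost: absolute pitch difference."""
    return abs(target[1] - unit[1])


def join_penalty(left: Unit, right: Unit) -> float:
    """Toy concatenation cost: free if the units were adjacent in the database."""
    return 0.0 if right[2] == left[2] + 1 else 1.0
```

For example, given targets `[("h", 100.0), ("i", 120.0)]`, a candidate whose units are adjacent in the database and close in pitch would be preferred over one that matches the first target exactly but needs a costly join.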

20. A unit selection based text to speech system, comprising:
- a linguistic prosodic model generation mechanism;
  
  a text-to-speech front end capable of generating, according to an input text, a target unit sequence and a linguistic target that annotates the target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that the speech synthesized in accordance with the target sequence and the linguistic target has certain desired prosodic properties;
  
  a unit selection mechanism capable of selecting a unit sequence in accordance with the target unit sequence and the linguistic target based on an estimated joint cost wherein estimating the joint cost comprises computing a linguistic prosody cost based on the at least one linguistic prosodic model, computing a context cost based on at least one context cost function, computing a mismatch cost based on a syllable mismatch matrix with elements defining costs associated with different types of syllable mismatch, a phrase position mismatch matrix with elements defining costs associated with different types of phrase position mismatch, and a stress/pitch accent mismatch matrix with elements defining costs associated with different types of stress/pitch accent mismatch;
  
  computing a concatenation cost;
  
  combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost; and
  
  a speech synthesis mechanism capable of synthesizing speech using the selected unit sequence.
- - 21. The system according to claim 20, wherein the text-to-speech front end comprises:
    - a text normalization mechanism capable of normalizing an input text for text-to-speech processing to produce a normalized text;
      
      a linguistic analysis mechanism capable of performing linguistic analysis on the normalized text to produce the target unit sequence; and
      
      a linguistic target generation mechanism capable of generating the linguistic target with respect to the target unit sequence.
  - 22. The system according to claim 20, wherein the linguistic prosodic model generation mechanism comprises:
    - an acoustic feature extraction mechanism capable of extracting, for each linguistic prosodic model to be generated, at least one acoustic feature from a portion of labeled training data, wherein training samples included in the portion have a distinct label corresponding to a linguistic prosody to be modeled; and
      
      a model parameter estimation mechanism capable of determining one or more parameters of the linguistic prosodic model based on the at least one acoustic feature.
  - 23. The system according to claim 20, wherein the unit selection mechanism comprises:
    - a unit search mechanism capable of identifying one or more candidate unit sequences, each of which comprises a plurality of units selected in accordance with the target unit sequence and the linguistic target;
      
      a cost estimation mechanism capable of estimating a joint cost for each of the candidate unit sequences using the at least one linguistic prosodic model; and
      
      a unit sequence selection mechanism capable of selecting one of the candidate unit sequences as the selected unit sequence that has a minimum joint cost.
  - 24. The mechanism according to claim 20, wherein the linguistic prosodic model includes at least one of:
    - a distribution;
      
      a function represented by one or more parameters; and
      
      a decision tree.
  - 25. The mechanism according to claim 24, wherein the function includes a statistical function.

26. A unit selection mechanism, comprising:
- a unit search mechanism capable of identifying one or more candidate unit sequences in accordance with a target unit sequence and a linguistic target, wherein the linguistic target annotates the target unit sequence with a plurality of linguistic prosodic characteristics so that speech synthesized based on the target unit sequence and the linguistic target has certain desired prosodic properties;
  
  a cost estimation mechanism capable of estimating a joint cost, for each of the candidate unit sequences, using at least one linguistic prosodic model generated to characterize at least one linguistic prosody;
  
  wherein the cost estimation mechanism comprises a linguistic prosody cost estimator capable of computing a linguistic prosody cost associated with a candidate unit sequence based on at least some of the linguistic prosodic models, a mismatch cost estimator capable of computing a mismatch cost of the candidate unit sequence based on a syllable mismatch matrix with elements defining costs associated with syllable mismatches, a phrase position mismatch matrix with elements defining costs associated with phrase position mismatches, and a stress/pitch accent mismatch matrix with elements defining costs associated with different types of stress/pitch accent mismatch;
  
  a context cost estimator capable of computing a context cost of the candidate unit sequence based on context cost functions;
  
  a concatenation cost estimator capable of computing a concatenation cost of the candidate unit sequence;
  
  a joint cost computation mechanism capable of combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost associated with the candidate unit sequence; and
  
  a unit sequence selection mechanism capable of determining a selected unit sequence from the candidate unit sequences that best matches with the target unit sequence and the linguistic target based on the joint cost.

27. An article comprising a storage medium having stored thereon instructions that, when executed by a machine, result in the following:
- generating at least one linguistic prosodic model, each of the at least one linguistic prosodic model characterizing a corresponding linguistic prosody and being used to facilitate unit selection during text to speech processing, wherein the at least one linguistic prosodic model is generated from the speech from a target speaker;
  
  receiving an input text for text to speech processing;
  
  generating, according to the input text, a target unit sequence and a linguistic target which annotates the target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that the speech synthesized in accordance with the target unit sequence and the linguistic target has certain desired prosodic properties; and
  
  producing synthesized speech using a selected unit sequence determined in accordance with the target unit sequence and the linguistic target based on an estimated joint cost wherein estimating the joint cost comprises computing a linguistic prosody cost based on the at least one linguistic prosodic model, computing a context cost based on at least one context cost function, computing a mismatch cost based on a syllable mismatch matrix with elements defining costs associated with different types of syllable mismatch, a phrase position mismatch matrix with elements defining costs associated with different types of phrase position mismatch, and a stress/pitch accent mismatch matrix with elements defining costs associated with different types of stress/pitch accent mismatch, computing a concatenation cost; and
  
  combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost.
- - 28. The article according to claim 27, wherein the at least one model includes at least one of:
    - a distribution in a feature space;
      
      a function represented by one or more parameters; and
      
      a decision tree.
  - 29. The article according to claim 28, wherein the function includes a statistical function.
  - 30. The article according to claim 29, wherein the statistical function includes a Gaussian function.
  - 31. The article according to claim 27, wherein said generating at least one linguistic prosodic model comprises:
    - generating labeled training data, wherein each training sample in the labeled training data is labeled with at least one linguistic prosody;
      
      identifying a portion of the labeled training data with at least one training sample that has a label corresponding to a distinct linguistic prosody to be modeled;
      
      extracting at least one acoustic feature from each training sample within the portion of the labeled training data; and
      
      determining one or more parameters of a linguistic prosodic model based on the at least one acoustic feature, wherein the one or more parameters represent the linguistic prosodic model that characterizes the distinct linguistic prosody.
  - 32. The article according to claim 27, wherein said producing synthesized speech comprises:
    - receiving the target unit sequence with the linguistic target;
      
      identifying one or more candidate unit sequences, each of which comprises a plurality of units selected in accordance with the target unit sequence and the linguistic target;
      
      estimating a joint cost for each of the candidate unit sequences using the at least one linguistic prosodic model;
      
      selecting one of the candidate unit sequences as the selected unit sequence that has a minimum joint cost; and
      
      synthesizing the speech using the selected unit sequence.
  - 33. The article according to claim 27, wherein the joint cost is computed as a linear combination of the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost.
  - 34. The article according to claim 27, comprising a storage medium having stored thereon instructions for generating a linguistic prosodic model for text to speech processing that, when executed by a machine, result in the following:
    - generating labeled training data, wherein each training sample in the labeled training data is from a target speaker and is labeled with at least one linguistic prosody;
      
      identifying a portion of the labeled training data with at least one training sample that has a label corresponding to a distinct linguistic prosody to be modeled;
      
      extracting at least one acoustic feature from each training sample of the portion of the labeled training data; and
      
      determining one or more parameters of a linguistic prosodic model based on the at least one acoustic feature, wherein the one or more parameters represent the linguistic prosodic model that characterizes the distinct linguistic prosody.
  - 35. The article according to claim 34, wherein the linguistic prosodic model includes at least one of:
    - a distribution in a feature space;
      
      a function represented by one or more parameters; and
      
      a decision tree.
  - 36. The article according to claim 35, wherein the function includes a statistical function.
  - 37. The article according to claim 36, wherein the statistical function includes a Gaussian function.
  - 38. The article according to claim 34, wherein said identifying comprises:
    - training a decision tree using the labeled training data, wherein leaf nodes of the decision tree correspond to different portions of the labeled training data;
      
      selecting one leaf node in the decision tree that corresponds to the distinct linguistic prosody to be modeled.
  - 39. The article according to claim 34, wherein said identifying comprises determining the portion of the labeled training data based on a label representing the distinct linguistic prosody to be modeled.
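
The model-generation steps of claims 34 to 39 (label the data, pick the portion with a given label, extract an acoustic feature, estimate parameters) can be sketched for the Gaussian case of claim 37. The sketch assumes one scalar feature per sample, such as mean pitch, and groups samples directly by label as in claim 39. Using the Gaussian negative log-likelihood as the prosody cost is an assumption of this example, not something the claims state, and the names `train_gaussian_models` and `prosody_cost` are hypothetical.

```python
import math
from collections import defaultdict
from statistics import fmean, pvariance


def train_gaussian_models(samples, min_variance=1e-6):
    """Fit one Gaussian (mean, variance) per prosody label.

    `samples` is an iterable of (label, feature_value) pairs, where the
    feature is an acoustic measurement already extracted from the recording.
    """
    by_label = defaultdict(list)
    for label, value in samples:
        by_label[label].append(value)

    models = {}
    for label, values in by_label.items():
        mean = fmean(values)
        # Floor the variance so a single-sample label does not yield zero.
        var = max(pvariance(values, mu=mean), min_variance)
        models[label] = (mean, var)
    return models


def prosody_cost(model, value):
    """Assumed cost: negative log-likelihood of `value` under the Gaussian."""
    mean, var = model
    return 0.5 * math.log(2 * math.pi * var) + (value - mean) ** 2 / (2 * var)
```

Under this assumed cost, a candidate unit whose feature lies near the mean of the model for its target label scores lower than one far from it, so it is favoured during unit selection.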

40. An article comprising a storage medium having stored thereon instructions for unit selection using at least one linguistic prosodic model that, when executed by a machine, result in the following:
- receiving a target unit sequence with a linguistic target, wherein the linguistic target annotates the target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that the speech synthesized in accordance with the target unit sequence and the linguistic target has certain desired prosodic properties;
  
  identifying one or more candidate unit sequences, each of which comprises a plurality of units selected in accordance with the target unit sequence and the linguistic target;
  
  estimating a joint cost associated with each of the candidate unit sequences wherein said estimating the joint cost comprises computing a linguistic prosody cost based on the at least one linguistic prosodic model;
  
  computing a context cost based on at least one context cost function;
  
  computing a mismatch cost based on a syllable mismatch matrix with elements defining costs associated with different types of syllable mismatch, a phrase position mismatch matrix with elements defining costs associated with different types of phrase position mismatch, and a stress/pitch accent mismatch matrix with elements defining costs associated with different types of stress/pitch accent mismatch;
  
  computing a concatenation cost;
  
  combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost; and
  
  selecting one of the candidate unit sequences to be a selected unit sequence that has a minimum joint cost.
- - 41. The article according to claim 40, wherein the joint cost is computed as a linear combination of the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost.
  - 42. The article according to claim 40, wherein the at least one model includes at least one of:
    - a distribution in a feature space;
      
      a function represented by one or more parameters; and
      
      a decision tree.
  - 43. The article according to claim 42, wherein the function includes a statistical function.
  - 44. The article according to claim 43, wherein the statistical function includes a Gaussian function.
  - 45. The article according to claim 40, wherein the joint cost is computed as a linear combination of the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost.
  - 46. The article according to claim 45, wherein the linear combination includes any one of:
    - a summation; and
      
      a weighted sum.
  - 47. The article according to claim 40, wherein the linguistic prosody cost includes at least one of:
    - a pitch cost;
      
      an energy cost; and
      
      a duration cost.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: SpeechWorks International, Inc. (Microsoft Corporation)
Inventors: Faulkner, Daniel S.; Przezdzieci, Marek A.; Phillips, Michael S.
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Storm, Donald L.
Application Number: US10/355,296
Time in Patent Office: 1,005 Days
Field of Search: 704/268, 704/267, 704/260, 704/258
US Class Current: 704/268
CPC Class Codes: G10L 13/10 Prosody rules derived from ...
