Text-to-speech using clustered context-dependent phoneme-based units

US 6,163,769 A
Filed: 10/02/1997
Issued: 12/19/2000
Est. Priority Date: 10/02/1997
Status: Expired due to Term

Abstract

A text-to-speech system includes a storage device for storing a clustered set of context-dependent phoneme-based units of a target speaker. In one embodiment, decision trees are used wherein each decision tree based context-dependent phoneme-based unit is arranged based on context of at least one immediately preceding and succeeding phoneme. At least one of the context-dependent phoneme-based units represents other non-stored context-dependent phoneme units of similar sound due to similar contexts. A text analyzer obtains a string of phonetic symbols representative of text to be converted to speech. A concatenation module selects stored decision tree based context-dependent phoneme-based units from the set of decision tree based context-dependent phoneme-based units based on the context of the phonetic symbols and synthesizes the selected phoneme-based units to generate speech corresponding to the text.
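
As an illustration of what "context of at least one immediately preceding and succeeding phoneme" can mean in practice, the following minimal sketch expands a phonetic string into triphone or quinphone context windows. It is not taken from the patent: the function name `context_units`, the padding symbol `sil`, and the toy phoneme labels are all assumptions made for the example.

```python
from typing import List, Tuple

SIL = "sil"  # hypothetical padding symbol for utterance boundaries (an assumption)


def context_units(phones: List[str], width: int = 1) -> List[Tuple[str, ...]]:
    """Return one context window per phoneme.

    width=1 yields triphones (one neighbour on each side),
    width=2 yields quinphones (two neighbours on each side).
    """
    padded = [SIL] * width + list(phones) + [SIL] * width
    return [tuple(padded[i:i + 2 * width + 1]) for i in range(len(phones))]


# Toy phonetic string for the word "cat" (labels are illustrative only).
triphones = context_units(["k", "ae", "t"])
quinphones = context_units(["k", "ae", "t"], width=2)
```

Each window has the central phoneme in the middle position, so a stored unit can be looked up by its central phoneme and its neighbours.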

31 Claims

1. A method for generating speech from text, comprising the steps of:
- storing a set of decision tree context-dependent phoneme-based units of a target speaker, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone, wherein each context-dependent phoneme-based unit is arranged based on context of at least one immediately preceding and succeeding phoneme-based unit, and wherein one context-dependent phoneme-based unit is chosen to represent each leaf node in the decision trees;
  
  obtaining a string of phonetic symbols representative of a text to be converted to speech;
  
  selecting stored decision-tree based context-dependent phoneme-based units from the set of decision tree based context-dependent phoneme-based units based on the contexts of the phonetic symbols; and
  
  synthesizing the selected context-based phoneme-based units to generate speech corresponding to the text.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The method of claim 1 wherein the phoneme-based unit comprises a phoneme and wherein the context-dependent phoneme-based unit is a triphone, a phoneme in the context of the one immediately preceding and succeeding phonemes.
  - 3. The method of claim 1 wherein the phoneme-based unit comprises a phoneme and wherein the context-dependent phoneme-based unit comprises a quinphone, a phoneme in the context of the two immediately preceding and succeeding phonemes.
  - 4. The method of claim 1 wherein the step of storing includes storing at least two decision tree based context-dependent phoneme-based units representing other non-stored context-dependent phoneme-based units of similar sound due to similar contexts, and wherein the step of selecting includes selecting one of said at least two decision tree based context-dependent phoneme-based units to minimize a joint distortion function.
  - 5. The method of claim 4 wherein the joint distortion function comprises at least one of a HMM score, phoneme-based unit concatenation distortion and prosody mismatch distortion.
  - 6. The method of claim 1 wherein each decision tree includes:
    - a root node corresponding to one of the plurality of phoneme-based units spoken by the target speaker;
      
      leaf nodes corresponding to decision tree based context-dependent phoneme-based units; and
      
      linguistic questions to traverse the decision tree from the root node to the leaf nodes; and
      
      wherein the step of selecting includes traversing the decision trees to select the stored decision tree based context-dependent phoneme-based units.
  - 7. The method of claim 6 wherein the linguistic questions comprise complex linguistic questions.
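
The tree structure recited in claim 6 (a root per central phoneme-based unit, linguistic questions at internal nodes, stored units at the leaves) can be sketched as follows. This is a hedged illustration only: the classes `Node` and `Leaf`, the function `select_unit`, the phoneme classes, and the unit identifiers are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    unit_id: str  # identifier of the stored representative unit (hypothetical)


@dataclass
class Node:
    # A linguistic question about the (left, right) phoneme context.
    question: Callable[[str, str], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


# Toy phoneme classes used by the questions (assumed for this example).
NASALS = {"m", "n", "ng"}
STOPS = {"p", "b", "t", "d", "k", "g"}


def select_unit(tree: Union[Node, Leaf], left: str, right: str) -> str:
    """Walk from the root to a leaf by answering each question in turn."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(left, right) else node.no
    return node.unit_id


# Toy tree for the central phoneme "ae".
AE_TREE = Node(
    question=lambda left, right: right in NASALS,
    yes=Leaf("ae_before_nasal"),
    no=Node(
        question=lambda left, right: left in STOPS,
        yes=Leaf("ae_after_stop"),
        no=Leaf("ae_default"),
    ),
)

unit = select_unit(AE_TREE, "k", "n")  # "ae" in the context k_n
```

Because every context answers every question, a context that was never stored still reaches some leaf, which is one way a stored unit can stand in for non-stored units with similar contexts.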

8. An apparatus for generating speech from text, comprising:
- storage means for storing a set of decision tree based context-dependent phoneme-based units of a target speaker, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone, wherein each context-dependent phoneme-based unit is arranged based on context of at least one immediately preceding and succeeding phoneme-based unit, and wherein at least one of the context-dependent phoneme-based units represents other non-stored context-dependent phoneme-based units of similar sound due to similar contexts;
  
  a text analyzer for obtaining a string of phonetic symbols representative of a text to be converted to speech; and
  
  a concatenation module for selecting stored decision tree based context-dependent phoneme-based units from the set of decision tree based context-dependent phoneme-based units based on the context of the phonetic symbols and synthesizing the selected context-dependent phoneme-based units to generate speech corresponding to the text.
- Dependent claims (9, 10, 11, 12, 13, 14)
  - 9. The apparatus of claim 8 wherein the phoneme-based unit comprises a phoneme and wherein the context-dependent phoneme-based unit is a triphone, a phoneme in the context of the one immediately preceding and succeeding phonemes.
  - 10. The apparatus of claim 8 wherein the phoneme-based unit comprises a phoneme and wherein the context-dependent phoneme-based unit comprises a quinphone, a phoneme in the context of the two immediately preceding and succeeding phonemes.
  - 11. The apparatus of claim 8 wherein the storage means includes at least two decision tree based context-dependent phoneme-based units representing other non-stored decision tree based context-dependent phoneme-based units of similar sound due to similar context, and wherein the concatenation module selects one of said at least two decision tree based context-dependent phoneme-based units to minimize a joint distortion function.
  - 12. The apparatus of claim 11 wherein the joint distortion function comprises at least one of a HMM score, phoneme-based unit concatenation distortion and prosody mismatch distortion.
  - 13. The apparatus of claim 8 wherein each decision tree includes:
    - a root node corresponding to one of the plurality of phoneme-based units spoken by the target speaker;
      
      leaf nodes corresponding to stored decision tree based context-dependent phoneme-based units; and
      
      linguistic questions to traverse the decision tree from the root node to the leaf nodes.
  - 14. The apparatus of claim 13 wherein the linguistic questions comprise complex linguistic questions.
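
Claims 4, 5, 11 and 12 describe choosing among several stored candidates so as to minimize a joint distortion function, which may combine an HMM score, concatenation distortion and prosody mismatch. The claims do not say how the minimization is carried out; the sketch below assumes a simple dynamic-programming search over the candidate sequence. The unit tuples, the cost weights and the names `min_distortion_path`, `unit_cost` and `join_cost` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A candidate unit: (name, pitch in Hz, HMM-based penalty). Hypothetical layout.
Unit = Tuple[str, float, float]


def min_distortion_path(
    candidates: Sequence[Sequence[Unit]],
    unit_cost: Callable[[int, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> List[Unit]:
    """Pick one unit per position so the summed unit and join costs are minimal.

    Assumes every position has at least one candidate.
    """
    # best holds, for each candidate at the current position,
    # (total cost, path) of the cheapest path ending in that candidate.
    best = [(unit_cost(0, u), [u]) for u in candidates[0]]
    for i in range(1, len(candidates)):
        nxt = []
        for u in candidates[i]:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda t: t[0],
            )
            nxt.append((cost + unit_cost(i, u), path + [u]))
        best = nxt
    return min(best, key=lambda t: t[0])[1]


# Toy data: two candidates for each of two positions, plus target pitches.
candidates = [
    [("k_1", 100.0, 0.2), ("k_2", 120.0, 0.1)],
    [("ae_1", 105.0, 0.3), ("ae_2", 140.0, 0.1)],
]
target_pitch = [100.0, 110.0]

# HMM penalty plus prosody mismatch (weight 0.01 is an arbitrary assumption).
unit_cost = lambda i, u: u[2] + 0.01 * abs(u[1] - target_pitch[i])
# Concatenation distortion modelled as a pitch jump at the join (assumption).
join_cost = lambda a, b: 0.01 * abs(a[1] - b[1])

chosen = [u[0] for u in min_distortion_path(candidates, unit_cost, join_cost)]
```

In this toy case the candidates with the lowest HMM penalty are not chosen, because their pitch mismatch and join cost outweigh the saving; that trade-off is the point of a joint cost.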

15. A method for creating context dependent synthesis units of a text-to-speech system, the method comprising the steps of:
- storing input speech from a target speaker and corresponding phonetic symbols of the input speech;
  
  identifying each unique context-dependent phoneme-based unit of the input speech, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone;
  
  training a Hidden Markov Model (HMM) for each unique context-dependent phoneme-based unit based on context of at least one immediately preceding and succeeding phoneme-based units;
  
  clustering the HMMs into groups having the same central phoneme-based unit that sound similar but have different preceding or succeeding phoneme-based units; and
  
  selecting a context-dependent phoneme-based unit of each group to represent the corresponding group.
- Dependent claims (16, 17, 18, 19, 20, 21)
  - 16. The method of claim 15 wherein the step of selecting includes selecting at least two context-dependent phoneme-based units to represent at least one of the groups.
  - 17. The method of claim 15 wherein the phoneme-based unit comprises a phoneme and wherein the context-dependent phoneme-based unit is a triphone, a phoneme in the context of the one immediately preceding and succeeding phonemes.
  - 18. The method of claim 15 wherein the context-dependent phoneme-based unit comprises a phoneme and wherein the context comprises a quinphone, a phoneme in the context of the two immediately preceding and succeeding phonemes.
  - 19. The method of claim 15 wherein the step of clustering includes k-means clustering.
  - 20. The method of claim 19 wherein the step of clustering includes forming a decision tree for each central phoneme-based unit spoken by the target speaker, wherein each decision tree includes:
    - a root node corresponding to one of the plurality of phoneme-based units spoken by the target speaker;
      
      leaf nodes corresponding to clustered HMMs; and
      
      linguistic questions to traverse the decision tree from the root node to the leaf nodes.
  - 21. The method of claim 20 wherein the linguistic questions comprise complex linguistic questions.
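
The unit-creation steps of claims 15 and 19 (group units by central phoneme, cluster similar-sounding ones, keep one representative per group) can be sketched roughly as follows. This is a simplification under stated assumptions: each trained HMM is replaced by a plain feature vector, clustering is a small hand-written k-means with deterministic initialization, and the representative is the member closest to the cluster mean. None of these choices, nor the names `kmeans` and `representatives`, come from the patent.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Triphone = Tuple[str, str, str]  # (left, central, right)


def euclid(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean(points: List[Sequence[float]]) -> List[float]:
    return [sum(xs) / len(points) for xs in zip(*points)]


def kmeans(points: List[Sequence[float]], k: int, iters: int = 20) -> List[int]:
    """Plain k-means; the first k points seed the centroids (requires k <= len(points))."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: euclid(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels


def representatives(models: Dict[Triphone, Sequence[float]], k: int) -> Dict[Triphone, Triphone]:
    """Map every triphone to the representative of its cluster."""
    by_central = defaultdict(list)
    for tri in models:
        by_central[tri[1]].append(tri)  # cluster only within one central phoneme
    mapping = {}
    for tris in by_central.values():
        labels = kmeans([models[t] for t in tris], min(k, len(tris)))
        for c in set(labels):
            members = [t for t, lab in zip(tris, labels) if lab == c]
            centroid = mean([models[t] for t in members])
            rep = min(members, key=lambda t: euclid(models[t], centroid))
            for t in members:
                mapping[t] = rep
    return mapping


# Toy "models": 2-D vectors standing in for trained HMM parameters (assumption).
models = {
    ("k", "ae", "t"): (1.0, 1.0),
    ("g", "ae", "t"): (1.2, 1.0),
    ("b", "ae", "t"): (0.9, 1.1),
    ("k", "ae", "n"): (5.0, 5.0),
    ("g", "ae", "n"): (5.2, 5.1),
    ("m", "ae", "n"): (4.9, 5.0),
    ("s", "iy", "t"): (0.0, 0.0),
}
mapping = representatives(models, k=2)
```

Only the representative of each cluster would need to be stored; the mapping records which stored unit stands in for each non-stored one.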

22. An apparatus for creating context dependent synthesis phoneme-based units of a text-to-speech system, the apparatus comprising:
- means for storing input speech from a target speaker and corresponding phonetic symbols of the input speech;
  
  a training module for identifying each unique context-dependent phoneme-based unit of the input speech and training a Hidden Markov Model (HMM) for each unique context-dependent phoneme-based unit based on context of at least one immediately preceding and succeeding phoneme-based unit, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone;
  
  a clustering module for clustering the HMMs into groups having the same central phoneme-based unit that sound similar but have different preceding or succeeding phoneme-based units and selecting one context-dependent phoneme-based unit of each group to represent the corresponding group.
- Dependent claims (23, 24, 25, 26, 27, 28)
  - 23. The apparatus of claim 22 wherein the clustering module selects at least two context-dependent phoneme-based units to represent at least one of the groups.
  - 24. The apparatus of claim 22 wherein the phoneme-based unit comprises a phoneme and wherein the context-dependent phoneme-based unit is a triphone, a phoneme in the context of the one immediately preceding and succeeding phonemes.
  - 25. The apparatus of claim 22 wherein the phoneme-based unit comprises a phoneme and wherein the context-dependent phoneme-based unit comprises a quinphone, a phoneme in the context of the two immediately preceding and succeeding phonemes.
  - 26. The apparatus of claim 22 wherein the clustering module clusters HMMs using k-means clustering.
  - 27. The apparatus of claim 26 wherein the clustering module forms a decision tree for each central phoneme-based unit spoken by the target speaker, wherein each decision tree includes:
    - a root node corresponding to one of the plurality of phoneme-based units spoken by the target speaker;
      
      leaf nodes corresponding to clustered HMMs; and
      
      linguistic questions to traverse the decision tree from the root node to the leaf nodes.
  - 28. The apparatus of claim 27 wherein the linguistic questions comprise complex linguistic questions.

29. A method for generating speech from text, comprising the steps of:
- storing a set of HMM context-dependent phoneme-based units of a target speaker, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone, wherein each HMM context-dependent phoneme-based unit is arranged based on context of at least one immediately preceding and succeeding phoneme-based unit, and wherein at least one of the HMM context-dependent phoneme-based units represents other non-stored HMM context-dependent phoneme-based units of similar sound due to context;
  
  obtaining a string of phonetic symbols representative of a text to be converted to speech;
  
  selecting stored HMM context-dependent phoneme-based units from the set of HMM context-dependent phoneme-based units based on the context of the phonetic symbols; and
  
  synthesizing the selected HMM context-dependent phoneme-based units to generate speech corresponding to the text.
- Dependent claims (30, 31)
  - 30. The method of claim 29 wherein the phoneme-based unit comprises a phoneme and wherein the context-dependent phoneme-based unit is a triphone.
  - 31. The method of claim 29 wherein the phoneme-based unit comprises a phoneme and wherein the context-dependent phoneme-based unit comprises a quinphone.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Huang, Xuedong D.; Acero, Alejandro; Hon, Hsiao-Wuen
Primary Examiner(s): Knepper, David D.
Application Number: US08/949,138
Time in Patent Office: 1,174 Days
Field of Search: 704/243-245, 704/255-257, 704/258, 704/260, 704/266-269
US Class Current: 704/260
CPC Class Codes: G10L 13/07 Concatenation rules
