Text-to-speech using clustered context-dependent phoneme-based units
First Claim
1. A method for generating speech from text, comprising the steps of:
- storing a set of decision tree context-dependent phoneme-based units of a target speaker, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone, wherein each context-dependent phoneme-based unit is arranged based on context of at least one immediately preceding and succeeding phoneme-based unit, and wherein one context-dependent phoneme-based unit is chosen to represent each leaf node in the decision trees;
obtaining a string of phonetic symbols representative of a text to be converted to speech;
selecting stored decision-tree based context-dependent phoneme-based units from the set of decision tree based context-dependent phoneme-based units based on the contexts of the phonetic symbols; and
synthesizing the selected context-based phoneme-based units to generate speech corresponding to the text.
2 Assignments
0 Petitions
Accused Products
Abstract
A text-to-speech system includes a storage device for storing a clustered set of context-dependent phoneme-based units of a target speaker. In one embodiment, decision trees are used wherein each decision tree based context-dependent phoneme-based unit is arranged based on context of at least one immediately preceding and succeeding phoneme. At least one of the context-dependent phoneme-based units represents other non-stored context-dependent phoneme units of similar sound due to similar contexts. A text analyzer obtains a string of phonetic symbols representative of text to be converted to speech. A concatenation module selects stored decision tree based context-dependent phoneme-based units from the set decision tree based context-dependent phoneme-based units based on the context of the phonetic symbols and synthesizes the selected phoneme-based units to generate speech corresponding to the text.
-
Citations
31 Claims
-
1. A method for generating speech from text, comprising the steps of:
-
storing a set of decision tree context-dependent phoneme-based units of a target speaker, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone, wherein each context-dependent phoneme-based unit is arranged based on context of at least one immediately preceding and succeeding phoneme-based unit, and wherein one context-dependent phoneme-based unit is chosen to represent each leaf node in the decision trees; obtaining a string of phonetic symbols representative of a text to be converted to speech; selecting stored decision-tree based context-dependent phoneme-based units from the set of decision tree based context-dependent phoneme-based units based on the contexts of the phonetic symbols; and synthesizing the selected context-based phoneme-based units to generate speech corresponding to the text. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. An apparatus for generating speech from text, comprising:
-
storage means for storing a set of decision tree based context-dependent phoneme-based units of a target speaker, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone, wherein each context-dependent phoneme-based unit is arranged based on context of at least one immediately preceding and succeeding phoneme-based unit, and wherein at least one of the context-dependent phoneme-based units represents other non-stored context-dependent phoneme-based units of similar sound due to similar contexts; a text analyzer for obtaining a string of phonetic symbols representative of a text to be converted to speech; and a concatenation module for selecting stored decision tree base context-dependent phoneme-based units from the set of decision tree based context-dependent phoneme-based units based on the context of the phonetic symbols and synthesizing the selected context-dependent phoneme-based units to generate speech corresponding to the text. - View Dependent Claims (9, 10, 11, 12, 13, 14)
-
-
15. A method for creating context dependent synthesis units of a text-to-speech system, the method comprising the steps of:
-
storing input speech from a target speaker and corresponding phonetic symbols of the input speech; identifying each unique context-dependent phoneme-based unit of the input speech, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone; training a Hidden Markov Model (HMM) for each unique context-dependent phoneme-based unit based on context of at least one immediately preceding and succeeding phoneme-based units; clustering the HMMs into groups having the same central phoneme-based unit that sound similar but have different preceding or succeeding phoneme-based units; and selecting a context-dependent phoneme-based unit of each group to represent the corresponding group. - View Dependent Claims (16, 17, 18, 19, 20, 21)
-
-
22. An apparatus for creating context dependent synthesis phoneme-based units of a text-to-speech system, the method comprising the steps of:
-
means for storing input speech from a target speaker and corresponding phonetic symbols of the input speech; a training module for identifying each unique context-dependent phoneme-based unit of the input speech and training a Hidden Markov Model (HMM) for each unique context-dependent phoneme-based unit based on context of at least one immediately preceding and succeeding phoneme-based unit, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone; a clustering module for clustering the HMMs into groups having the same central phoneme-based unit that sound similar but have different preceding or succeeding phoneme-based units and selecting one of context-dependent phoneme-based unit of each group to represent the corresponding group. - View Dependent Claims (23, 24, 25, 26, 27, 28)
-
-
29. A method for generating speech from text, comprising the steps of:
-
storing a set of HMM context-dependent phoneme-based units of a target speaker, wherein a central phoneme-based unit is selected from a group consisting of a phoneme and a diphone, wherein each HMM context-dependent phoneme-based unit is arranged based on context of at least one immediately preceding and succeeding phoneme-based unit, and wherein at least one of the HMM context-dependent phoneme-based units represents other non-stored HMM context-dependent phoneme-based units of similar sound due to context; obtaining a string of phonetic symbols representative of a text to be converted to speech; selecting stored HMM context-dependent phoneme-based units from the set of HMM context-dependent phoneme-based units based on the context of the phonetic symbols; and synthesizing the selected HMM context-dependent phoneme-based units to generate speech corresponding to the text. - View Dependent Claims (30, 31)
-
Specification