Methods for aligning expressive speech utterances with text and systems therefor
First Claim
1. A method for aligning a sequence of expressive speech utterances with corresponding text, the method comprising:
processing a speech signal embodying the sequence of expressive speech utterances, the speech utterances being pronounced according to defined prosodic rules;
marking the speech signal with a pitch marker at a predetermined point in a cycle of the speech signal, the pitch marker indicating a pitch change in the speech signal and the speech signal is additionally marked with at least one further pitch marker at the same predetermined point in a further cycle of the speech signal;
refining placement of the pitch marker resulting in a pitch marked speech signal;
providing a sequence of prosodically marked text;
predicting an acoustic target value associated with a first phonetic subunit associated with the sequence of prosodically marked text;
retrieving, a plurality of segmental speech units based on the acoustic target value;
selecting, a segmental speech unit of the plurality of segmental speech units as a first selected segmental speech unit;
employing a first hierarchical mixture of experts (HME) and a second HME, to relate a sequence of selected segmental speech units, including the first selected segmental speech unit, to the sequence of prosodically marked text, wherein the second HME is trained with the first HME, and the sequence of selected segmental speech units is evaluated by the first HME before being evaluated by the second HME;
aligning the sequence of prosodically marked text with the pitch marked speech signal into an aligned speech signal; and
converting the aligned speech signal into an outputted speech signal sequence.
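The marking and refining steps of the claim can be illustrated with a minimal sketch. It assumes the "predetermined point" is the maximum of each cycle (one of the locations the abstract allows), that the pitch period is known and constant, and that refinement means snapping a provisional marker to the nearest local peak. The function names, the fixed period, and the search radius are hypothetical and do not come from the patent.

```python
import math

def coarse_markers(n_samples, period, offset):
    """Place one provisional marker per cycle, at the same offset into each cycle."""
    return list(range(offset, n_samples, period))

def refine_markers(signal, markers, radius):
    """Move each marker to the largest sample within +/- radius samples of it."""
    refined = []
    for m in markers:
        lo = max(0, m - radius)
        hi = min(len(signal), m + radius + 1)
        refined.append(max(range(lo, hi), key=lambda i: signal[i]))
    return refined

# Synthetic stand-in for a voiced speech signal: a sine with a 40-sample period.
PERIOD = 40
signal = [math.sin(2 * math.pi * n / PERIOD) for n in range(200)]

# Provisional markers sit slightly before each cycle maximum (assumed offset of 8).
provisional = coarse_markers(len(signal), PERIOD, 8)
pitch_marks = refine_markers(signal, provisional, radius=5)
```

Because every marker is refined toward the same feature of its cycle, the spacing between consecutive markers tracks the local pitch period, which is what makes the markers usable as alignment anchors.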
Abstract
A system-effected method for synthesizing speech, or recognizing speech including a sequence of expressive speech utterances. The method can be computer-implemented and can include system-generating a speech signal embodying the sequence of expressive speech utterances. Other possible steps include: system-marking the speech signal with a pitch marker indicating a pitch change at or near a first zero-amplitude crossing point of the speech signal following a glottal closure point, at a minimum, at a maximum, or at another location; system-marking the speech signal with at least one further pitch marker; system-aligning a sequence of prosodically marked text with the pitch-marked speech signal according to the pitch markers; and system-outputting the aligned text or the aligned speech signal, respectively. Computerized systems and stored programs for implementing method embodiments of the invention are also disclosed.
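As a rough illustration of the first marker location named in the abstract, the sketch below places a marker at or near the first zero-amplitude crossing that follows each glottal closure point. It assumes the glottal closure indices are already available from some separate detector, and it picks whichever of the two samples straddling the crossing is closer to zero. The function name and the sample data are hypothetical.

```python
def first_zero_crossing_after(signal, start):
    """Return the index of the sample nearest the first zero crossing at or after start.

    Returns None if the signal never changes sign after start.
    """
    for i in range(start, len(signal) - 1):
        a, b = signal[i], signal[i + 1]
        # A crossing occurs at an exact zero or where adjacent samples differ in sign.
        if a == 0 or (a < 0) != (b < 0):
            return i if abs(a) <= abs(b) else i + 1
    return None

# Toy waveform and assumed glottal closure indices (both made up for illustration).
waveform = [4, 3, 1, -2, -3, -1, 2, 3, 1, -3]
glottal_closures = [0, 4, 7]

zero_cross_marks = [first_zero_crossing_after(waveform, g) for g in glottal_closures]
```

Choosing the nearer of the two straddling samples is one way to read "at or near"; interpolating a fractional crossing time would be another.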
33 Citations
12 Claims
Specification