Methods for aligning expressive speech utterances with text and systems therefor

US 10,453,479 B2
Filed: 09/21/2012
Issued: 10/22/2019
Est. Priority Date: 09/23/2011
Status: Active Grant

Abstract

A system-effected method for synthesizing speech, or recognizing speech, including a sequence of expressive speech utterances. The method can be computer-implemented and can include system-generating a speech signal embodying the sequence of expressive speech utterances. Other possible steps include: system-marking the speech signal with a pitch marker indicating a pitch change at or near a first zero amplitude crossing point of the speech signal following a glottal closure point, at a minimum, at a maximum or at another location; system-marking the speech signal with at least one further pitch marker; system-aligning a sequence of prosodically marked text with the pitch-marked speech signal according to the pitch markers; and system-outputting the aligned text or the aligned speech signal, respectively. Computerized systems and stored programs for implementing method embodiments of the invention are also disclosed.
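
The marking step described in the abstract can be illustrated with a minimal sketch. It assumes the glottal closure points are already known (the `closure_points` list is hypothetical input; the abstract does not say how they are found) and places each pitch marker at the first sign change of the signal at or after a closure point. The toy sine signal, the sample rate, and the function names are illustrative assumptions, not taken from the patent.

```python
import math


def first_zero_crossing_after(signal, start):
    """Return the first index i >= start where the signal is zero or changes
    sign between sample i and sample i + 1; return None if there is none."""
    for i in range(start, len(signal) - 1):
        if signal[i] == 0.0 or signal[i] * signal[i + 1] < 0.0:
            return i
    return None


def mark_pitch(signal, closure_points):
    """Place one pitch marker per closure point, at the same predetermined
    point of each cycle: the first zero crossing following the closure."""
    markers = []
    for closure in closure_points:
        idx = first_zero_crossing_after(signal, closure)
        if idx is not None:
            markers.append(idx)
    return markers


# Toy stand-in for voiced speech: a 100 Hz sine sampled at 8 kHz
# (80 samples per cycle), offset by half a sample so that zero
# crossings fall between samples rather than exactly on one.
SAMPLE_RATE = 8000
F0 = 100.0
signal = [math.sin(2 * math.pi * F0 * (n + 0.5) / SAMPLE_RATE) for n in range(400)]

# Hypothetical glottal closure points, one per cycle (assumed given).
closure_points = [20, 100, 180, 260]

markers = mark_pitch(signal, closure_points)          # [39, 119, 199, 279]
periods = [b - a for a, b in zip(markers, markers[1:])]  # [80, 80, 80]
f0_estimates = [SAMPLE_RATE / p for p in periods]        # [100.0, 100.0, 100.0]
print(markers, periods, f0_estimates)
```

Because every marker sits at the same point of its cycle, the spacing between consecutive markers gives the local pitch period directly. Real speech would also need the placement-refinement step named in claim 1, which this sketch does not attempt.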

12 Claims

1. A method for aligning a sequence of expressive speech utterances with corresponding text, the method comprising:
- processing a speech signal embodying the sequence of expressive speech utterances, the speech utterances being pronounced according to defined prosodic rules;
- marking the speech signal with a pitch marker at a predetermined point in a cycle of the speech signal, the pitch marker indicating a pitch change in the speech signal and the speech signal is additionally marked with at least one further pitch marker at the same predetermined point in a further cycle of the speech signal;
- refining placement of the pitch marker resulting in a pitch marked speech signal;
- providing a sequence of prosodically marked text;
- predicting an acoustic target value associated with a first phonetic subunit associated with the sequence of prosodically marked text;
- retrieving, a plurality of segmental speech units based on the acoustic target value;
- selecting, a segmental speech unit of the plurality of segmental speech units as a first selected segmental speech unit;
- employing a first hierarchical mixture of experts (HME) and a second HME, to relate a sequence of selected segmental speech units, including the first selected segmental speech unit, to the sequence of prosodically marked text, wherein the second HME is trained with the first HME, and the sequence of selected segmental speech units is evaluated by the first HME before being evaluated by the second HME;
- aligning the sequence of prosodically marked text with the pitch marked speech signal into an aligned speech signal; and
- converting the aligned speech signal into an outputted speech signal sequence.
2. A method according to claim 1, further comprising operating the first hierarchical mixture of experts recursively to reduce an error signal and enable the first hierarchical mixture of experts to learn.
3. A method according to claim 1, further comprising training the second HME after training of the first HME.
4. A method according to claim 1, further comprising refining an output of the first HME with the second HME.
5. A computerized speech synthesizer, comprising non-transitory computer-readable media and a program stored in the non-transitory computer-readable media, the program including instructions for performing the method according to claim 1.
6. The computerized speech synthesizer of claim 5, wherein the program further comprises instructions for operating the first hierarchical mixture of experts recursively to reduce an error signal and enable the first hierarchical mixture of experts to learn.
7. The computerized speech synthesizer of claim 5, wherein the program further comprises instructions for training the second HME after training of the first HME.
8. The computerized speech synthesizer of claim 5, wherein the program further comprises instructions for refining an output of the first HME with the second HME.
9. Non-transitory computer media storing at least one program including instructions for performing a method according to claim 1.
10. The non-transitory computer media of claim 9, wherein the at least one program further comprises instructions for operating the first hierarchical mixture of experts recursively to reduce an error signal and enable the first hierarchical mixture of experts to learn.
11. The non-transitory computer media of claim 9, wherein the at least one program further comprises instructions for training the second HME after training of the first HME.
12. The non-transitory computer media of claim 9, wherein the at least one program further comprises instructions for refining an output of the first HME with the second HME.
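
The two-stage HME arrangement in claims 1 to 4 can be sketched on toy data. This is an illustration under stated assumptions, not the patented implementation: each HME here is a small depth-2 binary tree with linear experts, the scalar input `x` stands in for a hypothetical feature derived from prosodically marked text, and the target is an arbitrary toy function. HMEs are classically trained with expectation-maximization; for brevity this sketch swaps in finite-difference gradient descent with step halving. Treating "refining" as fitting the residual of the first HME is likewise an assumption.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hme_predict(p, x):
    """Depth-2 binary HME over a flat list of 14 parameters:
    p[0:2] top gate, p[2:6] two branch gates, p[6:14] four linear experts."""
    top = sigmoid(p[0] + p[1] * x)
    out = 0.0
    for j, branch_weight in enumerate((top, 1.0 - top)):
        gate = sigmoid(p[2 + 2 * j] + p[3 + 2 * j] * x)
        for k, leaf_weight in enumerate((gate, 1.0 - gate)):
            e = 6 + 2 * (2 * j + k)  # offset of this expert's (bias, slope)
            out += branch_weight * leaf_weight * (p[e] + p[e + 1] * x)
    return out


def mse(p, xs, ts):
    return sum((hme_predict(p, x) - t) ** 2 for x, t in zip(xs, ts)) / len(xs)


def init_params(rng):
    # Small random gate weights break symmetry between branches; zero
    # expert weights make the initial prediction exactly 0.
    return [rng.uniform(-0.5, 0.5) for _ in range(6)] + [0.0] * 8


def train(p, xs, ts, iters=300, step=0.5, eps=1e-5):
    """Repeatedly reduce the error signal (assumed training procedure).
    A step is kept only if it lowers the loss; otherwise the step is halved."""
    p = list(p)
    loss = mse(p, xs, ts)
    for _ in range(iters):
        grad = []
        for i in range(len(p)):
            up = p[:i] + [p[i] + eps] + p[i + 1:]
            down = p[:i] + [p[i] - eps] + p[i + 1:]
            grad.append((mse(up, xs, ts) - mse(down, xs, ts)) / (2 * eps))
        trial = [w - step * g for w, g in zip(p, grad)]
        trial_loss = mse(trial, xs, ts)
        if trial_loss < loss:
            p, loss = trial, trial_loss
        else:
            step *= 0.5
    return p, loss


rng = random.Random(0)
xs = [i / 10.0 - 1.0 for i in range(21)]  # hypothetical text-derived feature
ts = [abs(x) for x in xs]                 # hypothetical acoustic target

# Stage 1: train the first HME on the targets.
p1_init = init_params(rng)
initial_loss = mse(p1_init, xs, ts)
p1, loss1 = train(p1_init, xs, ts)

# Stage 2: train the second HME afterwards, on what the first one got wrong.
residuals = [t - hme_predict(p1, x) for x, t in zip(xs, ts)]
p2, loss2 = train(init_params(rng), xs, residuals)


def cascade(x):
    """Evaluate the first HME, then refine its output with the second."""
    first = hme_predict(p1, x)
    return first + hme_predict(p2, x)


cascade_loss = sum((cascade(x) - t) ** 2 for x, t in zip(xs, ts)) / len(xs)
print(initial_loss, loss1, cascade_loss)
```

Because the second HME starts from a zero prediction and only accepts loss-reducing steps, the cascade in this sketch can never be worse on the training data than the first HME alone. How the patent's HMEs are structured, what features they consume, and how they are trained is defined by the specification, not by this example.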

Current Assignee: Lessac Technologies, Inc.
Original Assignee: Lessac Technologies, Inc.
Inventors: Wilhelms-Tricarico, Reiner; Mottershead, Brian; Nitisaroj, Rattima; Baumgartner, Michael; Reichenbach, John B.; Marple, Gary A.
Primary Examiner(s): Wozniak, James S
Application Number: US13/624,175
Publication Number: US 20130262096A1
Time in Patent Office: 2,587 Days
Field of Search: 704232, 704236, 704243, 704258, 704259, 704261
CPC Class Codes:
- G10L 13/04   Details of speech synthesis...
- G10L 13/08   Text analysis or generation...
- G10L 15/1807   using prosody or stress
- G10L 25/30   using neural networks
- G10L 25/75   for modelling vocal tract p...
- G10L 25/90   Pitch determination of spee...
