Unnatural prosody detection in speech synthesis
Abstract
Described is a technology by which synthesized speech generated from text is evaluated against a prosody model (trained offline) to determine whether the speech will sound unnatural. If so, the speech is regenerated with modified data. The evaluation and regeneration may be iterative until deemed natural sounding. For example, text is built into a lattice that is then (e.g., Viterbi) searched to find a best path. The sections (e.g., units) of data on the path are evaluated via a prosody model. If the evaluation deems a section to correspond to unnatural prosody, that section is replaced, e.g., by modifying/pruning the lattice and re-performing the search. Replacement may be iterative until all sections pass the evaluation. Unnatural prosody detection may be biased such that during evaluation, unnatural prosody is falsely detected at a higher rate relative to a rate at which unnatural prosody is missed.
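The biased detection mentioned at the end of the abstract can be illustrated with a small sketch. The abstract does not say how the bias is achieved; the code below assumes, purely for illustration, that the prosody model emits a per-section "unnaturalness" score and that the bias comes from choosing a low decision threshold. The function names, the score scale, and the sample data are all hypothetical.

```python
def error_rates(scores, labels, threshold):
    """Return (false_alarm_rate, miss_rate) for a given threshold.

    scores: hypothetical model outputs, higher = more likely unnatural.
    labels: True where the section is truly unnatural.
    A section is flagged as unnatural when its score >= threshold.
    """
    flagged = [s >= threshold for s in scores]
    n_unnatural = sum(labels)
    n_natural = len(labels) - n_unnatural
    misses = sum(1 for f, y in zip(flagged, labels) if y and not f)
    false_alarms = sum(1 for f, y in zip(flagged, labels) if f and not y)
    return false_alarms / n_natural, misses / n_unnatural


def pick_biased_threshold(scores, labels, max_miss_rate):
    """Return the highest threshold whose miss rate is <= max_miss_rate."""
    for t in sorted(set(scores), reverse=True):
        _, miss = error_rates(scores, labels, t)
        if miss <= max_miss_rate:
            return t
    return min(scores)  # flag everything if nothing else qualifies


# Made-up labelled scores: five natural sections, three unnatural ones.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.5, 0.7, 0.9]
labels = [False, False, False, False, False, True, True, True]

threshold = pick_biased_threshold(scores, labels, max_miss_rate=0.0)
false_alarm_rate, miss_rate = error_rates(scores, labels, threshold)
```

With this toy data the chosen threshold flags one natural section in order to miss no unnatural ones, so the false-detection rate exceeds the miss rate, which is the direction of bias the abstract describes.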
20 Claims
1. A computer-readable medium having computer-executable instructions, which when executed perform steps, comprising:
- evaluating at least one section of data corresponding to speech synthesized from text via a prosody model that detects unnatural prosody; and
- for each section, replacing that section with another section if the evaluation deems that section to correspond to unnatural prosody.
Dependent claims: 2, 3, 4, 5, 6
7. In a computing environment, a system comprising:
- a database containing data corresponding to speech;
- a search mechanism coupled to the database that searches for a best path through a lattice built from input data, the best path corresponding to speech data; and
- a model coupled to the search mechanism that detects any unnatural speech provided from the search mechanism, and when detected modifies the lattice to run at least one additional search via the search mechanism without having the unnatural speech again provided by the search mechanism.
Dependent claims: 8, 9, 10, 11, 12, 13, 14
15. In a computing environment, a system comprising:
- (a) accessing a data store to find speech units corresponding to text and building a current lattice representing the speech units and transitions between the speech units;
- (b) searching the current lattice to determine a best path through the current lattice;
- (c) evaluating data corresponding to the best path speech units against a prosody model to detect unnatural prosody, and if no unnatural prosody is detected or an iteration limit is reached, continuing to step (d), or if unnatural prosody is detected and the iteration limit is not reached, modifying the lattice at each section corresponding to the unnatural prosody into a modified current lattice so that a different best path will be determined upon a subsequent search, and returning to step (b); and
- (d) processing the speech units to generate a speech waveform.
Dependent claims: 16, 17, 18, 19, 20
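Steps (b) and (c) of claim 15 can be sketched as a search, evaluate, and prune loop. This is a minimal illustration under stated assumptions, not the patented implementation: the lattice is modelled as one list of candidate units per position, `target_cost`, `join_cost`, and `is_unnatural` are hypothetical stand-ins for the unit costs and the prosody model, and "modifying the lattice" is assumed to mean removing the flagged unit from its position. Step (a) is replaced by a hand-built toy lattice, and step (d), waveform generation, is omitted.

```python
def viterbi(lattice, target_cost, join_cost):
    """Return the lowest-cost path, choosing one unit per lattice position."""
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (target_cost[u], [u]) for u in lattice[0]}
    for column in lattice[1:]:
        nxt = {}
        for u in column:
            prev, (cost, path) = min(
                best.items(), key=lambda kv: kv[1][0] + join_cost(kv[0], u)
            )
            nxt[u] = (cost + join_cost(prev, u) + target_cost[u], path + [u])
        best = nxt
    return min(best.values(), key=lambda cp: cp[0])[1]


def select_units(lattice, target_cost, join_cost, is_unnatural, max_iters=5):
    """Re-search up to max_iters times, pruning units flagged as unnatural."""
    lattice = [list(col) for col in lattice]  # work on a copy
    path = viterbi(lattice, target_cost, join_cost)          # step (b)
    for _ in range(max_iters):
        flagged = [i for i, u in enumerate(path) if is_unnatural(u)]  # step (c)
        if not flagged:
            break
        pruned = False
        for i in flagged:
            if len(lattice[i]) > 1:       # never empty a position entirely
                lattice[i].remove(path[i])
                pruned = True
        if not pruned:
            break                         # nothing left to swap in
        path = viterbi(lattice, target_cost, join_cost)      # back to step (b)
    return path


# Toy data (all values made up): three positions, unit "b1" sounds unnatural.
toy_lattice = [["a1", "a2"], ["b1", "b2"], ["c1"]]
toy_target = {"a1": 0.1, "a2": 0.5, "b1": 0.2, "b2": 0.3, "c1": 0.1}
toy_join = lambda u, v: 0.0 if u[-1] == v[-1] else 0.2

first_pass = select_units(toy_lattice, toy_target, toy_join,
                          lambda u: u == "b1", max_iters=0)
final_path = select_units(toy_lattice, toy_target, toy_join,
                          lambda u: u == "b1")
```

With `max_iters=0` the loop never runs, so the first best path is returned even though it contains the flagged unit; this mirrors the claim's rule that processing continues to step (d) once the iteration limit is reached. With the default limit, the flagged unit is pruned and the second search selects a different unit at that position.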
Specification