Unnatural prosody detection in speech synthesis
Abstract
Described is a technology by which synthesized speech generated from text is evaluated against a prosody model (trained offline) to determine whether the speech will sound unnatural. If so, the speech is regenerated with modified data. The evaluation and regeneration may be iterative until deemed natural sounding. For example, text is built into a lattice that is then (e.g., Viterbi) searched to find a best path. The sections (e.g., units) of data on the path are evaluated via a prosody model. If the evaluation deems a section to correspond to unnatural prosody, that section is replaced, e.g., by modifying/pruning the lattice and re-performing the search. Replacement may be iterative until all sections pass the evaluation. Unnatural prosody detection may be biased such that during evaluation, unnatural prosody is falsely detected at a higher rate relative to a rate at which unnatural prosody is missed.
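The search, evaluate, and prune loop described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the lattice is assumed to be a list of candidate lists (one per position), and `target_cost`, `join_cost`, `is_unnatural`, and `max_iters` are hypothetical names introduced here. Keeping at least one candidate per position is also an assumption of the sketch.

```python
from typing import Callable, List

Unit = str  # hypothetical: a candidate speech unit identified by a name


def viterbi(lattice: List[List[Unit]],
            target_cost: Callable[[int, Unit], float],
            join_cost: Callable[[Unit, Unit], float]) -> List[Unit]:
    """Return the lowest-cost path (one unit per position) through the lattice."""
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (target_cost(0, u), [u]) for u in lattice[0]}
    for pos in range(1, len(lattice)):
        nxt = {}
        for u in lattice[pos]:
            # Cheapest way to reach u from any unit at the previous position.
            prev_cost, prev_path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best.values()),
                key=lambda t: t[0],
            )
            nxt[u] = (prev_cost + target_cost(pos, u), prev_path + [u])
        best = nxt
    return min(best.values(), key=lambda t: t[0])[1]


def search_and_prune(lattice: List[List[Unit]],
                     target_cost: Callable[[int, Unit], float],
                     join_cost: Callable[[Unit, Unit], float],
                     is_unnatural: Callable[[int, Unit], bool],
                     max_iters: int = 5) -> List[Unit]:
    """Repeat search and pruning until no unit is flagged or max_iters is hit."""
    lattice = [list(column) for column in lattice]  # work on a copy
    path = viterbi(lattice, target_cost, join_cost)
    for _ in range(max_iters):
        flagged = [(pos, u) for pos, u in enumerate(path) if is_unnatural(pos, u)]
        if not flagged:
            break  # every unit on the path passed the prosody check
        pruned = False
        for pos, u in flagged:
            # Assumption of this sketch: never empty a position entirely.
            if len(lattice[pos]) > 1:
                lattice[pos].remove(u)
                pruned = True
        if not pruned:
            break  # nothing left to prune, so another search would not help
        path = viterbi(lattice, target_cost, join_cost)
    return path
```

With a two-position lattice such as `[["a1", "a2"], ["b1", "b2"]]`, a detector that flags only `"a1"` causes that unit to be pruned and the second search to pick the next-cheapest candidate at that position.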
15 Claims
1. At least one computer storage medium having computer-executable instructions that, when executed by a computer, cause the computer to perform a method comprising:
building, based on text, a lattice comprising speech units, wherein each speech unit in the lattice is obtained from a database comprising a plurality of candidate speech units;
finding, by the computer in the lattice, a sequence of speech units that conforms to the text;
pruning, by the computer from the sequence of speech units, any of the speech units in the sequence that, based on likelihood ratios and a prosody model that was trained using actual speech, are detected to have unnatural prosody, where the prosody model exhibits a bias toward detecting unnatural prosody;
iterating, by the computer, the finding and the pruning until completion that is based on a condition selected from a group of conditions comprising:
1) every speech unit in the sequence corresponding to natural prosody, and
2) iterating a maximum number of iterations.
Dependent claims: 2, 3, 4, 5
6. A method comprising:
building, by a computer and based on text, a lattice comprising speech units, wherein each speech unit in the lattice is obtained from a database comprising a plurality of candidate speech units;
finding, by the computer in the lattice, a sequence of speech units that conforms to the text;
pruning, by the computer from the sequence of speech units, any of the speech units in the sequence that, based on likelihood ratios and a prosody model that was trained using actual speech, are detected to have unnatural prosody, where the prosody model exhibits a bias toward detecting unnatural prosody;
iterating, by the computer, the finding and the pruning until completion that is based on a condition selected from a group of conditions comprising:
1) every speech unit in the sequence corresponding to natural prosody, and
2) iterating a maximum number of iterations.
Dependent claims: 7, 8, 9, 10
11. A system comprising:
a computer;
a text analyzer implemented at least in part by the computer and configured for building, based on text, a lattice comprising speech units, wherein each speech unit in the lattice is obtained from a database comprising a plurality of candidate speech units;
a search mechanism implemented at least in part by the computer and configured for finding, in the lattice, a sequence of speech units that conforms to the text;
a pruning mechanism implemented at least in part by the computer and configured for pruning, from the sequence of speech units, any of the speech units in the sequence that, based on likelihood ratios and a prosody model that was trained using actual speech, are detected to have unnatural prosody, where the prosody model exhibits a bias toward detecting unnatural prosody;
a detection mechanism implemented at least in part by the computer and configured for iterating the finding and the pruning until completion that is based on a condition selected from a group of conditions comprising:
1) every speech unit in the sequence corresponding to natural prosody, and
2) iterating a maximum number of iterations.
Dependent claims: 12, 13, 14, 15
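The independent claims detect unnatural prosody "based on likelihood ratios" with a bias toward detection. One way such a biased test might look is sketched below. Everything specific here is an assumption made for illustration and is not taken from the claims: a single scalar prosody feature, Gaussian "natural" and "unnatural" models given as hypothetical `(mean, std)` pairs, and a positive `threshold` that supplies the bias.

```python
import math
from typing import Tuple

Gaussian = Tuple[float, float]  # hypothetical model: (mean, standard deviation)


def gaussian_logpdf(x: float, mean: float, std: float) -> float:
    """Log density of a univariate normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def log_likelihood_ratio(x: float, natural: Gaussian, unnatural: Gaussian) -> float:
    """log p(x | natural) - log p(x | unnatural) for one prosody feature value."""
    return gaussian_logpdf(x, *natural) - gaussian_logpdf(x, *unnatural)


def is_unnatural(x: float, natural: Gaussian, unnatural: Gaussian,
                 threshold: float = 1.0) -> bool:
    """Flag x as unnatural unless the natural model wins by more than threshold.

    A threshold of 0 would be an unbiased test. A positive threshold biases
    the detector toward flagging, trading more false alarms for fewer misses.
    """
    return log_likelihood_ratio(x, natural, unnatural) < threshold
```

In this sketch a borderline unit whose ratio slightly favors the natural model is still flagged when `threshold` is positive, which mirrors the abstract's statement that unnatural prosody is falsely detected at a higher rate than it is missed.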
Specification