Unnatural prosody detection in speech synthesis

US 8,583,438 B2
Filed: 09/20/2007
Issued: 11/12/2013
Est. Priority Date: 09/20/2007
Status: Expired due to Fees


Abstract

Described is a technology by which synthesized speech generated from text is evaluated against a prosody model (trained offline) to determine whether the speech will sound unnatural. If so, the speech is regenerated with modified data. The evaluation and regeneration may be repeated until the speech is deemed natural sounding. For example, text is built into a lattice that is then searched (e.g., with a Viterbi search) to find a best path. The sections (e.g., units) of data on the path are evaluated via a prosody model. If the evaluation deems a section to correspond to unnatural prosody, that section is replaced, e.g., by modifying or pruning the lattice and re-performing the search. Replacement may be iterative until all sections pass the evaluation. Unnatural prosody detection may be biased such that, during evaluation, unnatural prosody is falsely detected at a higher rate than the rate at which unnatural prosody is missed.
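
The search, evaluate, prune, and re-search cycle in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the lattice layout (one list of candidate unit ids per text position), the `target_cost` and `join_cost` functions, the `is_unnatural` callback, and the default `max_iters` are all assumptions introduced here.

```python
from typing import Callable, List, Optional

Lattice = List[List[str]]  # assumed layout: candidate unit ids per text position


def viterbi(lattice: Lattice,
            target_cost: Callable[[int, str], float],
            join_cost: Callable[[str, str], float]) -> Optional[List[str]]:
    """Return the lowest-cost path (one unit per position), or None if a column is empty."""
    if not lattice or any(not column for column in lattice):
        return None
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (target_cost(0, u), [u]) for u in lattice[0]}
    for pos in range(1, len(lattice)):
        new_best = {}
        for u in lattice[pos]:
            prev, (cost, path) = min(
                best.items(), key=lambda kv: kv[1][0] + join_cost(kv[0], u))
            new_best[u] = (cost + join_cost(prev, u) + target_cost(pos, u), path + [u])
        best = new_best
    return min(best.values(), key=lambda cost_path: cost_path[0])[1]


def search_with_pruning(lattice: Lattice,
                        target_cost: Callable[[int, str], float],
                        join_cost: Callable[[str, str], float],
                        is_unnatural: Callable[[int, str], bool],
                        max_iters: int = 10) -> Optional[List[str]]:
    """Repeat search and pruning until every unit passes or max_iters is reached."""
    lattice = [list(column) for column in lattice]  # work on a copy
    path = viterbi(lattice, target_cost, join_cost)
    for _ in range(max_iters):
        if path is None:
            break
        flagged = [(pos, u) for pos, u in enumerate(path) if is_unnatural(pos, u)]
        # Assumption: never empty a column, so some path always remains.
        removable = [(pos, u) for pos, u in flagged if len(lattice[pos]) > 1]
        if not removable:
            break  # all units pass, or nothing more can be pruned
        for pos, u in removable:
            lattice[pos].remove(u)
        path = viterbi(lattice, target_cost, join_cost)
    return path
```

With a two-position lattice in which the cheapest first unit is flagged as unnatural, the second search falls back to the next-best candidate for that position while keeping the rest of the path.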

15 Claims

1. At least one computer storage medium having computer-executable instructions that, when executed by a computer, cause the computer to perform a method comprising:
- building, based on text, a lattice comprising speech units, wherein each speech unit in the lattice is obtained from a database comprising a plurality of candidate speech units;
  
  finding, by the computer in the lattice, a sequence of speech units that conforms to the text;
  
  pruning, by the computer from the sequence of speech units, any of the speech units in the sequence that, based on likelihood ratios and a prosody model that was trained using actual speech, are detected to have unnatural prosody, where the prosody model exhibits a bias toward detecting unnatural prosody;
  
  iterating, by the computer, the finding and the pruning until completion that is based on a condition selected from a group of conditions comprising:
  
  1) every speech unit in the sequence corresponding to natural prosody, and
  
  2) iterating a maximum number of iterations.
- Dependent claims (2, 3, 4, 5):
  - 2. The at least one computer storage medium of claim 1, the method further comprising concatenating, in response to the completion, the speech units of the sequence resulting in a speech waveform that corresponds to the text.
  - 3. The at least one computer storage medium of claim 1 wherein the pruning further comprises replacing the speech unit in the lattice with one of the candidate speech units.
  - 4. The at least one computer storage medium of claim 1 wherein the pruning further comprises searching the lattice using a Viterbi search algorithm to find the sequence.
  - 5. The at least one computer storage medium of claim 1 wherein the pruning further comprises measuring a phoneme fitness and a syllable fitness and a transition smoothness of the speech units in the sequence.
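
Claim 1 ties detection to likelihood ratios and to a bias toward detecting unnatural prosody. One way to picture that bias is a log-likelihood-ratio test whose threshold is raised above the neutral point, so that more natural units are falsely flagged and fewer unnatural ones are missed. The sketch below is only illustrative: the single scalar feature, the two Gaussian models, and the threshold values are assumptions made here, not details taken from the patent.

```python
import math
from typing import Tuple

Gaussian = Tuple[float, float]  # (mean, standard deviation); an assumed model form


def gaussian_logpdf(x: float, mean: float, std: float) -> float:
    """Log density of a univariate normal distribution."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def log_likelihood_ratio(x: float, natural: Gaussian, alternative: Gaussian) -> float:
    """log p(x | natural model) - log p(x | alternative model)."""
    return gaussian_logpdf(x, *natural) - gaussian_logpdf(x, *alternative)


def is_unnatural(x: float, natural: Gaussian, alternative: Gaussian,
                 threshold: float = 0.0) -> bool:
    """Flag the unit when the ratio falls below the threshold.

    Raising the threshold above 0 biases the test toward flagging:
    false detections go up, missed detections go down.
    """
    return log_likelihood_ratio(x, natural, alternative) < threshold
```

For example, with `natural=(0.0, 1.0)` and `alternative=(3.0, 1.0)`, a feature value of `1.0` has a log-likelihood ratio of about 1.5: it passes a neutral threshold of `0.0` but is flagged once the threshold is raised to `2.0`.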

6. A method comprising:
- building, by a computer and based on text, a lattice comprising speech units, wherein each speech unit in the lattice is obtained from a database comprising a plurality of candidate speech units;
  
  finding, by the computer in the lattice, a sequence of speech units that conforms to the text;
  
  pruning, by the computer from the sequence of speech units, any of the speech units in the sequence that, based on likelihood ratios and a prosody model that was trained using actual speech, are detected to have unnatural prosody, where the prosody model exhibits a bias toward detecting unnatural prosody;
  
  iterating, by the computer, the finding and the pruning until completion that is based on a condition selected from a group of conditions comprising:
  
  1) every speech unit in the sequence corresponding to natural prosody, and
  
  2) iterating a maximum number of iterations.
- Dependent claims (7, 8, 9, 10):
  - 7. The method of claim 6 further comprising concatenating, in response to the completion, the speech units of the sequence resulting in a speech waveform that corresponds to the text.
  - 8. The method of claim 6 wherein the pruning further comprises replacing the speech unit in the lattice with one of the candidate speech units.
  - 9. The method of claim 6 wherein the pruning further comprises searching the lattice using a Viterbi search algorithm to find the sequence.
  - 10. The method of claim 6 wherein the pruning further comprises measuring a phoneme fitness and a syllable fitness and a transition smoothness of the speech units in the sequence.
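
The concatenation step in claims 2, 7, and 12 can be pictured with a very small sketch. It assumes each selected unit id maps to a list of audio samples (the `samples` dictionary is hypothetical) and simply joins them in path order; practical concatenative synthesizers usually also smooth the joins, which is omitted here.

```python
from typing import Dict, List


def concatenate_units(path: List[str], samples: Dict[str, List[float]]) -> List[float]:
    """Join the sample lists of the selected units, in order, into one waveform."""
    waveform: List[float] = []
    for unit_id in path:
        waveform.extend(samples[unit_id])  # raises KeyError for an unknown unit id
    return waveform
```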

11. A system comprising:
- a computer;
  
  a text analyzer implemented at least in part by the computer and configured for building, based on text, a lattice comprising speech units, wherein each speech unit in the lattice is obtained from a database comprising a plurality of candidate speech units;
  
  a search mechanism implemented at least in part by the computer and configured for finding, in the lattice, a sequence of speech units that conforms to the text;
  
  a pruning mechanism implemented at least in part by the computer and configured for pruning, from the sequence of speech units, any of the speech units in the sequence that, based on likelihood ratios and a prosody model that was trained using actual speech, are detected to have unnatural prosody, where the prosody model exhibits a bias toward detecting unnatural prosody;
  
  a detection mechanism implemented at least in part by the computer and configured for iterating the finding and the pruning until completion that is based on a condition selected from a group of conditions comprising:
  
  1) every speech unit in the sequence corresponding to natural prosody, and
  
  2) iterating a maximum number of iterations.
- Dependent claims (12, 13, 14, 15):
  - 12. The system of claim 11 further comprising a concatenation mechanism implemented by the computer and configured for concatenating, in response to the completion, the speech units of the sequence resulting in a speech waveform that corresponds to the text.
  - 13. The system of claim 11 wherein the pruning further comprises replacing the speech unit in the lattice with one of the candidate speech units.
  - 14. The system of claim 11 wherein the pruning further comprises searching the lattice using a Viterbi search algorithm to find the sequence.
  - 15. The system of claim 11 wherein the pruning further comprises measuring a phoneme fitness and a syllable fitness and a transition smoothness of the speech unit.
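
Claims 5, 10, and 15 name three measures (phoneme fitness, syllable fitness, transition smoothness) without saying here how they are computed or combined. The sketch below assumes each measure has already been reduced to a score where higher means more natural, and it flags a unit when any score falls below its threshold. The class names, the default thresholds, and the "any measure fails" rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List

MEASURES = ("phoneme_fitness", "syllable_fitness", "transition_smoothness")


@dataclass(frozen=True)
class UnitScores:
    """Per-unit scores; higher is assumed to mean more natural."""
    phoneme_fitness: float
    syllable_fitness: float
    transition_smoothness: float


@dataclass(frozen=True)
class Thresholds:
    """Minimum acceptable score per measure (assumed defaults)."""
    phoneme_fitness: float = 0.0
    syllable_fitness: float = 0.0
    transition_smoothness: float = 0.0


def failed_measures(scores: UnitScores,
                    thresholds: Thresholds = Thresholds()) -> List[str]:
    """Return the names of the measures whose score is below its threshold."""
    return [name for name in MEASURES
            if getattr(scores, name) < getattr(thresholds, name)]


def unit_is_unnatural(scores: UnitScores,
                      thresholds: Thresholds = Thresholds()) -> bool:
    """Flag the unit if at least one measure fails (an assumed combination rule)."""
    return bool(failed_measures(scores, thresholds))
```

Returning the list of failed measures, rather than only a boolean, makes it easier to see which aspect of a unit triggered pruning.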

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Zhao, Yong; Soong, Frank Kao-ping; Chu, Min; Wang, Lijuan
Primary Examiner(s): Godbold, Douglas
Application Number: US11/903,020
Publication Number: US 20090083036A1
Time in Patent Office: 2,245 Days
Field of Search: 704/258, 704/260
US Class Current: 704/260
CPC Class Codes: G10L 13/10 Prosody rules derived from ...
