Rich context modeling for text-to-speech engines

US 8,340,965 B2
Filed: 12/02/2009
Issued: 12/25/2012
Est. Priority Date: 09/02/2009
Status: Active Grant

Abstract

Embodiments of rich context modeling for speech synthesis are disclosed. In operation, a text-to-speech engine refines a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models. The text-to-speech engine then generates synthesized speech for an input text based at least on some of the plurality of refined rich context models.

23 Claims

1. A computer readable medium storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
- obtaining trained decision tree-tied hidden Markov Models (HMMs) for a speech corpus;
  
  estimating mean parameters of a plurality of rich context models based on the trained decision tree-tied HMMs by performing a single pass re-estimation;
  
  setting variance parameters of the plurality of rich context models equal to the variance parameters of the trained decision tree-tied HMMs to produce a plurality of refined rich context models; and
  
  generating synthesized speech for an input text based at least on some of the plurality of refined rich context models.
  - 2. The computer readable medium of claim 1, wherein the single pass re-estimate further obtains a state-level alignment of the speech corpus based on the trained decision tree-tied HMMs.
  - 3. The computer readable medium of claim 1, further storing an instruction that, when executed, cause the one or more processors to perform an act comprising outputting the synthesized speech to at least one of an acoustic speaker or a data storage.
  - 4. The computer readable medium of claim 1, wherein the generating comprises:
    - performing pre-selection to compose a rich context model candidate sausage for the input text, the candidate sausage including a plurality of refined rich context model sequences, each sequence including at least some refined rich context models from the plurality of refined rich context models;
      
      selecting one of the plurality of refined rich context model sequences that has a least divergence from a guiding sequence that is obtained from the decision tree-tied HMMs; and
      
      generating output speech for the input text based at least on a rich context model sequence that is selected from the plurality of refined rich context model sequences.
  - 5. The computer readable medium of claim 4, wherein the selecting includes searching for one of the plurality of refined rich context model sequences that has a shortest distance to the guiding sequence based on spectrum, pitch, and duration information of each sequence.
  - 6. The computer readable medium of claim 5, wherein the searching includes searching for one of the plurality of refined rich context model sequences that has the shortest distance via a state-aligned Kullback-Leibler divergence (KLD) approximation.
  - 7. The computer readable medium of claim 4, wherein the generating further includes synthesizing speech based further on line spectral pair (LSP) coefficients, a fundamental frequency, and a gain predicted from the input text.
  - 8. The computer readable medium of claim 1, wherein the generating comprises:
    - performing pre-selection to compose a rich context model candidate sausage for the input text, the candidate sausage including a plurality of refined rich context model sequences, each sequence including at least some refined rich context models from the plurality of refined rich context models;
      
      implementing unit pruning along the candidate sausage to select one or more rich context model sequences with less than a predetermined amount of distortion from a guiding sequence, the guiding sequence obtained from the decision tree-tied HMMs;
      
      conducting a normalized cross correlation-based search to derive a minimal concatenation cost rich context model sequence from the one or more rich context model sequences;
      
      concatenating waveform units of an input text along a path of the minimal concatenation cost rich context sequence to generate a waveform sequence; and
      
      generating output speech for the input text based at least on the waveform sequence.
  - 9. The computer readable medium of claim 8, wherein the implementing includes pruning refined rich context model sequences encompassed in the candidate sausage that are farther than a predetermined distance from the guiding sequence based on spectrum, pitch, and duration information.
  - 10. The computer readable medium of claim 8, wherein the implementing includes generating a Kullback-Leibler divergence (KLD) target cost table in advance of speech synthesis that facilitates the pruning along the candidate sausage to select the one or more rich context model sequences with less than the predetermined amount of distortion from the guiding sequence, and wherein the conducting includes generating a concatenation cost table in advance of speech synthesis to facilitate derivation of the minimal concatenation cost rich context model sequence.
  - 11. The computer readable medium of claim 8, wherein the generating further includes synthesizing speech based further on line spectral pair (LSP) coefficients, a fundamental frequency, and a gain predicted from the input text.
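
The refinement recited in claim 1 can be illustrated with a minimal sketch. It assumes a hard state-level alignment (as in claim 2) is already available as pairs of rich context label and feature frame, and that each rich context label maps to a leaf of the trained decision tree. The names `refine_rich_context_models`, `aligned_frames`, `leaf_of`, and `tied_variances` are hypothetical, and plain averaging over hard-assigned frames is a simplification, not necessarily the re-estimation procedure of the specification.

```python
from collections import defaultdict


def refine_rich_context_models(aligned_frames, leaf_of, tied_variances):
    """Estimate one mean per rich context label in a single pass over the data.

    aligned_frames: iterable of (context_label, feature_vector) pairs
                    (assumed output of a state-level alignment).
    leaf_of:        dict mapping context_label -> tied decision tree leaf id.
    tied_variances: dict mapping leaf id -> variance vector of the tied HMM.
    Returns a dict mapping context_label -> {"mean": [...], "var": [...]}.
    """
    sums = {}
    counts = defaultdict(int)

    # Single pass: accumulate per-label sums and frame counts.
    for label, frame in aligned_frames:
        if label not in sums:
            sums[label] = [0.0] * len(frame)
        acc = sums[label]
        for d, x in enumerate(frame):
            acc[d] += x
        counts[label] += 1

    models = {}
    for label, acc in sums.items():
        n = counts[label]
        models[label] = {
            # Mean is specific to the rich context.
            "mean": [s / n for s in acc],
            # Variance is copied from the tied model, not re-estimated.
            "var": list(tied_variances[leaf_of[label]]),
        }
    return models
```

Copying the tied variance keeps the variance estimate backed by all frames that share the leaf, which matters because a single rich context label may be observed only a few times in the corpus.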

12. A computer implemented method, comprising:
- under control of one or more computing systems configured with executable instructions, refining a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models;
  
  performing pre-selection to compose a rich context model candidate sausage for an input text, the candidate sausage including a plurality of refined rich context model sequences, each sequence including at least some refined rich context models from the plurality of refined rich context models;
  
  selecting one of the plurality of refined rich context model sequences that has a least divergence from a guiding sequence that is obtained from the decision tree-tied HMMs; and
  
  generating output speech for the input text based at least on a rich context model sequence that is selected from the plurality of refined rich context model sequences.
  - 13. The computer implemented method of claim 12, further comprising outputting the output speech to at least one of an acoustic speaker or a data storage.
  - 14. The computer implemented method of claim 12, wherein the refining further comprises:
    - obtaining trained decision tree-tied hidden Markov Models (HMMs) for a speech corpus;
      
      estimating mean parameters of the rich context models based on the trained decision tree-tied HMMs by performing a single pass re-estimation; and
      
      setting variance parameters of the rich context models equal to variance parameters of the trained decision tree-tied HMMs to produce the plurality of refined rich context models.
  - 15. The computer implemented method of claim 12, wherein the selecting includes searching for one of the plurality of refined rich context model sequences that has a shortest distance to the guiding sequence based on spectrum, pitch, and duration information of each sequence.
  - 16. The computer implemented method of claim 12, wherein the generating further includes synthesizing speech based further on line spectral pair (LSP) coefficients, a fundamental frequency, and a gain predicted from the input text.
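
The selection step of claim 12 can be sketched as follows. This is an illustration under stated assumptions: each model is reduced to a single diagonal-covariance Gaussian, the divergence is the symmetric Kullback-Leibler divergence, and no cost couples neighboring positions, so the least-divergence sequence can be chosen position by position. The function names and the dictionary layout are hypothetical; the state-aligned KLD approximation named in claim 6 would sum such terms over aligned HMM states and over spectrum, pitch, and duration streams.

```python
import math


def kld_diag(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for two diagonal-covariance Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_kld(p, q):
    """Symmetric divergence: KL(p || q) + KL(q || p)."""
    return (kld_diag(p["mean"], p["var"], q["mean"], q["var"])
            + kld_diag(q["mean"], q["var"], p["mean"], p["var"]))


def select_sequence(sausage, guiding):
    """Pick, at each position, the candidate closest to the guiding model.

    sausage: list of positions, each a list of candidate models.
    guiding: list of models from the decision tree-tied HMMs, one per position.
    """
    selected = []
    for candidates, guide in zip(sausage, guiding):
        best = min(candidates, key=lambda c: symmetric_kld(c, guide))
        selected.append(best)
    return selected
```

Because the divergence depends only on model parameters and not on the input text, it can be tabulated before synthesis, which is the idea behind the precomputed KLD target cost table of claims 10 and 21.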

17. A system, comprising:
- one or more processors;
  
  a memory that includes a plurality of computer-executable components, the plurality of computer-executable components comprising;
  
  a training module to refine a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models;
  
  a pre-selection module to perform pre-selection to compose a rich context model candidate sausage for an input text, the candidate sausage including a plurality of refined rich context model sequences, each sequence including at least some refined rich context models from the plurality of refined rich context models;
  
  a unit pruning module to implement unit pruning along the candidate sausage to select one or more rich context model sequences with less than a predetermined amount of distortion from a guiding sequence, the guiding sequence obtained from the decision tree-tied HMMs;
  
  a cross correlation search module to conduct a normalized cross correlation-based search to derive a minimal concatenation cost rich context model sequence from the one or more rich context model sequences;
  
  a waveform concatenation module to concatenate waveform units of an input text along a path of the minimal concatenation cost rich context model sequence to generate a waveform sequence; and
  
  a synthesis module to generate synthesized speech for the input text based at least on the waveform sequence.
  - 18. The system of claim 17, further comprising a data storage module to store the synthesized speech.
  - 19. The system of claim 17, wherein the training module is to further:
    - obtain trained decision tree-tied hidden Markov Models (HMMs) for a speech corpus;
      
      estimate mean parameters of the rich context models based on the trained decision tree-tied HMMs by performing a single pass re-estimation; and
      
      set variance parameters of the rich context models equal to variance parameters of the trained decision tree-tied HMMs to produce the plurality of refined rich context models.
  - 20. The system of claim 17, wherein the unit pruning module is to prune the refined rich context model sequences encompassed in the candidate sausage that are farther than a predetermined distance from the guiding sequence based on spectrum, pitch, and duration information.
  - 21. The system of claim 17, wherein the unit pruning module is to generate a Kullback-Leibler divergence (KLD) target cost table in advance of speech synthesis that facilitates pruning along the candidate sausage to select the one or more rich context model sequences with less than the predetermined amount of distortion from the guiding sequence.
  - 22. The system of claim 17, wherein the cross correlation search module is to generate a concatenation cost table in advance of speech synthesis to facilitate derivation of the minimal concatenation cost rich context model sequence.
  - 23. The system of claim 17, wherein the synthesis module is to synthesize speech based further on line spectral pair (LSP) coefficients, a fundamental frequency, and a gain predicted from the input text.
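
The cross correlation search module of claim 17 can be illustrated with a small dynamic-programming sketch. It assumes the candidate sausage has already been pruned, that each surviving candidate is represented by its waveform unit (a list of samples), and that the concatenation cost of a join is one minus the zero-lag normalized cross correlation between the tail of the left unit and the head of the right unit. The cost definition, the `overlap` parameter, and all function names are hypothetical choices made for illustration; the claims do not fix them.

```python
import math


def ncc(a, b):
    """Zero-lag normalized cross correlation of two sample windows."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0


def concat_cost(left_unit, right_unit, overlap):
    """Assumed join cost: low when the boundary windows correlate well."""
    return 1.0 - ncc(left_unit[-overlap:], right_unit[:overlap])


def min_concat_path(sausage, overlap=4):
    """Return candidate indices of the minimal total concatenation cost path.

    sausage: list of positions, each a list of waveform units (sample lists).
    """
    cost = [0.0] * len(sausage[0])
    back = []
    for pos in range(1, len(sausage)):
        new_cost, pointers = [], []
        for unit in sausage[pos]:
            options = [cost[j] + concat_cost(prev, unit, overlap)
                       for j, prev in enumerate(sausage[pos - 1])]
            j_best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[j_best])
            pointers.append(j_best)
        cost = new_cost
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    k = min(range(len(cost)), key=cost.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path
```

The pairwise join costs depend only on the stored units, so they can be computed once and cached, in the spirit of the concatenation cost table of claim 22.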

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Yan, Zhi-Jie; Qian, Yao; Soong, Frank Kao-Ping
Primary Examiner(s): Lerner, Martin
Application Number: US12/629,457
Publication Number: US 20110054903A1
Time in Patent Office: 1,119 Days
Field of Search: 704/256.2, 704/256.3, 704/258, 704/260, 704/266, 704/269
US Class Current: 704/258
CPC Class Codes: G10L 13/08 Text analysis or generation...
