Trajectory Tiling Approach for Text-to-Speech
First Claim
1. A computer-readable medium storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
obtaining a set of Hidden Markov Models (HMMs) and a set of waveform units from a speech corpus;
refining the set of HMMs via minimum generation error (MGE) training to generate a refined set of HMMs;
generating a speech parameter trajectory by applying the refined set of HMMs to an input text;
constructing a unit lattice of candidate waveform units selected from the set of waveform units based at least on the speech parameter trajectory;
performing a normalized cross-correlation (NCC)-based search on the unit lattice to obtain a minimal concatenation cost sequence of candidate waveform units; and
concatenating the minimal concatenation cost sequence of candidate waveform units into a concatenated waveform sequence.
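The normalized cross-correlation used in the search step can be sketched as follows. This is an illustrative implementation only, not the patent's own code; the representation of a waveform frame as a plain list of samples is an assumption.

```python
import math

def ncc(x, y):
    """Normalized cross-correlation of two equal-length waveform frames.

    Returns a value in [-1, 1]; values near 1 indicate the two segments
    match closely at the concatenation boundary.
    """
    if len(x) != len(y) or not x:
        raise ValueError("frames must be non-empty and of equal length")
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0
```

Because candidates whose boundary frames score near 1 join with the least audible discontinuity, a quantity such as 1 − NCC is a natural concatenation cost for the lattice search.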
Abstract
Hidden Markov Model (HMM) trajectory tiling (HTT)-based approaches may be used to synthesize speech from text. In operation, a set of HMMs and a set of waveform units may be obtained from a speech corpus. The set of HMMs is further refined via minimum generation error (MGE) training to generate a refined set of HMMs. Subsequently, a speech parameter trajectory may be generated by applying the refined set of HMMs to an input text. A unit lattice of candidate waveform units may be selected from the set of waveform units based at least on the speech parameter trajectory. A normalized cross-correlation (NCC)-based search on the unit lattice may be performed to obtain a minimal concatenation cost sequence of candidate waveform units, which are concatenated into a concatenated waveform sequence that is synthesized into speech.
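As a rough illustration of the minimum generation error idea described in the abstract, the sketch below nudges each HMM state mean toward the natural parameter frames aligned to it, shrinking the squared error between the generated and natural trajectories. The one-dimensional state means, fixed durations, and fixed alignment are simplifying assumptions, not the patent's method.

```python
def generate_trajectory(means, durations):
    """Generated parameter trajectory: each state's mean repeated for its duration."""
    traj = []
    for mean, dur in zip(means, durations):
        traj.extend([mean] * dur)
    return traj

def mge_refine(means, durations, natural, lr=0.5, iters=50):
    """MGE-style refinement (hypothetical simplification).

    Each state mean moves toward the average of the natural frames
    aligned to it, reducing the generation error at every iteration.
    """
    means = list(means)
    for _ in range(iters):
        start = 0
        for i, dur in enumerate(durations):
            frames = natural[start:start + dur]
            err = sum(f - means[i] for f in frames) / dur
            means[i] += lr * err
            start += dur
    return means
```

In this toy setting the update converges to the state-wise average of the aligned natural frames, the minimizer of the squared generation error for a mean-repetition trajectory.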
20 Claims
1. A computer-readable medium storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
obtaining a set of Hidden Markov Models (HMMs) and a set of waveform units from a speech corpus;
refining the set of HMMs via minimum generation error (MGE) training to generate a refined set of HMMs;
generating a speech parameter trajectory by applying the refined set of HMMs to an input text;
constructing a unit lattice of candidate waveform units selected from the set of waveform units based at least on the speech parameter trajectory;
performing a normalized cross-correlation (NCC)-based search on the unit lattice to obtain a minimal concatenation cost sequence of candidate waveform units; and
concatenating the minimal concatenation cost sequence of candidate waveform units into a concatenated waveform sequence.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
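The lattice search recited in claim 1 can be pictured as a dynamic-programming (Viterbi-style) pass over candidate units, keeping the cheapest path by accumulated concatenation cost. This is a sketch under stated assumptions: the lattice is a list of slots of candidate units, and the cost function is supplied by the caller (an NCC-based cost in the claimed approach).

```python
def min_cost_path(lattice, join_cost):
    """Minimum-concatenation-cost sequence through a unit lattice.

    lattice: list of slots, each a list of candidate units.
    join_cost: function(prev_unit, unit) -> non-negative cost.
    Returns (total_cost, chosen_units).
    """
    # best holds (accumulated cost, path) for each candidate in the current slot
    best = [(0.0, [u]) for u in lattice[0]]
    for slot in lattice[1:]:
        new_best = []
        for u in slot:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda t: t[0],
            )
            new_best.append((cost, path + [u]))
        best = new_best
    return min(best, key=lambda t: t[0])
```

With an NCC-based cost such as 1 − NCC at each boundary, the returned path is the minimal concatenation cost sequence of candidate waveform units.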
10. A computer implemented method, comprising:
under control of one or more computing systems configured with executable instructions,
obtaining a set of Hidden Markov Models (HMMs) and an initial set of waveform units from a speech corpus, each waveform unit in the initial set having a first time length;
generating a speech parameter trajectory by applying the set of HMMs to an input text;
constructing a unit lattice of candidate waveform units selected from the initial set of waveform units based at least on the speech parameter trajectory;
performing a normalized cross-correlation (NCC)-based search on the unit lattice to search for a sequence of candidate waveform units along a minimum concatenation cost path;
concatenating the sequence of candidate waveform units into a concatenated waveform sequence when the sequence of waveform units is found along the minimum concatenation cost path; and
generating a modified set of waveform units from the speech corpus when no sequence of candidate waveform units is found along the minimum concatenation cost path, each waveform unit in the modified set having a second time length that is less than the first time length.
Dependent claims: 11, 12, 13, 14, 15.
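The fallback in claim 10, retrying with shorter units when no acceptable path is found, can be sketched as follows. The fixed-length segmentation helper, the halving policy, and the pluggable search routine are all illustrative assumptions, not the claimed implementation.

```python
def segment_units(samples, unit_len):
    """Cut a corpus waveform into fixed-length units (hypothetical segmentation)."""
    return [samples[i:i + unit_len]
            for i in range(0, len(samples) - unit_len + 1, unit_len)]

def synthesize(samples, trajectory_slots, search, unit_len, min_len=1):
    """Search with units of a first time length; shorten and retry on failure.

    search(units, trajectory_slots) returns a unit sequence along the
    minimum concatenation cost path, or None when no sequence qualifies.
    """
    while unit_len >= min_len:
        units = segment_units(samples, unit_len)
        path = search(units, trajectory_slots)
        if path is not None:
            # concatenate the found units into a waveform sequence
            return [s for unit in path for s in unit]
        unit_len //= 2  # modified set: shorter units give a finer tiling
    raise RuntimeError("no unit sequence found at any unit length")
```

Shorter units enlarge the candidate pool at each slot, so a path that fails at the first time length may succeed at the second, shorter one.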
16. A system, comprising:
one or more processors; and
a memory that includes a plurality of computer-executable components, the plurality of computer-executable components comprising:
a Hidden Markov Model (HMM) component to obtain a set of HMMs from a speech corpus;
a refinement component to refine the set of HMMs via minimum generation error (MGE) training to generate a refined set of HMMs; and
a trajectory generation component to generate a speech parameter trajectory by applying the refined set of HMMs to an input text.
Dependent claims: 17, 18, 19, 20.
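The three components of claim 16 can be wired together in a minimal sketch. The toy corpus format (label to frame list), the single-Gaussian-mean models, and the one-step refinement standing in for MGE training are all assumptions for illustration.

```python
class HMMComponent:
    """Obtains a set of models from a speech corpus (toy stand-in for HMMs)."""
    def obtain(self, corpus):
        # one (mean, duration) model per distinct label in the corpus
        return {label: (sum(frames) / len(frames), len(frames))
                for label, frames in corpus.items()}

class RefinementComponent:
    """Refines the models; a real system would run MGE training here."""
    def refine(self, models, natural):
        # nudge each mean toward the natural frames it should generate
        return {label: (mean + 0.5 * (sum(natural[label]) / len(natural[label]) - mean), dur)
                for label, (mean, dur) in models.items()}

class TrajectoryGenerationComponent:
    """Generates a parameter trajectory by applying refined models to text."""
    def generate(self, models, text):
        traj = []
        for label in text:
            mean, dur = models[label]
            traj.extend([mean] * dur)
        return traj
```

The separation mirrors the claim: model acquisition, refinement, and trajectory generation are independent components sharing only the model set.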
Specification