Method and apparatus for speech synthesis without prosody modification

US 20020099547A1
Filed: 05/07/2001
Published: 07/25/2002
Est. Priority Date: 12/04/2000
Status: Active Grant

Abstract

A speech synthesizer is provided that concatenates stored samples of speech units without modifying the prosody of the samples. The present invention is able to achieve a high level of naturalness in synthesized speech with a carefully designed training speech corpus by storing samples based on the prosodic and phonetic context in which they occur. In particular, some embodiments of the present invention limit the training text to those sentences that will produce the most frequent sets of prosodic contexts for each speech unit. Further embodiments of the present invention also provide a multi-tier selection mechanism for selecting a set of samples that will produce the most natural sounding speech.

25 Claims

1. A method for synthesizing speech, the method comprising:
- generating a training context vector for each of a set of training speech units in a training speech corpus, each training context vector indicating the prosodic context of a training speech unit in the training speech corpus;
  
  indexing a set of speech segments associated with a set of training speech units based on the context vectors for the training speech units;
  
  generating an input context vector for each of a set of input speech units in an input text, each input context vector indicating the prosodic context of an input speech unit in the input text;
  
  using the input context vectors to find a speech segment for each input speech unit; and
  
  concatenating the found speech segments to form a synthesized speech signal.
- - 2. The method of claim 1 wherein each context vector comprises a position-in-phrase coordinate indicating the position of the speech unit in a phrase.
  - 3. The method of claim 1 wherein each context vector comprises a position-in-word coordinate indicating the position of the speech unit in a word.
  - 4. The method of claim 1 wherein each context vector comprises a left phonetic coordinate indicating a category for the phoneme to the left of the speech unit.
  - 5. The method of claim 1 wherein each context vector comprises a right phonetic coordinate indicating a category for the phoneme to the right of the speech unit.
  - 6. The method of claim 1 wherein each context vector comprises a left tonal coordinate indicating a category for the tone of the speech unit to the left of the speech unit.
  - 7. The method of claim 1 wherein each context vector comprises a right tonal coordinate indicating a category for the tone of the speech unit to the right of the speech unit.
  - 8. The method of claim 1 wherein indexing a set of speech segments comprises generating a decision tree based on the training context vectors.
  - 9. The method of claim 8 wherein using the input context vectors to find a speech segment comprises searching the decision tree using the input context vector.
  - 10. The method of claim 9 wherein searching the decision tree comprises:
    - identifying a leaf in the tree for each input context vector, each leaf comprising at least one candidate speech segment; and
      
      selecting one candidate speech segment in each leaf node, wherein if there is more than one candidate speech segment on the node, the selection is based on a cost function.
  - 11. The method of claim 10 wherein the cost function comprises a distance between the input context vector and a training context vector associated with a speech segment.
  - 12. The method of claim 11 wherein the cost function further comprises a smoothness cost that is based on a candidate speech segment of at least one neighboring speech unit.
  - 13. The method of claim 12 wherein the smoothness cost gives preference to selecting a series of speech segments for a series of input context vectors if the series of speech segments occurred in series in the training speech corpus.
  - 15. The method of claim 14 wherein identifying a collection of prosodic context information sets as necessary context information sets comprises:
    - determining the frequency of occurrence of each prosodic context information set across a very large text corpus; and
      
      identifying a collection of prosodic context information sets as necessary context information sets based on their frequency of occurrence.
  - 16. The method of claim 15 wherein identifying a collection of prosodic context information sets as necessary context information sets further comprises:
    - sorting the context information sets by their frequency of occurrence in decreasing order;
      
      determining a threshold, F, for accumulative frequency of top context vectors; and
      
      selecting the top context vectors whose accumulative frequency is not smaller than F for each speech unit as necessary prosodic context information sets.
  - 17. The method of claim 14 further comprising indexing only those speech segments that are associated with sentences in the smaller training text and wherein indexing comprises indexing using a decision tree.
  - 18. The method of claim 17 wherein indexing further comprises indexing the speech segments in the decision tree based on information in the context information sets.
  - 19. The method of claim 18 wherein the decision tree comprises leaf nodes and at least one leaf node comprises at least two speech segments for the same speech unit.
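
The indexing and lookup steps of claims 1 through 13 can be pictured with a small sketch. It is an illustration under stated assumptions, not the patented implementation: the six-field `Context` tuple mirrors the coordinates named in claims 2 through 7, while `Segment`, `leaf_key`, `build_index`, `context_distance`, and `find_segment` are hypothetical names that do not appear in the patent. A dictionary keyed on the unit and two coordinates stands in for the decision tree of claim 8, and an unweighted count of differing coordinates stands in for the distance of claim 11; the claims specify neither choice.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import NamedTuple


class Context(NamedTuple):
    """Hypothetical context vector with the six coordinates of claims 2-7."""
    pos_in_phrase: int
    pos_in_word: int
    left_phone: int
    right_phone: int
    left_tone: int
    right_tone: int


@dataclass(frozen=True)
class Segment:
    unit: str         # speech unit label, e.g. a syllable
    context: Context  # training context vector for this sample
    corpus_pos: int   # position of the sample in the training corpus


def leaf_key(unit, ctx):
    # Assumption: a real decision tree would learn its questions from data.
    # Here a "leaf" is simply the unit plus two positional coordinates.
    return (unit, ctx.pos_in_phrase, ctx.pos_in_word)


def build_index(segments):
    """Group training segments by leaf (stand-in for the tree of claim 8)."""
    index = defaultdict(list)
    for seg in segments:
        index[leaf_key(seg.unit, seg.context)].append(seg)
    return index


def context_distance(a, b):
    # Assumed cost: number of coordinates on which the two vectors differ.
    return sum(x != y for x, y in zip(a, b))


def find_segment(index, unit, ctx):
    """Return the closest candidate in the matching leaf, or None."""
    candidates = index.get(leaf_key(unit, ctx), [])
    if not candidates:
        return None
    return min(candidates, key=lambda s: context_distance(ctx, s.context))


near = Segment("ma", Context(0, 0, 1, 2, 3, 4), corpus_pos=10)
far = Segment("ma", Context(0, 0, 5, 6, 0, 0), corpus_pos=57)
other = Segment("ma", Context(1, 0, 1, 2, 3, 4), corpus_pos=90)
index = build_index([near, far, other])

query = Context(0, 0, 1, 2, 3, 0)
chosen = find_segment(index, "ma", query)
print(chosen.corpus_pos)
```

In this toy data, `near` and `far` share a leaf with the query, and `near` differs from the query in only one coordinate, so it is the segment returned. The smoothness term of claims 12 and 13 is left out here because it needs the neighboring units as well.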

14. A method of selecting sentences for reading into a training speech corpus used in speech synthesis, the method comprising:
- identifying a set of prosodic context information for each of a set of speech units;
  
  determining a frequency of occurrence for each distinct context vector that appears in a very large text corpus;
  
  using the frequency of occurrence of the context vectors to identify a list of necessary context vectors; and
  
  selecting sentences in the large text corpus for reading into the training speech corpus, each selected sentence containing at least one necessary context vector.
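
The sentence-selection procedure of claims 14 through 16 can be sketched as follows. This is a minimal illustration with assumed details: `necessary_contexts` and `select_sentences` are hypothetical names, each sentence is represented as a list of `(unit, context)` pairs with short strings standing in for context vectors, the threshold F is treated as a fraction of each unit's total count, and the greedy pass that keeps a sentence only when it covers something new is one possible reading of "containing at least one necessary context vector", not a rule stated in the claims.

```python
from collections import Counter, defaultdict


def necessary_contexts(corpus, threshold):
    """Pick, per speech unit, the most frequent contexts whose
    accumulated share of occurrences first reaches `threshold`."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for unit, ctx in sentence:
            counts[unit][ctx] += 1

    needed = set()
    for unit, ctx_counts in counts.items():
        total = sum(ctx_counts.values())
        running = 0
        # most_common() yields contexts in decreasing order of frequency.
        for ctx, n in ctx_counts.most_common():
            needed.add((unit, ctx))
            running += n
            if running >= threshold * total:
                break
    return needed


def select_sentences(corpus, needed):
    """Greedy pass (an assumption): keep a sentence only if it covers
    at least one necessary context that is not covered yet."""
    covered = set()
    chosen = []
    for i, sentence in enumerate(corpus):
        new = (set(sentence) & needed) - covered
        if new:
            chosen.append(i)
            covered |= new
    return chosen


corpus = [
    [("ma", "A"), ("ba", "X")],
    [("ma", "A"), ("ba", "X")],
    [("ma", "B"), ("ba", "X")],
    [("ma", "C"), ("ba", "Y")],
]
needed = necessary_contexts(corpus, threshold=0.75)
print(select_sentences(corpus, needed))
```

With a threshold of 0.75, the rare contexts `("ma", "C")` and `("ba", "Y")` fall outside the necessary set, so the last sentence is never selected, and the second sentence is skipped because it adds nothing the first did not already cover.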

20. A method of selecting speech segments for concatenative speech synthesis, the method comprising:
- parsing an input text into speech units;
  
  identifying context information for each speech unit based on its location in the input text and at least one neighboring speech unit;
  
  identifying a set of candidate speech segments for each speech unit based on the context information; and
  
  identifying a sequence of speech segments from the candidate speech segments based in part on a smoothness cost between the speech segments.
- - 21. The method of claim 20 wherein identifying a set of candidate speech segments for a speech unit comprises applying the context information for a speech unit to a decision tree to identify a leaf node containing candidate speech segments for the speech unit.
  - 22. The method of claim 21 wherein identifying a set of candidate speech segments further comprises pruning some speech segments from a leaf node based on differences between the context information of the speech unit from the input text and context information associated with the speech segments.
  - 23. The method of claim 20 wherein identifying a sequence of speech segments comprises using a smoothness cost that is based on whether two neighboring candidate speech segments appeared next to each other in a training corpus.
  - 24. The method of claim 21 wherein identifying a sequence of speech segments further comprises identifying the sequence based in part on differences between context information for the speech unit of the input text and context information associated with a candidate speech segment.
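
One way to picture the sequence selection of claims 20 through 24 is a dynamic-programming search over the candidate lists. The sketch below is an assumption-laden illustration, not the method as specified: `Seg`, `target_cost`, `smoothness_cost`, `best_sequence`, the `keep` pruning limit, and the `smooth_weight` parameter are all hypothetical. The smoothness cost is taken to be zero when two candidates were adjacent in the training corpus (claim 23) and a fixed penalty otherwise, the context difference of claim 24 is counted as the number of mismatched coordinates, and every unit is assumed to have at least one candidate.

```python
from typing import NamedTuple


class Seg(NamedTuple):
    corpus_pos: int  # position of the sample in the training corpus
    context: tuple   # training context information for the sample


def target_cost(target, seg):
    # Assumed context difference: count of mismatched coordinates.
    return sum(x != y for x, y in zip(target, seg.context))


def smoothness_cost(prev, cur, smooth_weight=1.0):
    # Zero when the two samples were neighbors in the training corpus.
    return 0.0 if cur.corpus_pos == prev.corpus_pos + 1 else smooth_weight


def best_sequence(targets, candidates, keep=3):
    """Viterbi-style search for the lowest-cost segment sequence."""
    # Prune each candidate list to the `keep` closest contexts (cf. claim 22).
    pruned = [
        sorted(cands, key=lambda s, t=t: target_cost(t, s))[:keep]
        for t, cands in zip(targets, candidates)
    ]
    # best[i][j] = (lowest total cost ending in pruned[i][j], backpointer)
    best = [[(target_cost(targets[0], s), None) for s in pruned[0]]]
    for i in range(1, len(pruned)):
        row = []
        for s in pruned[i]:
            cost, back = min(
                (best[i - 1][k][0] + smoothness_cost(p, s), k)
                for k, p in enumerate(pruned[i - 1])
            )
            row.append((cost + target_cost(targets[i], s), back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(pruned) - 1, -1, -1):
        path.append(pruned[i][j])
        j = best[i][j][1]
    return path[::-1]


targets = [(0, 1), (1, 1)]
candidates = [
    [Seg(5, (0, 0)), Seg(40, (0, 0))],
    [Seg(41, (1, 1)), Seg(9, (1, 1))],
]
result = best_sequence(targets, candidates)
print([s.corpus_pos for s in result])
```

In this example all candidates for a given unit match the target context equally well, so the search settles on the pair at corpus positions 40 and 41, the only combination that was contiguous in the training corpus.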

25. A computer-readable medium having computer executable instructions for synthesizing speech from speech segments based on speech units found in an input text, the speech being synthesized through a method comprising steps of:
- identifying context information for each speech unit based on the prosodic structure of the input text;
  
  identifying a set of candidate speech segments for each speech unit based on the context information;
  
  identifying a sequence of speech segments from the candidate speech segments;
  
  concatenating the sequence of speech segments without modifying the prosody of the speech segments to form the synthesized speech.
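
The final step of claim 25, concatenation without prosody modification, amounts to joining the stored samples as they are. A minimal sketch, assuming each segment is held as an `array` of 16-bit PCM samples (the storage format and the name `concatenate` are assumptions, not details from the patent):

```python
from array import array


def concatenate(segments):
    """Join stored waveforms end to end, leaving every sample untouched:
    no pitch, duration, or energy scaling is applied."""
    out = array("h")  # signed 16-bit samples (assumed format)
    for samples in segments:
        out.extend(samples)
    return out


first = array("h", [100, -200])
second = array("h", [300])
signal = concatenate([first, second])
print(signal.tolist())
```

Because no signal processing is applied at the joins, the naturalness of the result depends entirely on how well the selected segments already fit their context, which is why the selection steps above carry the weight of the method.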

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Peng, Hu; Chu, Min
Granted Patent: US 6,978,239 B2
US Class Current: 704/260
CPC Class Codes: G10L 13/07 (Concatenation rules)
