Concatenative speech synthesis using a finite-state transducer

US 20030055641A1
Filed: 09/17/2001
Published: 03/20/2003
Est. Priority Date: 09/17/2001
Status: Active Grant

Abstract

A method for concatenative speech synthesis includes a processing stage that selects segments based on their symbolic labeling in an efficient graph-based search, which uses a finite-state transducer formalism. This graph-based search uses a representation of concatenation constraints and costs that does not necessarily grow with the size of the source corpus, thereby limiting the increase in computation required for the search as the size of the source corpus increases. In one application of this method, multiple alternative segment sequences are generated and a best segment sequence is then selected using characteristics that depend on specific signal characteristics of the segments.
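The label-based selection described in the abstract can be illustrated with a minimal sketch. This is not the patented finite-state transducer search: it is a plain dynamic-programming pass over candidate segments, and the function name `select_segments`, the corpus layout, and the flat `join_cost` are all assumptions made for illustration. It also compares every pair of candidates at adjacent positions, so its work grows with the corpus, which is the behavior the abstract's constraint representation is meant to limit.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, int]  # (utterance id, index of the segment in that utterance)


def select_segments(corpus: Dict[str, List[str]],
                    target: List[str],
                    join_cost: float = 1.0) -> Tuple[float, List[Segment]]:
    """Return (total cost, segment sequence) for the cheapest match of `target`.

    Hypothetical scoring: consecutive segments taken from the same source
    utterance cost nothing; any other concatenation costs `join_cost`.
    """
    # Index every corpus segment by its unit label.
    by_label: Dict[str, List[Segment]] = {}
    for utt_id, labels in corpus.items():
        for i, label in enumerate(labels):
            by_label.setdefault(label, []).append((utt_id, i))

    # best[seg] = (cost of the cheapest path ending in seg, that path)
    best: Dict[Segment, Tuple[float, List[Segment]]] = {}
    for pos, label in enumerate(target):
        candidates = by_label.get(label, [])
        if not candidates:
            raise ValueError(f"no segment labelled {label!r} in corpus")
        new_best: Dict[Segment, Tuple[float, List[Segment]]] = {}
        for seg in candidates:
            if pos == 0:
                new_best[seg] = (0.0, [seg])
                continue
            options = []
            for prev, (cost, path) in best.items():
                contiguous = prev[0] == seg[0] and prev[1] + 1 == seg[1]
                step = 0.0 if contiguous else join_cost
                options.append((cost + step, path + [seg]))
            new_best[seg] = min(options, key=lambda o: o[0])
        best = new_best
    return min(best.values(), key=lambda o: o[0])


corpus = {"u1": ["h", "e", "l", "o"], "u2": ["y", "e", "l", "p"]}
cost, path = select_segments(corpus, ["h", "e", "l", "p"])
```

In this toy corpus the target needs one concatenation between the two utterances, so the returned cost equals one `join_cost`; where the join falls depends on tie-breaking.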

18 Claims

1. A method for selecting segments from a corpus of source utterances for synthesizing a target utterance, comprising:
- searching a graph in which each path through the graph identifies a sequence of segments of the source utterances and a corresponding sequence of unit labels that characterizes a pronunciation of a concatenation of that sequence of segments, each path being associated with a numerical score that characterizes a quality of the sequence of segments;
  
  wherein searching the graph includes matching a pronunciation of the target utterance to paths through the graph, and selecting segments for synthesizing the target utterance based on numerical scores of matching paths through the graph.
  - 2. The method of claim 1 wherein selecting segments for synthesizing the target utterance includes identifying a path through the graph that matches the pronunciation of the target utterance and selecting the sequence of segments that is identified by the determined path.
  - 3. The method of claim 2 wherein determining the path includes determining a best scoring path through the graph.
  - 4. The method of claim 3 wherein determining the best scoring path involves using a dynamic programming algorithm.
  - 5. The method of claim 2 further comprising concatenating the selected sequence of segments to form a waveform representation of the target utterance.
  - 6. The method of claim 1 wherein selecting the segments for synthesizing the target utterance includes determining a plurality of paths through the graph that each matches the representation of the pronunciation of the target utterance.
  - 7. The method of claim 6 wherein selecting the segments further includes forming a plurality of sequences of segments, each associated with a different one of the plurality of paths.
  - 8. The method of claim 7 wherein selecting the segments further includes selecting one of the sequences of segments based on characteristics of those sequences of segments not determined by the corresponding sequences of unit labels associated with those sequences.
  - 9. The method of claim 1 further comprising forming a representation of a plurality of pronunciations of the target utterance, and wherein searching the graph includes matching any of the pronunciations of the target utterance to paths through the graph.
  - 10. The method of claim 1 further comprising forming a representation of the pronunciation of the target utterance in terms of alternating unit labels and transition labels.
  - 11. The method of claim 1 wherein the graph includes a first part that encodes a sequence of segments and a corresponding sequence of unit labels for each of the source utterances, and a second part that encodes allowable transitions between segments of different source utterances and encodes a transition score for each of those transitions;
    - and matching the pronunciation of the target utterance to paths through the graph includes considering paths in which each transition between segments of different source utterances identified by that path corresponds to a different subpath of that path that passes through the second part of the graph.
  - 12. The method of claim 10, wherein selecting the segments for synthesis includes evaluating a score for each of the considered paths that is based on the transition scores associated with the subpaths through the second part of the graph.
  - 13. The method of claim 10 wherein a size of the second part of the graph is substantially independent of a size of the source corpus, and a complexity of matching the pronunciation through the graph grows less than linearly with the size of the corpus.
  - 14. The method of claim 1 further comprising:
    - providing the corpus of source utterances, each source utterance being segmented into a sequence of segments, each consecutive pair of segments in a source utterance forming a segment boundary, and each speech segment being associated with a unit label and each segment boundary being associated with a transition label; and
      
      forming the graph, including forming a first part of the graph that encodes a sequence of segments and a corresponding sequence of unit labels for each of the source utterances, and forming a second part that encodes allowable transitions between segments of different source utterances and encodes a transition score for each of those transitions.
  - 15. The method of claim 14 wherein forming the second part of the graph is performed independently of the utterances in the corpus of source utterances.
  - 16. The method of claim 14 further comprising:
    - augmenting the corpus of source utterances with additional utterances; and
      
      augmenting the graph including augmenting the first part of the graph to encode the additional utterances, and linking the augmented first part to the second part without modifying the second part based on the additional utterances.
  - 17. The method of claim 1 wherein the graph is associated with a finite-state transducer which accepts input symbols that include unit labels and transition labels, and that produces identifiers of segments of the source utterances, and wherein searching the graph is equivalent to composing a finite-state transducer representation of a pronunciation of the target utterance with the finite-state transducer with which the graph is associated.

18. Software stored on a computer-readable medium for causing a computer to perform functions comprising selecting segments from a corpus of source utterances for synthesizing a target utterance, wherein selecting the segments comprises:
- searching a graph in which each path through the graph identifies a sequence of segments of the source utterances and a corresponding sequence of unit labels that characterizes a pronunciation of a concatenation of that sequence of segments, each path being associated with a numerical score that characterizes a quality of the sequence of segments;
  
  wherein searching the graph includes matching a pronunciation of the target utterance to paths through the graph, and selecting segments for synthesizing the target utterance based on numerical scores of matching paths through the graph.
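
The two-part organization recited in claims 11 and 14 to 16 can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed transducer: the class `TwoPartGraph`, the `"#"` edge label, and the toy cost (0 when the exit and entry transition labels match, 1 otherwise) are hypothetical. The point it shows is that the transition-cost table is built from the transition-label inventory alone, so adding utterances extends only the per-utterance part.

```python
from itertools import product
from typing import Dict, List, Tuple

EDGE = "#"  # hypothetical label for the boundary at the start or end of an utterance
Segment = Tuple[str, int]  # (utterance id, segment index)


class TwoPartGraph:
    def __init__(self, transition_labels: List[str]) -> None:
        # Second part: one cost per ordered pair of transition labels.
        # It depends only on the label inventory, never on the corpus.
        labels = list(transition_labels) + [EDGE]
        self.join_cost: Dict[Tuple[str, str], float] = {
            (a, b): 0.0 if a == b else 1.0 for a, b in product(labels, repeat=2)
        }
        # First part: per utterance, its unit labels and padded boundary labels.
        self.utterances: Dict[str, Tuple[List[str], List[str]]] = {}

    def add_utterance(self, utt_id: str, units: List[str],
                      boundaries: List[str]) -> None:
        """Add one source utterance; `boundaries[i]` lies between units i and i+1."""
        if len(boundaries) != len(units) - 1:
            raise ValueError("need exactly one boundary between consecutive units")
        known = {a for a, _ in self.join_cost}
        if not set(boundaries) <= known:
            raise ValueError("unknown transition label")
        # Pad so that padded[k] is the boundary to the left of segment k.
        self.utterances[utt_id] = (list(units), [EDGE] + list(boundaries) + [EDGE])

    def transition_cost(self, left: Segment, right: Segment) -> float:
        """Cost of playing `right` immediately after `left`."""
        (lu, li), (ru, ri) = left, right
        if lu == ru and li + 1 == ri:
            return 0.0  # contiguous in the source utterance: no concatenation
        exit_label = self.utterances[lu][1][li + 1]  # boundary right of `left`
        entry_label = self.utterances[ru][1][ri]     # boundary left of `right`
        return self.join_cost[(exit_label, entry_label)]


graph = TwoPartGraph(["C|V", "V|C"])
table_size = len(graph.join_cost)
graph.add_utterance("u1", ["b", "a", "t"], ["C|V", "V|C"])
graph.add_utterance("u2", ["c", "a", "p"], ["C|V", "V|C"])
```

After both utterances are added, `len(graph.join_cost)` is still `table_size`, mirroring the statement in claim 16 that the second part is not modified when the corpus is augmented.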

Current Assignee
Massachusetts Institute of Technology
Original Assignee
Massachusetts Institute of Technology
Inventors
Hetherington, Irvine Lee; Glass, James Robert; Yi, Jon Rong-Wei

Granted Patent

US 7,165,030 B2
US Class Current

704/238
CPC Class Codes

G10L 13/06 Elementary speech units use...

G10L 15/12 using dynamic programming t...
