Concatenative speech synthesis using a finite-state transducer

US 7,165,030 B2
Filed: 09/17/2001
Issued: 01/16/2007
Est. Priority Date: 09/17/2001
Status: Expired due to Fees

Abstract

A method for concatenative speech synthesis includes a processing stage that selects segments based on their symbolic labeling in an efficient graph-based search, which uses a finite-state transducer formalism. This graph-based search uses a representation of concatenation constraints and costs that does not necessarily grow with the size of the source corpus, thereby limiting the increase in computation required for the search as the size of the source corpus increases. In one application of this method, multiple alternative segment sequences are generated and a best segment sequence is then selected using characteristics that depend on specific signal characteristics of the segments.
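
The two-stage selection described in the last sentence of the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `n_best_then_rescore`, the toy segment identifiers, the pitch values, and both cost functions are hypothetical, and the brute-force enumeration stands in for the graph search only to keep the example short.

```python
import heapq
from itertools import product

def n_best_then_rescore(candidates, symbolic_cost, signal_cost, n=3):
    """Shortlist the n cheapest sequences by symbolic cost, then pick the
    shortlisted sequence with the lowest signal-level cost."""
    # Brute-force enumeration of every candidate sequence (toy sizes only).
    sequences = product(*candidates)
    shortlist = heapq.nsmallest(n, sequences, key=symbolic_cost)
    return min(shortlist, key=signal_cost)

# Hypothetical toy data: source utterance and pitch (Hz) of each segment.
utterance_of = {"a1": 0, "a2": 1, "b1": 0, "b2": 1}
pitch_of = {"a1": 100, "a2": 120, "b1": 125, "b2": 101}

def symbolic_cost(seq):
    # One unit of cost per join between segments of different utterances.
    return sum(utterance_of[p] != utterance_of[q] for p, q in zip(seq, seq[1:]))

def signal_cost(seq):
    # Pitch mismatch at each join; a stand-in for real signal characteristics.
    return sum(abs(pitch_of[p] - pitch_of[q]) for p, q in zip(seq, seq[1:]))

candidates = [["a1", "a2"], ["b1", "b2"]]
best_of_two = n_best_then_rescore(candidates, symbolic_cost, signal_cost, n=2)
best_of_all = n_best_then_rescore(candidates, symbolic_cost, signal_cost, n=4)
```

With a shortlist of two, only sequences with no cross-utterance join are rescored; widening the shortlist lets the signal-level measure choose a sequence the symbolic stage ranked lower.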

66 Citations

14 Claims

1. A method for selecting segments from a corpus of source utterances for synthesizing a target utterance, comprising:
- searching a precomputed graph in which each path through the graph identifies a sequence of segments of the corpus of source utterances and a corresponding sequence of unit labels that characterizes a pronunciation of a concatenation of that sequence of segments, each path being associated with a numerical score that characterizes a quality of the sequence of segments;
  
  wherein searching the precomputed graph includes matching a pronunciation of the target utterance to paths through the graph, and selecting segments for synthesizing the target utterance based on numerical scores of matching paths through the graph;
  
  the precomputed graph includes a first part that encodes a sequence of segments and a corresponding sequence of unit labels for each of the source utterances, and a second part computed in advance of run-time when the target utterance is known that includes paths for coupling segments of the source utterances and encodes allowable transitions between segments of different source utterances and encodes a transition score for each of those transitions; and
  
  matching the pronunciation of the target utterance to paths through the graph includes considering paths in which each transition between segments of different source utterances identified by that path corresponds to a different subpath of that path that passes through the second part of the graph.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
  - 2. The method of claim 1 wherein selecting segments for synthesizing the target utterance includes identifying a path through the graph that matches the pronunciation of the target utterance and selecting the sequence of segments that is identified by the determined path.
  - 3. The method of claim 2 wherein determining the path includes determining a best scoring path through the graph.
  - 4. The method of claim 3 wherein determining the best scoring path involves using a dynamic programming algorithm.
  - 5. The method of claim 2 further comprising concatenating the selected sequence of segments to form a waveform representation of the target utterance.
  - 6. The method of claim 1 wherein selecting the segments for synthesizing the target utterance includes determining a plurality of paths through the graph that each matches the representation of the pronunciation of the target utterance.
  - 7. The method of claim 6 wherein selecting the segments further includes forming a plurality of sequences of segments, each associated with a different one of the plurality of paths.
  - 8. The method of claim 7 wherein selecting the segments further includes selecting one of the sequences of segments based on characteristics of those sequences of segments not determined by the corresponding sequences of unit labels associated with those sequences.
  - 9. The method of claim 1 further comprising forming a representation of a plurality of pronunciations of the target utterance, and wherein searching the graph includes matching any of the pronunciations of the target utterance to paths through the graph.
  - 10. The method of claim 1 further comprising forming a representation of the pronunciation of the target utterance in terms of alternating unit labels and transition labels.
  - 11. The method of claim 1, wherein selecting the segments for synthesis includes evaluating a score for each of the considered paths that is based on the transition scores associated with the subpaths through the second part of the graph.
  - 12. The method of claim 1 wherein a size of the second part of the graph is substantially independent of a size of the source corpus, and a complexity of matching the pronunciation through the graph grows less than linearly with the size of the corpus.
  - 13. The method of claim 1 wherein the graph is associated with a finite-state transducer which accepts input symbols that include unit labels and transition labels, and that produces identifiers of segments of the source utterances, and wherein searching the graph is equivalent to composing a finite-state transducer representation of a pronunciation of the target utterance with the finite-state transducer with which the graph is associated.
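
The best-path selection recited in claims 1 through 4 and 11 can be sketched as a small dynamic-programming (Viterbi-style) search in Python. This is an illustrative simplification, not the claimed finite-state transducer: the function `select_segments`, the toy corpus, the `join_cost` table, and the `default_cost` value are all hypothetical. Contiguous segments of one source utterance are followed at zero cost, while a switch between utterances is charged a transition score looked up by unit-label pair, standing in for a subpath through the second part of the graph.

```python
def select_segments(corpus, target, join_cost, default_cost=10.0):
    """Return (score, path) for the cheapest segment sequence matching target.

    corpus: list of source utterances, each a list of unit labels.
    target: list of unit labels for the target utterance.
    join_cost: {(left_label, right_label): score} for cross-utterance joins.
    A segment is identified as (utterance index, position in utterance).
    """
    if not target:
        raise ValueError("target pronunciation is empty")

    # Index every corpus segment by its unit label.
    by_label = {}
    for u, utterance in enumerate(corpus):
        for k, label in enumerate(utterance):
            by_label.setdefault(label, []).append((u, k))

    def transition(prev, cur):
        # Staying inside one source utterance is free.
        if prev[0] == cur[0] and cur[1] == prev[1] + 1:
            return 0.0
        # Otherwise pay the score for this pair of unit labels.
        left = corpus[prev[0]][prev[1]]
        right = corpus[cur[0]][cur[1]]
        return join_cost.get((left, right), default_cost)

    # best[segment] = (score, path) of the cheapest path ending at segment.
    best = {seg: (0.0, [seg]) for seg in by_label.get(target[0], [])}
    for label in target[1:]:
        step = {}
        for cur in by_label.get(label, []):
            options = [(score + transition(prev, cur), path + [cur])
                       for prev, (score, path) in best.items()]
            if options:
                step[cur] = min(options, key=lambda o: o[0])
        best = step
    if not best:
        raise ValueError("no path in the corpus matches the target")
    return min(best.values(), key=lambda o: o[0])

# Hypothetical toy corpus of two source utterances.
corpus = [["k", "ae", "t"], ["b", "ae", "d"]]
join_cost = {("ae", "t"): 1.0, ("b", "ae"): 2.0}
score, path = select_segments(corpus, ["b", "ae", "t"], join_cost)
```

In this toy case the search keeps "b" and "ae" together from the second utterance and joins to "t" from the first, because that single join is cheaper than splitting after "b".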

14. A computer-readable medium comprising instructions for causing a computer to perform functions comprising selecting segments from a corpus of source utterances for synthesizing a target utterance, wherein selecting the segments comprises:
- searching a precomputed graph in which each path through the graph identifies a sequence of segments of the corpus of source utterances and a corresponding sequence of unit and transition labels that characterizes a pronunciation of a concatenation of that sequence of segments, each path being associated with a numerical score that characterizes a quality of the sequence of segments;
  
  wherein searching the precomputed graph includes matching a pronunciation of the target utterance represented by unit labels and transition labels to paths through the graph, and selecting segments for synthesizing the target utterance based on the numerical scores of matching paths through the graph;
  
  the precomputed graph includes a first part that encodes a sequence of segments and a corresponding sequence of unit labels for each of the source utterances, and a second part computed in advance of run-time when the target utterance is known that includes paths for coupling segments of the source utterances and encodes allowable transitions between segments of different source utterances and encodes a transition score for each of those transitions; and
  
  matching the pronunciation of the target utterance to paths through the graph includes considering paths in which each transition between segments of different source utterances identified by that path corresponds to a different subpath of that path that passes through the second part of the graph.
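
To see why routing cross-utterance transitions through a shared second part of the graph can limit growth, the following Python sketch compares arc counts under two layouts. The layouts and the function `arc_counts` are assumptions made for illustration, not taken from the patent: in the "direct" layout every segment has an arc to every other segment, while in the "shared" layout each segment has one exit arc and one entry arc into a layer whose internal arcs depend only on the number of label classes.

```python
def arc_counts(num_segments, num_classes):
    """Return (direct, shared) arc counts for a hypothetical graph layout."""
    # Direct layout: an arc from every segment to every other segment.
    direct = num_segments * (num_segments - 1)
    # Shared layout: one exit and one entry arc per segment, plus a
    # class-to-class layer whose size does not depend on the corpus.
    shared = 2 * num_segments + num_classes * num_classes
    return direct, shared

direct, shared = arc_counts(num_segments=1000, num_classes=10)
```

Under these assumptions the class-to-class layer stays at `num_classes * num_classes` arcs however many segments the corpus holds, and only the per-segment connecting arcs grow with the corpus.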

Current Assignee: Massachusetts Institute of Technology
Original Assignee: Massachusetts Institute of Technology
Inventors: Hetherington, Irvine Lee; Glass, James Robert; Yi, Jon Rong-Wei
Primary Examiner(s): Dorvil; Richemond
Assistant Examiner(s): Vo; Huyen X.
Application Number: US09/954,979
Publication Number: US 20030055641A1
Time in Patent Office: 1,947 Days
Field of Search: 704/260, 704/256, 704/242, 704/239, 704/259, 704/258, 704/261, 704/262, 704/263, 704/266, 704/270, 704/271, 704/238
US Class Current: 704/238
CPC Class Codes:
G10L 13/06 Elementary speech units use...
G10L 15/12 using dynamic programming t...
