Concatenative speech synthesis using a finite-state transducer
First Claim
1. A method for selecting segments from a corpus of source utterances for synthesizing a target utterance, comprising:
- searching a precomputed graph in which each path through the graph identifies a sequence of segments of the corpus of source utterances and a corresponding sequence of unit labels that characterizes a pronunciation of a concatenation of that sequence of segments, each path being associated with a numerical score that characterizes a quality of the sequence of segments;
wherein searching the precomputed graph includes matching a pronunciation of the target utterance to paths through the graph, and selecting segments for synthesizing the target utterance based on numerical scores of matching paths through the graph;
the precomputed graph includes a first part that encodes a sequence of segments and a corresponding sequence of unit labels for each of the source utterances, and a second part computed in advance of run-time when the target utterance is known that includes paths for coupling segments of the source utterances and encodes allowable transitions between segments of different source utterances and encodes a transition score for each of those transitions; and
matching the pronunciation of the target utterance to paths through the graph includes considering paths in which each transition between segments of different source utterances identified by that path corresponds to a different subpath of that path that passes through the second part of the graph.
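The search recited in this claim can be illustrated with a minimal sketch. It is not the patented implementation: the function name `select_segments`, the single shared "hub" state standing in for the second part of the graph, and the constant `join_cost` are all assumptions made for illustration, and any non-contiguous transition is treated as a join even when both segments come from the same source utterance. Staying on consecutive segments of one source utterance is free (the first part of the graph); any other transition passes through the hub and pays the join cost.

```python
from typing import Dict, List, Tuple

# A segment is identified by (source utterance index, position in that utterance).
Segment = Tuple[int, int]


def select_segments(
    corpus: List[List[str]],
    target: List[str],
    join_cost: float = 1.0,  # assumed constant transition score; must be > 0
) -> Tuple[float, List[Segment]]:
    """Return (total cost, segment sequence) of a cheapest path spelling `target`."""
    # Index every corpus segment by its unit label.
    by_label: Dict[str, List[Segment]] = {}
    for u, labels in enumerate(corpus):
        for i, label in enumerate(labels):
            by_label.setdefault(label, []).append((u, i))

    # best[seg] = (cost, path) of the cheapest path that ends in seg
    # after matching the target units processed so far.
    best: Dict[Segment, Tuple[float, List[Segment]]] = {}
    for t, label in enumerate(target):
        candidates = by_label.get(label, [])
        if not candidates:
            raise ValueError(f"no segment labelled {label!r} in the corpus")
        new_best: Dict[Segment, Tuple[float, List[Segment]]] = {}
        if t == 0:
            for seg in candidates:
                new_best[seg] = (0.0, [seg])
        else:
            # Cheapest way to arrive at the hub: computed once per target unit,
            # not once per pair of segments.
            hub_cost, hub_path = min(best.values(), key=lambda cp: cp[0])
            hub_cost += join_cost
            for u, i in candidates:
                cost, path = hub_cost, hub_path
                prev = (u, i - 1)
                # Contiguous continuation inside the same source utterance is free.
                if prev in best and best[prev][0] <= cost:
                    cost, path = best[prev]
                new_best[(u, i)] = (cost, path + [(u, i)])
        best = new_best
    return min(best.values(), key=lambda cp: cp[0])


# Two toy source utterances, labelled letter by letter for readability.
corpus = [["h", "e", "l", "l"], ["o", "k"]]
cost, path = select_segments(corpus, ["h", "e", "l", "l", "o"])
```

In this toy run the selected path uses all four segments of the first utterance and then makes one join into the second utterance. Because every join goes through the shared hub, the work per target unit is proportional to the number of candidate segments rather than to the number of segment pairs.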
Abstract
A method for concatenative speech synthesis includes a processing stage that selects segments based on their symbolic labeling in an efficient graph-based search, which uses a finite-state transducer formalism. This graph-based search uses a representation of concatenation constraints and costs that does not necessarily grow with the size of the source corpus, thereby limiting the increase in computation required for the search as the size of the source corpus increases. In one application of this method, multiple alternative segment sequences are generated and a best segment sequence is then selected using characteristics that depend on specific signal characteristics of the segments.
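The two-stage application mentioned at the end of the abstract can be sketched as a reranking step. This is a hedged illustration only: the function `rerank`, the `PITCH` table, the pitch-discontinuity cost, and the `weight` parameter are hypothetical and do not come from the patent. The sketch assumes the symbolic search has already produced several candidate segment sequences with their scores, and it then picks the candidate whose combined symbolic and signal-based cost is lowest.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A segment is identified by (source utterance index, position in that utterance).
Segment = Tuple[int, int]


def rerank(
    candidates: Sequence[Tuple[float, List[Segment]]],
    signal_cost: Callable[[List[Segment]], float],
    weight: float = 1.0,
) -> Tuple[float, List[Segment]]:
    """Pick the candidate with the lowest symbolic + weighted signal-based cost."""
    rescored = [(sym + weight * signal_cost(path), path) for sym, path in candidates]
    return min(rescored, key=lambda cp: cp[0])


# Hypothetical per-segment pitch values in Hz, standing in for measured
# signal characteristics of the stored segments.
PITCH: Dict[Segment, float] = {
    (0, 0): 120.0,
    (0, 1): 122.0,
    (1, 0): 180.0,
    (1, 1): 125.0,
}


def pitch_jump_cost(path: List[Segment]) -> float:
    """Sum of absolute pitch differences between neighbouring segments."""
    return sum(abs(PITCH[a] - PITCH[b]) for a, b in zip(path, path[1:]))


# Two alternative sequences from the symbolic search, with their symbolic scores.
alternatives = [
    (1.0, [(0, 0), (1, 0)]),
    (1.5, [(0, 0), (1, 1)]),
]
best_cost, best_path = rerank(alternatives, pitch_jump_cost, weight=0.01)
```

Here the second alternative wins despite its worse symbolic score, because its join has a much smaller pitch jump. Keeping the signal-dependent cost out of the graph search means it is evaluated only on the few alternatives that survive the symbolic stage.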
14 Claims
14. A computer-readable medium comprising instructions for causing a computer to perform functions comprising selecting segments from a corpus of source utterances for synthesizing a target utterance, wherein selecting the segments comprises:
- searching a precomputed graph in which each path through the graph identifies a sequence of segments of the corpus of source utterances and a corresponding sequence of unit and transition labels that characterizes a pronunciation of a concatenation of that sequence of segments, each path being associated with a numerical score that characterizes a quality of the sequence of segments;
wherein searching the precomputed graph includes matching a pronunciation of the target utterance represented by unit labels and transition labels to paths through the graph, and selecting segments for synthesizing the target utterance based on the numerical scores of matching paths through the graph;
the precomputed graph includes a first part that encodes a sequence of segments and a corresponding sequence of unit labels for each of the source utterances, and a second part computed in advance of run-time when the target utterance is known that includes paths for coupling segments of the source utterances and encodes allowable transitions between segments of different source utterances and encodes a transition score for each of those transitions; and
matching the pronunciation of the target utterance to paths through the graph includes considering paths in which each transition between segments of different source utterances identified by that path corresponds to a different subpath of that path that passes through the second part of the graph.
Specification