Synthesis-based pre-selection of suitable units for concatenative speech
Abstract
A method and system for providing concatenative speech uses a speech synthesis input to populate a triphone-indexed database that is later used for searching and retrieval to create a phoneme string acceptable for a text-to-speech operation. Prior to initiating the “real time” synthesis, a database is created of all possible triphone contexts by inputting a continuous stream of speech. The speech data is then analyzed to identify all possible triphone sequences in the stream, and the various units chosen for each context. During a later text-to-speech operation, the triphone contexts in the text are identified and the triphone-indexed phonemes in the database are searched to retrieve the best-matched candidates.
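The lookup step described in the abstract can be illustrated with a minimal Python sketch. It assumes the triphone-indexed database is already available as a dictionary keyed by (left, center, right) phoneme tuples. The names `preselect`, `table`, `fallback`, and the `"sil"` padding symbol are hypothetical, and the fallback to a context-independent list for unseen triphones is an assumption of this sketch, not something the abstract specifies.

```python
def preselect(phonemes, table, fallback, pad="sil"):
    """Return one sorted list of candidate unit numbers per input phoneme."""
    # Pad with silence so the first and last phonemes also have a triphone context.
    padded = [pad] + list(phonemes) + [pad]
    candidates = []
    for i in range(1, len(padded) - 1):
        key = (padded[i - 1], padded[i], padded[i + 1])
        units = table.get(key)
        if not units:
            # Assumption: unseen triphones fall back to a per-phoneme list.
            units = fallback.get(padded[i], [])
        candidates.append(sorted(units))
    return candidates


# Toy data: unit numbers are invented for illustration only.
table = {("sil", "k", "ae"): {17, 23}, ("k", "ae", "t"): {42}}
fallback = {"t": [8, 9, 11]}
result = preselect(["k", "ae", "t"], table, fallback)
print(result)  # [[17, 23], [42], [8, 9, 11]]
```

Each inner list is the reduced candidate set for one phoneme; a later cost-based search would choose one unit from each list.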
333 Citations
10 Claims
1. A method of synthesizing speech from text input using unit selection, the method comprising the steps of:
a) creating a triphone preselection database from an input stream of speech synthesis by collecting units observed to occur in particular triphone contexts, a triphone comprising a sequence of three phoneme units;
b) receiving a stream of input text to be synthesized;
c) converting the received input text into a sequence of phonemes by parsing the input text into identifiable syntactic phrases;
d) comparing the sequence of phonemes formed in step c), also considering neighboring phonemes so as to form input triphones, to a plurality of commonly occurring triphones stored in the triphone preselection database to select a plurality of N phoneme units as candidates for synthesis;
e) selecting a set of candidates of step d) by applying a cost process to each path through the plurality of N phoneme units associated with each phoneme sequence and choosing a least cost set of phoneme units;
f) processing the least cost phoneme units selected in step e) into synthesized speech; and
g) outputting the synthesized speech to an output device.
2. The method as defined in claim 1 wherein in performing step a), the following steps are performed:
1) providing a continuous input stream of synthesized speech for a predetermined time period t;
2) parsing the speech input stream into phoneme units;
3) finding the unique database unit number associated with each phoneme;
4) identifying all possible triphone combinations from the parsed phonemes; and
5) tabulating unit numbers for the identified phonemes so as to index the database by the identified triphones.
3. The method as defined in claim 2 wherein in performing step a1), the continuous input stream continues for a time period of approximately two weeks.
4. The method as defined in claim 1 wherein in performing step c), the converting process uses half-phonemes to create phoneme sequences, with unit spacing between adjacent half-phonemes.
5. The method as defined in claim 1 wherein in performing step e), a Viterbi search mechanism is used.
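As an illustration of the least-cost path selection that claim 5 assigns to a Viterbi search, the following Python sketch runs a dynamic-programming pass over per-phoneme candidate lists. The function name `viterbi_select` and the `target_cost` and `join_cost` callables are hypothetical; the claims do not define the cost functions, so the toy costs below are assumptions chosen only to make the example checkable. The sketch also assumes every candidate list is non-empty.

```python
def viterbi_select(candidates, target_cost, join_cost):
    """Return (total_cost, path) for the cheapest path through the candidate lists."""
    # best holds, for each candidate at the current position,
    # the cheapest cost of any path ending there and that path.
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for u in candidates[i]:
            # Cheapest way to reach u from any candidate at the previous position.
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + target_cost(i, u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])


# Toy costs (assumptions): no target cost, join cost grows with unit-number distance.
candidates = [[17, 23], [42], [8, 9]]
total, path = viterbi_select(
    candidates,
    target_cost=lambda i, u: 0.0,
    join_cost=lambda a, b: abs(a - b),
)
print(total, path)  # 52.0 [23, 42, 9]
```

Storing the full path with each entry keeps the sketch short; a production search would normally keep back-pointers instead.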
6. A method of creating a triphone preselection database for use in generating synthesized speech from a stream of input text, the method comprising the steps of:
a) providing a continuous input stream of synthesized speech for a predetermined time period t;
b) parsing the speech input stream into phoneme units;
c) finding the unique database unit number associated with each phoneme;
d) identifying all possible triphone combinations from the parsed phonemes; and
e) tabulating unit numbers for the identified phonemes so as to index the database by the identified triphones.
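Steps b) through e) of claim 6 can be sketched in Python as follows. The sketch assumes the synthesized stream has already been parsed into (phoneme, unit number) pairs in the order they were produced; the name `build_triphone_table`, the `"sil"` symbol, and the unit numbers are hypothetical illustrations, not values from the patent.

```python
from collections import defaultdict


def build_triphone_table(stream):
    """Map each observed (left, center, right) triphone to the unit numbers chosen for its center."""
    table = defaultdict(set)
    # Slide a three-phoneme window over the stream; the middle element is the
    # unit that was actually selected in that left/right context.
    for (left, _), (mid, unit_id), (right, _) in zip(stream, stream[1:], stream[2:]):
        table[(left, mid, right)].add(unit_id)
    return table


# Toy stream: the same word synthesized twice with partly different units.
stream = [
    ("sil", 0), ("k", 17), ("ae", 42), ("t", 8),
    ("sil", 0), ("k", 23), ("ae", 42), ("t", 9), ("sil", 0),
]
table = build_triphone_table(stream)
print(sorted(table[("sil", "k", "ae")]))  # [17, 23]
```

Running the synthesizer over a longer stream simply adds more unit numbers to each triphone entry, which is why the claims tie the database contents to the duration of the input stream.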
8. A system for synthesizing speech using phonemes, comprising:
a linguistic processor for receiving input text and converting said text into a sequence of phonemes;
a database of indexed phonemes, the index based on precalculated costs of phonemes in various triphone sequences;
a unit selector, coupled to both the linguistic processor and the triphone database, for comparing each received phoneme, including its triphone context, to the indexed phonemes in said database and selecting a set of candidate phonemes for synthesis; and
a speech processor, coupled to the unit selector, for processing selected candidate phonemes into synthesized speech and providing as an output the synthesized speech to an output device.
Specification