Synthesising speech by converting phonemes to digital waveforms
Abstract
This invention relates to the generation of synthetic speech and specifically to the production of a digital waveform from a text in phonemes. The invention uses a linked database which comprises an extended text in phonemes and its equivalent in the form of a digital waveform. The two portions of the database are linked by a parameter which establishes equivalent points in both the phoneme text and the digital waveform. The input text (in phonemes) is analysed to locate matching portions in the phoneme portion of the database. This matching utilises exact equivalence of phonemes where this is possible; otherwise a relation between phonemes is utilised. The selection process identifies input phonemes in context, whereby improved conversions are obtained. Once the input text has been analysed into matching strings in the input section of the database, beginning and ending parameters for the sections are established. The output is produced by abutting the sections of the digital waveform defined by the beginning and ending parameters.
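The linked database described in the abstract can be pictured with a small sketch. The Python below is illustrative only and is not taken from the patent: the names `LinkedDatabase`, `boundaries`, `locate`, `slice_waveform` and `abut` are hypothetical, and using sample indices as the location parameter is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class LinkedDatabase:
    """Toy linked database: phoneme text and waveform linked by position."""

    phonemes: List[str]    # extended text in phonemes (input section)
    boundaries: List[int]  # boundaries[i] = sample index where phoneme i starts;
                           # the final entry marks the end of the waveform
    waveform: List[int]    # extended digital waveform (output section)

    def locate(self, start: int, end: int) -> Tuple[int, int]:
        """Map phoneme positions [start, end) to beginning and ending
        location parameters (assumed here to be sample indices)."""
        return self.boundaries[start], self.boundaries[end]

    def slice_waveform(self, start: int, end: int) -> List[int]:
        """Return the waveform portion defined by the two location parameters."""
        begin, finish = self.locate(start, end)
        return self.waveform[begin:finish]


def abut(db: LinkedDatabase, spans: Sequence[Tuple[int, int]]) -> List[int]:
    """Join the waveform portions for each phoneme span, preserving order."""
    output: List[int] = []
    for start, end in spans:
        output.extend(db.slice_waveform(start, end))
    return output


# Four stored phonemes with made-up durations of 3, 4, 2 and 3 samples.
db = LinkedDatabase(
    phonemes=["k", "a", "t", "s"],
    boundaries=[0, 3, 7, 9, 12],
    waveform=list(range(12)),
)

# Retrieve the stored "t" followed by the stored "a" and abut them.
result = abut(db, [(2, 3), (1, 2)])
print(result)
```

The sample values are simply `0..11` so that the retrieved portions are easy to trace back to their positions in the stored waveform.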
13 Claims
1. A method of converting an input signal representing a text in phonemes into an output digital waveform signal convertible into an acoustic waveform corresponding to said text, wherein said method comprises:
(a) dividing said input signal into input segments, each of which is stored in an access section of a linked database;
(b) for each input segment identified in step (a), retrieving an output segment of said digital waveform from an output section of the database, said output segment being that which is linked to the input segment; and
(c) joining the digital output segments retrieved in step (b), said output segments being kept in the same order as the respectively associated input segments whereby the resulting output digital signal is a waveform corresponding to the input signal waveform;
the output section of the database containing an extended digital waveform containing plural contextual occurrences of each of plural phonemes in extended speech representing signals of the phonemes to be converted and having a location parameter for identifying any point therein whereby the establishment of beginning and ending location parameters defines a portion of said extended digital waveform;
step (a) including establishing beginning and ending location parameters for segments of the input signal; and
step (c) including utilizing the parameters established in step (a) for retrieving a portion of stored digital waveform.
Dependent Claims (2, 3, 4)
(i) a top level containing single phonemes corresponding to a central phoneme of a window;
(ii) a second level which contains equivalents of the second and fourth phonemes of a window; and
(iii) a lowest level which contains the equivalents of the first and fifth phonemes of the window, whereby identification of a portion of the lowest level identifies a stored window of phonemes;
and wherein the comparing comprises:
selecting an exact match for the central phoneme of an input window from the top level of the hierarchy, selecting a best match for phonemes 2 and 4 from the second level of the hierarchy corresponding to the selected portion of the top level of the hierarchy and, finally, selecting from the lowest level of the hierarchy the best match for phonemes 1 and 5 from that portion of the lowest level which corresponds to the selection in the second level of the hierarchy.
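The three-level lookup recited above can be sketched with nested dictionaries. This is a minimal illustration under stated assumptions, not the patented implementation: `PHONEME_CLASS`, `similarity`, `build_hierarchy`, `best_pair` and `match_window` are hypothetical names, the class-based similarity score is invented for the example, and ties are resolved in favour of the first stored occurrence.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical phoneme classes used to score inexact matches (an assumption).
PHONEME_CLASS = {
    "p": "stop", "b": "stop", "t": "stop", "d": "stop", "k": "stop",
    "s": "fricative", "z": "fricative", "f": "fricative",
    "a": "vowel", "e": "vowel", "i": "vowel", "o": "vowel",
    "#": "silence",
}

Pair = Tuple[str, str]
Hierarchy = Dict[str, Dict[Pair, Dict[Pair, int]]]


def similarity(x: str, y: str) -> int:
    """2 for identical phonemes, 1 for the same class, 0 otherwise."""
    if x == y:
        return 2
    class_x = PHONEME_CLASS.get(x)
    return 1 if class_x is not None and class_x == PHONEME_CLASS.get(y) else 0


def build_hierarchy(text: Sequence[str]) -> Hierarchy:
    """Index every five-phoneme window of the stored text in three levels."""
    hierarchy: Hierarchy = {}
    for centre in range(2, len(text) - 2):
        p1, p2, p3, p4, p5 = text[centre - 2:centre + 3]
        level2 = hierarchy.setdefault(p3, {})        # top level: phoneme 3
        level3 = level2.setdefault((p2, p4), {})     # second level: 2 and 4
        level3.setdefault((p1, p5), centre)          # lowest level: 1 and 5
    return hierarchy


def best_pair(candidates, wanted: Pair) -> Pair:
    """Pick the stored pair most similar to the wanted pair."""
    return max(
        candidates,
        key=lambda pair: similarity(pair[0], wanted[0])
        + similarity(pair[1], wanted[1]),
    )


def match_window(hierarchy: Hierarchy, window: Sequence[str]) -> Optional[int]:
    """Return the stored position of the best window, or None if the
    central phoneme has no exact match at the top level."""
    p1, p2, p3, p4, p5 = window
    level2 = hierarchy.get(p3)                        # exact match required
    if level2 is None:
        return None
    level3 = level2[best_pair(level2, (p2, p4))]      # best match, 2 and 4
    return level3[best_pair(level3, (p1, p5))]        # best match, 1 and 5


stored: List[str] = list("#kat#sad#")
hierarchy = build_hierarchy(stored)
print(match_window(hierarchy, list("#zad#")))
```

In this toy data the window for "zad" has no exact stored counterpart, so the second level falls back to the stored "sad" context because `z` and `s` share a class under the assumed scoring.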
5. A method of converting an input signal into an output signal, wherein:
(a) said input signal represents a text in phonemes;
(b) said output signal is a digital waveform convertible into an acoustic waveform corresponding to said text;
(c) a database is used having an input section and an output section;
(d) said output section containing an extended digital waveform having a location parameter for identifying any point therein whereby the establishment of beginning and ending location parameters defines a portion of said extended digital waveform;
(e) said input section containing segments of an extended phoneme text corresponding to the extended waveform contained in the output section;
said method comprising the steps of:
(i) dividing said input signal into input segments;
(ii) matching said input segments with segments contained in the input section of the database thereby establishing beginning and ending location parameters;
(iii) retrieving from the output section of said database segments of extended digital waveform corresponding to said beginning and ending location parameters; and
(iv) joining the output segments of digital waveform so retrieved, said segments being kept in the same order as the corresponding input segments.
6. A method of converting an input signal into an output signal, wherein:
(a) said input signal represents an input text in phonemes;
(b) said output signal is a digital waveform convertible into an acoustic waveform corresponding to said input text;
(c) a database is used having an input section and an output section;
(d) said output section containing an extended digital waveform having a location parameter for identifying any point therein whereby the establishment of beginning and ending location parameters defines a portion of said extended digital waveform;
(e) said input section defining context windows of an extended phoneme text corresponding to the extended waveform contained in the output section;
said method comprising the steps of:
(i) dividing said input signal into input segments;
(ii) matching said input segments with context windows contained in the input section of the database thereby establishing beginning and ending location parameters;
(iii) retrieving from the output section of said database segments of extended waveform corresponding to said beginning and ending location parameters; and
(iv) joining the output segments of a digital waveform, said joined segments being kept in the same order as the corresponding input segments.
Dependent Claims (7, 8)
the context windows are stored in three hierarchical levels comprising:
(i) a top level defining single phonemes corresponding to the third phoneme of a window;
(ii) a second level which defines equivalents of the second and fourth phonemes of a window; and
(iii) a lowest level which defines equivalents of the first and fifth phonemes of the window, whereby identification of a portion of the lowest level identifies a stored window of phonemes; and
the matching step comprises:
selecting an exact match for the third phoneme of the input window from a first level of the hierarchy, selecting a best match for the second and fourth phonemes from a second level of the hierarchy corresponding to the earlier selected portion of the top level of the hierarchy and, finally, selecting from the lowest level of the hierarchy a best match for the first and fifth phonemes from that portion of the lowest level which corresponds to the earlier selection in the second level of the hierarchy.
9. A method of converting a string of input phoneme text signals into an output digital waveform signal representing acoustic speech, said method comprising the steps of:
(a) storing extended digital speech waveform signals, representing plural utterances of each phoneme to be converted, in a corresponding plurality of speech contexts with different preceding and/or succeeding phonemes;
(b) dividing an input string of phonemes into input subsets of N contiguous phonemes, N being an integer;
(c) matching each said input subset with a most similar corresponding subset of N contiguous phonemes in said stored extended digital speech waveform;
(d) selecting a portion of the stored extended digital speech waveform corresponding to at least one phoneme of the match subset; and
repeating at least steps (c) and (d) while concatenating the thus-selected portions of the extended digital speech waveform to provide said converted output digital waveform signal representing acoustic speech.
Dependent Claims (10, 11)
N equals an odd integer equal to three or greater and wherein a hierarchical database is maintained with:
(i) a top level containing single phonemes corresponding to the center or (N+1)/2 phoneme of each subset;
(ii) at least one lower level containing plural phonemes that are contiguous to the center phoneme of each subset; and
said matching step includes exactly matching a single input phoneme of a subset at the top level of the hierarchical database but only best approximating a match at the lower level(s) of the hierarchical database.
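The window-by-window conversion of claim 9, with the exact centre match of the dependent claims, can be sketched end to end as follows. This is an assumption-laden illustration rather than the claimed implementation: the helper names `windows`, `most_similar` and `synthesise` are hypothetical, the stored text is scanned linearly instead of through a hierarchical database, the similarity score is a simple count of equal positions, and `#` is an assumed padding symbol for utterance boundaries.

```python
from typing import List, Optional, Sequence


def windows(phonemes: Sequence[str], n: int, pad: str = "#") -> List[List[str]]:
    """Return one window of n contiguous phonemes centred on each input phoneme."""
    half = n // 2
    padded = [pad] * half + list(phonemes) + [pad] * half
    return [padded[i:i + n] for i in range(len(phonemes))]


def most_similar(stored: Sequence[str], window: Sequence[str]) -> Optional[int]:
    """Return the stored index of the centre of the most similar window.

    The centre phoneme must match exactly; neighbours only contribute
    to the score. The first best-scoring occurrence wins ties.
    """
    half = len(window) // 2
    best_index, best_score = None, -1
    for centre in range(half, len(stored) - half):
        candidate = stored[centre - half:centre + half + 1]
        if candidate[half] != window[half]:
            continue
        score = sum(a == b for a, b in zip(candidate, window))
        if score > best_score:
            best_index, best_score = centre, score
    return best_index


def synthesise(text: Sequence[str], stored: Sequence[str],
               boundaries: Sequence[int], waveform: Sequence[int],
               n: int = 3) -> List[int]:
    """Concatenate the waveform portion of the centre phoneme of each match."""
    output: List[int] = []
    for window in windows(text, n):
        centre = most_similar(stored, window)
        if centre is None:
            raise ValueError(f"no stored occurrence of {window[n // 2]!r}")
        output.extend(waveform[boundaries[centre]:boundaries[centre + 1]])
    return output


# Toy stored speech: nine phonemes of two samples each (made-up values).
stored = list("#kat#tak#")
boundaries = list(range(0, 20, 2))
waveform = list(range(18))

output = synthesise(list("ta"), stored, boundaries, waveform)
print(output)
```

Both `t` and `a` occur twice in the toy stored text; the sketch picks the occurrences from "tak" because their neighbouring phonemes agree with the input context more closely than those in "kat".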
12. A method for converting an input signal representing an input text in phonemes into an output digital waveform signal which is, in turn, convertible into an acoustic waveform corresponding to said input text, said method utilizing a linked database having an output section containing an extended digital waveform corresponding to an extended text in phonemes, said text including plural occurrences of individual phonemes in different contexts whereby the extended digital waveform includes plural digital waveforms for the same phoneme in different contexts and said linked database having a location parameter for identifying any point in said extended text and an equivalent point in the extended digital waveform, whereby the establishment of beginning and ending parameters in the extended text defines a portion of said digital waveform, said method including:
(a) dividing said input signal into input segments corresponding to portions of digital waveform contained in the output section of the linked database;
(b) establishing beginning and ending parameters for input segments identified in step (a);
(c) utilizing parameters established in step (b) for retrieving portions of stored digital waveform; and
(d) joining the portions retrieved in step (c) in the same order as the respective input segments to produce said output digital waveform signal convertible into said acoustic waveform.
13. A method for converting an input signal representing an input text in phonemes into an output digital waveform signal which is, in turn, convertible into an acoustic waveform corresponding to said text, said method utilizing a linked database having an input section and an output section wherein the input section contains signals representing an extended text in phonemes including plural occurrences of individual phonemes in different contexts and the output section contains an extended digital waveform corresponding to the extended text of the input section of the database and having a location parameter for identifying any point in said extended text whereby the establishment of beginning and ending parameters defines a portion of said digital waveform, said method including:
(a) dividing said input signal into input segments containing input phonemes;
(b) comparing said input phonemes with the extended text contained in the input section of the database to identify the plural occurrences of said input phonemes and selecting from said plural occurrences of said input phonemes closest contexts based on the respective input segments, whereby beginning and ending parameters corresponding to input phonemes are established;
(c) utilizing the parameters established in step (b) for retrieving portions of stored digital waveform corresponding to input phonemes;
(d) joining the portions retrieved in step (c) in the same order as the respective input phonemes to produce said output digital waveform signal convertible into said acoustic waveform.
Specification