Method and apparatus for performing text-to-speech conversion in a client/server environment
First Claim
1. A method for performing text-to-speech conversion comprising the steps of:
- analyzing input text and producing therefrom an intermediate representation thereof; and
synthesizing speech output based upon said intermediate representation of said input text, wherein said analyzing and producing step is performed on a server within a client/server environment, and wherein said synthesizing step is performed on a client device which is associated with but distinct from said server, wherein said synthesizing step produces said speech output further based upon a set of acoustic units comprised in a dynamic cache memory associated with said client device, the method further comprising the steps of;
selecting a subset of acoustic units from an acoustic unit database associated with said server, wherein said subset of acoustic units is selected based on said intermediate representation of said input text and on a determination of which acoustic units will be needed and which acoustic units will not be needed to synthesize the speech output from the intermediate representation of said input text;
transmitting one or more of said acoustic units comprised in said Subset across a communications channel from said server to said client device; and
storing said one or more of said acoustic units in said dynamic cache memory.
3 Assignments
0 Petitions
Accused Products
Abstract
A method and apparatus for performing text-to-speech conversion in a client/server environment partitions an otherwise conventional text-to-speech conversion algorithm into two portions: a first “text analysis” portion, which generates from an original input text an intermediate representation thereof and a second “speech synthesis” portion, which synthesizes speech waveforms from the intermediate representation generated by the first portion (i.e., the text analysis portion) The text analysis portion of the algorithm is executed exclusively on a server while the speech synthesis portion is executed exclusively on a client which may be associated therewith. The client may comprise a hand-held device such as, for example, a cell phone, and the intermediate representation of the input text advantageously comprises at least a sequence of phonemes representative of the input text. Certain audio segment information which is to be used by the speech synthesis portion of the text-to-speech process may be advantageously transmitted by the server to the client, and a cache of such audio segments may then be advantageously maintained at the client (e.g., in the cell phone) for use by the speech synthesis process in order to obtain improved quality of the synthesized speech.
-
Citations
46 Claims
-
1. A method for performing text-to-speech conversion comprising the steps of:
-
analyzing input text and producing therefrom an intermediate representation thereof; and
synthesizing speech output based upon said intermediate representation of said input text, wherein said analyzing and producing step is performed on a server within a client/server environment, and wherein said synthesizing step is performed on a client device which is associated with but distinct from said server, wherein said synthesizing step produces said speech output further based upon a set of acoustic units comprised in a dynamic cache memory associated with said client device, the method further comprising the steps of;
selecting a subset of acoustic units from an acoustic unit database associated with said server, wherein said subset of acoustic units is selected based on said intermediate representation of said input text and on a determination of which acoustic units will be needed and which acoustic units will not be needed to synthesize the speech output from the intermediate representation of said input text;
transmitting one or more of said acoustic units comprised in said Subset across a communications channel from said server to said client device; and
storing said one or more of said acoustic units in said dynamic cache memory. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
-
-
13. A method for performing a second portion of a text-to-speech conversion process, the method executed on a client device within a client/server environment and comprising the step of synthesizing speech output based upon an intermediate representation of input text, said intermediate representation of said input text having been produced by a first portion of said text-to-speech conversion process executed on a server which is associated with but distinct from said client device,
wherein said synthesizing step produces said speech output further based upon a set of acoustic units comprised in a dynamic cache memory associated with said client device, the method further comprising the steps of: -
receiving one or more acoustic units which have been selected from an acoustic unit database associated with said server and transmitted across a communications channel from said server to said client device, wherein said subset of acoustic units were selected based on said intermediate representation of said input text and on a determination of which acoustic unit will be needed and which acoustic units will not be needed to synthesize the speech output from the intermediate representation of said input text; and
storing said one or more acoustic units in said dynamic cache memory. - View Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
-
-
24. A system for performing text-to-speech conversion comprising:
-
a text analysis module which analyzes input text and produces therefrom an intermediate representation thereof; and
a speech synthesis module which synthesizes speech output based upon said intermediate representation of said input text, wherein said text analysis module resides on a server within a client/server environment, and wherein said speech synthesis module resides on a client device which is associated with but distinct from said server. wherein said speech synthesis module produces said speech output further based upon a set acoustic units comprised in a dynamic cache memory associated with said client device, the system further comprising;
means for selecting a subset of acoustic units from an acoustic unit database associated with said server, wherein said subset of acoustic units is selected based on said intermediate representation of said input text and on a determination of which acoustic units will be needed and which acoustic units will not be needed to synthesize the speech output from the intermediate representation of said input text;
means for transmitting one or more of said acoustic units across a communications channel from said server to said client device; and
means for storing said one or more acoustic units in said dynamic cache memory. - View Dependent Claims (25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35)
-
-
36. A client device within a client/server environment which performs a second portion of a text-to-speech conversion process, the client device comprising a speech synthesis module which synthesizes speech output based upon an intermediate representation of input text, said intermediate representation of said input text having been produced by a first portion of said text-to-speech conversion process executed on a server which is associated with but distinct from said client device,
wherein said speech synthesis module produces said speech output further based upon a set of acoustic units comprised in a dynamic cache memory associated with said client device, the client device further comprising: -
means for receiving one or more acoustic units which have been selected from an acoustic unit database associated with said server and transmitted across a communications channel from said server to said client device, wherein said subset of acoustic units was selected based on said intermediate representation of said input text and on a determination of which acoustic units will be needed and which acoustic units will not be needed to synthesize the speech output from the intermediate representation of said input text; and
means for storing said one or more acoustic units in said dynamic cache memory. - View Dependent Claims (37, 38, 39, 40, 41, 42, 43, 44, 45, 46)
-
Specification