Method and apparatus for performing text-to-speech conversion in a client/server environment

US 6,625,576 B2
Filed: 01/29/2001
Issued: 09/23/2003
Est. Priority Date: 01/29/2001
Status: Expired due to Term



Abstract

A method and apparatus for performing text-to-speech conversion in a client/server environment partitions an otherwise conventional text-to-speech conversion algorithm into two portions: a first “text analysis” portion, which generates from an original input text an intermediate representation thereof, and a second “speech synthesis” portion, which synthesizes speech waveforms from the intermediate representation generated by the first portion (i.e., the text analysis portion). The text analysis portion of the algorithm is executed exclusively on a server while the speech synthesis portion is executed exclusively on a client which may be associated therewith. The client may comprise a hand-held device such as, for example, a cell phone, and the intermediate representation of the input text advantageously comprises at least a sequence of phonemes representative of the input text. Certain audio segment information which is to be used by the speech synthesis portion of the text-to-speech process may be advantageously transmitted by the server to the client, and a cache of such audio segments may then be advantageously maintained at the client (e.g., in the cell phone) for use by the speech synthesis process in order to obtain improved quality of the synthesized speech.
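
The two-portion partition described in the abstract can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the `PhonemeSpec` record, the `TOY_LEXICON` table, the fixed duration and pitch values, and the `analyze_text` and `synthesize` functions do not come from the patent, and the client-side step only returns labels where a real synthesizer would produce waveforms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeSpec:
    """One entry of a hypothetical intermediate representation."""
    phoneme: str       # phoneme symbol
    duration_ms: int   # associated time duration (illustrative value)
    pitch_hz: float    # associated pitch level (illustrative value)


# Hypothetical toy lexicon; a real text-analysis front end is far richer.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def analyze_text(text: str) -> List[PhonemeSpec]:
    """Server side: turn input text into an intermediate representation."""
    specs: List[PhonemeSpec] = []
    for word in text.lower().split():
        # Words missing from the toy lexicon are silently skipped.
        for ph in TOY_LEXICON.get(word, []):
            specs.append(PhonemeSpec(ph, duration_ms=80, pitch_hz=120.0))
    return specs


def synthesize(specs: List[PhonemeSpec]) -> List[str]:
    """Client side: stand-in for waveform generation; returns text labels."""
    return [f"{s.phoneme}:{s.duration_ms}ms@{s.pitch_hz:.0f}Hz" for s in specs]


# The server produces the representation; only that is handed to the client.
ir = analyze_text("hello world")
output = synthesize(ir)
```

The point of the split is that the client never sees the raw text or the analysis logic; it consumes only the phoneme-level representation.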


46 Claims

1. A method for performing text-to-speech conversion comprising the steps of:
- analyzing input text and producing therefrom an intermediate representation thereof; and
  
  synthesizing speech output based upon said intermediate representation of said input text, wherein said analyzing and producing step is performed on a server within a client/server environment, and wherein said synthesizing step is performed on a client device which is associated with but distinct from said server, wherein said synthesizing step produces said speech output further based upon a set of acoustic units comprised in a dynamic cache memory associated with said client device, the method further comprising the steps of:
  
  selecting a subset of acoustic units from an acoustic unit database associated with said server, wherein said subset of acoustic units is selected based on said intermediate representation of said input text and on a determination of which acoustic units will be needed and which acoustic units will not be needed to synthesize the speech output from the intermediate representation of said input text;
  
  transmitting one or more of said acoustic units comprised in said subset across a communications channel from said server to said client device; and
  
  storing said one or more of said acoustic units in said dynamic cache memory.
- - 2. The method of claim 1 further comprising the step of transmitting said intermediate representation of said input text across a communications channel from said server to said client device.
  - 3. The method of claim 2 wherein said communications channel comprises a wireless communications channel and wherein said client device comprises a wireless communications device.
  - 4. The method of claim 3 wherein said client device comprises a cell phone.
  - 5. The method of claim 1 wherein said one or more of said acoustic units which are transmitted from said server system to said client system are determined based on a model of said cache memory associated with said client device which is maintained in association with said server.
  - 6. The method of claim 1 further comprising the step of storing said intermediate representation of said input text on a storage device and wherein said synthesizing step retrieves said intermediate representation of said input text from said storage device.
  - 7. The method of claim 6 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 8. The method of claim 7 wherein said intermediate representation further comprises one or more acoustic units.
  - 9. The method of claim 1 wherein said input text comprises e-mail and wherein said synthesizing step is performed upon access of said e-mail by an intended recipient thereof.
  - 10. The method of claim 1 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 11. The method of claim 10 wherein said intermediate representation of said input text further comprises a set of corresponding time durations associated with said sequence of phonemes.
  - 12. The method of claim 10 wherein said intermediate representation of said input text further comprises a set of corresponding pitch levels associated with said sequence of phonemes.
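
The selecting and transmitting steps of claim 1, combined with the server-maintained cache model of claim 5, can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: acoustic units are keyed by phoneme symbol for simplicity, the cache model is a plain set of unit identifiers, and `select_units_to_send` is a hypothetical name.

```python
from typing import Dict, List, Set


def select_units_to_send(
    phonemes: List[str],
    unit_database: Dict[str, bytes],
    client_cache_model: Set[str],
) -> Dict[str, bytes]:
    """Pick the units this utterance needs that the client is believed to lack.

    phonemes           -- phoneme sequence from the intermediate representation
    unit_database      -- server-side acoustic unit database (id -> audio bytes)
    client_cache_model -- server's model of what the client cache holds
    """
    # Units needed for this utterance, in first-use order, without repeats.
    needed: List[str] = []
    for ph in phonemes:
        if ph in unit_database and ph not in needed:
            needed.append(ph)

    # Units the utterance does not need, or that the model says are already
    # cached on the client, are left out of the transmission.
    to_send = {u: unit_database[u] for u in needed if u not in client_cache_model}

    # Assumption: every transmitted unit ends up stored on the client.
    client_cache_model.update(to_send)
    return to_send


db = {"HH": b"\x01", "AH": b"\x02", "L": b"\x03", "OW": b"\x04", "Z": b"\x05"}
model = {"L"}  # the server believes the client already holds unit "L"
sent = select_units_to_send(["HH", "AH", "L", "OW"], db, model)
resent = select_units_to_send(["HH", "AH", "L", "OW"], db, model)
```

On the second call nothing is transmitted, because the model now records all four units as cached. A real model would also have to track evictions on the client, which this sketch ignores.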

13. A method for performing a second portion of a text-to-speech conversion process, the method executed on a client device within a client/server environment and comprising the step of synthesizing speech output based upon an intermediate representation of input text, said intermediate representation of said input text having been produced by a first portion of said text-to-speech conversion process executed on a server which is associated with but distinct from said client device, wherein said synthesizing step produces said speech output further based upon a set of acoustic units comprised in a dynamic cache memory associated with said client device, the method further comprising the steps of:
- receiving one or more acoustic units which have been selected from an acoustic unit database associated with said server and transmitted across a communications channel from said server to said client device, wherein said subset of acoustic units was selected based on said intermediate representation of said input text and on a determination of which acoustic units will be needed and which acoustic units will not be needed to synthesize the speech output from the intermediate representation of said input text; and
  
  storing said one or more acoustic units in said dynamic cache memory.
- - 14. The method of claim 13 further comprising the step of receiving said intermediate representation of said input text across a communications channel, said intermediate representation of said input text having been transmitted from said server to said client device.
  - 15. The method of claim 14 wherein said communications channel comprises a wireless communications channel and wherein said client device comprises a wireless communications device.
  - 16. The method of claim 15 wherein said client device comprises a cell phone.
  - 17. The method of claim 13 wherein said intermediate representation of said input text has been stored on a storage device, and wherein said synthesizing step retrieves said intermediate representation of said input text from said storage device.
  - 18. The method of claim 17 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 19. The method of claim 18 wherein said intermediate representation further comprises one or more acoustic units.
  - 20. The method of claim 13 wherein said input text comprises e-mail and wherein said synthesizing step is performed upon access of said e-mail by an intended recipient thereof.
  - 21. The method of claim 13 wherein said intermediate representation of said input text comprises a representation of at least a sequence of phonemes representative of said input text.
  - 22. The method of claim 21 wherein said intermediate representation of said input text further comprises a set of corresponding time durations associated with said sequence of phonemes.
  - 23. The method of claim 21 wherein said intermediate representation of said input text further comprises a set of corresponding pitch levels associated with said sequence of phonemes.
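
The client-side receiving and storing steps of claim 13 can be sketched with a small bounded cache. The claims say only that the cache memory is dynamic; the least-recently-used eviction policy, the `DynamicUnitCache` class, and the byte-concatenation stand-in for synthesis are all assumptions made for this example.

```python
from collections import OrderedDict
from typing import Dict, List, Optional


class DynamicUnitCache:
    """Bounded client-side store of acoustic units (LRU eviction assumed)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._units: "OrderedDict[str, bytes]" = OrderedDict()

    def store(self, received: Dict[str, bytes]) -> None:
        """Store units received from the server, evicting the oldest if full."""
        for unit_id, data in received.items():
            self._units[unit_id] = data
            self._units.move_to_end(unit_id)
            while len(self._units) > self.capacity:
                self._units.popitem(last=False)  # drop least recently used

    def get(self, unit_id: str) -> Optional[bytes]:
        """Return a cached unit and mark it as recently used."""
        data = self._units.get(unit_id)
        if data is not None:
            self._units.move_to_end(unit_id)
        return data


def synthesize_from_cache(phonemes: List[str], cache: DynamicUnitCache) -> bytes:
    """Stand-in for synthesis: concatenate cached units, skipping missing ones."""
    return b"".join(cache.get(ph) or b"" for ph in phonemes)


cache = DynamicUnitCache(capacity=2)
cache.store({"A": b"a", "B": b"b"})
cache.get("A")             # "A" becomes the most recently used unit
cache.store({"C": b"c"})   # cache is full, so "B" is evicted
audio = synthesize_from_cache(["A", "B", "C"], cache)
```

A real client would need a fallback for units missing from the cache, such as a lower-quality built-in unit set; here a miss simply contributes no audio.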

24. A system for performing text-to-speech conversion comprising:
- a text analysis module which analyzes input text and produces therefrom an intermediate representation thereof; and
  
  a speech synthesis module which synthesizes speech output based upon said intermediate representation of said input text, wherein said text analysis module resides on a server within a client/server environment, and wherein said speech synthesis module resides on a client device which is associated with but distinct from said server, wherein said speech synthesis module produces said speech output further based upon a set of acoustic units comprised in a dynamic cache memory associated with said client device, the system further comprising:
  
  means for selecting a subset of acoustic units from an acoustic unit database associated with said server, wherein said subset of acoustic units is selected based on said intermediate representation of said input text and on a determination of which acoustic units will be needed and which acoustic units will not be needed to synthesize the speech output from the intermediate representation of said input text;
  
  means for transmitting one or more of said acoustic units across a communications channel from said server to said client device; and
  
  means for storing said one or more acoustic units in said dynamic cache memory.
- - 25. The system of claim 24 further comprising means for transmitting said intermediate representation of said input text across a communications channel from said server to said client device.
  - 26. The system of claim 25 wherein said communications channel comprises a wireless communications channel and wherein said client device comprises a wireless communications device.
  - 27. The system of claim 26 wherein said client device comprises a cell phone.
  - 28. The system of claim 24 wherein said one or more of said acoustic units which are transmitted from said server system to said client system are determined based on a model of said cache memory associated with said client device which is maintained in association with said server.
  - 29. The system of claim 24 further comprising means for storing said intermediate representation of said input text on a storage device and wherein said speech synthesis module retrieves said intermediate representation of said input text from said storage device.
  - 30. The system of claim 29 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 31. The system of claim 30 wherein said intermediate representation further comprises one or more acoustic units.
  - 32. The system of claim 24 wherein said input text comprises e-mail and wherein said speech synthesis module executes upon access of said e-mail by an intended recipient thereof.
  - 33. The system of claim 24 wherein said intermediate representation of said input text comprises a representation of at least a sequence of phonemes representative of said input text.
  - 34. The system of claim 33 wherein said intermediate representation of said input text further comprises a set of corresponding time durations associated with said sequence of phonemes.
  - 35. The system of claim 33 wherein said intermediate representation of said input text further comprises a set of corresponding pitch levels associated with said sequence of phonemes.

36. A client device within a client/server environment which performs a second portion of a text-to-speech conversion process, the client device comprising a speech synthesis module which synthesizes speech output based upon an intermediate representation of input text, said intermediate representation of said input text having been produced by a first portion of said text-to-speech conversion process executed on a server which is associated with but distinct from said client device, wherein said speech synthesis module produces said speech output further based upon a set of acoustic units comprised in a dynamic cache memory associated with said client device, the client device further comprising:
- means for receiving one or more acoustic units which have been selected from an acoustic unit database associated with said server and transmitted across a communications channel from said server to said client device, wherein said subset of acoustic units was selected based on said intermediate representation of said input text and on a determination of which acoustic units will be needed and which acoustic units will not be needed to synthesize the speech output from the intermediate representation of said input text; and
  
  means for storing said one or more acoustic units in said dynamic cache memory.
- - 37. The client device of claim 36 further comprising means for receiving said intermediate representation of said input text across a communications channel, said intermediate representation of said input text having been transmitted from said server to said client device.
  - 38. The client device of claim 37 wherein said communications channel comprises a wireless communications channel and wherein said client device comprises a wireless communications device.
  - 39. The client device of claim 38 wherein said client device comprises a cell phone.
  - 40. The client device of claim 36 wherein said intermediate representation of said input text has been stored on a storage device, and wherein said speech synthesis module retrieves said intermediate representation of said input text from said storage device.
  - 41. The client device of claim 40 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 42. The client device of claim 41 wherein said intermediate representation further comprises one or more acoustic units.
  - 43. The client device of claim 36 wherein said input text comprises e-mail and wherein said speech synthesis module is executed upon access of said e-mail by an intended recipient thereof.
  - 44. The client device of claim 36 wherein said intermediate representation of said input text comprises a representation of at least a sequence of phonemes representative of said input text.
  - 45. The client device of claim 44 wherein said intermediate representation of said input text further comprises a set of corresponding time durations associated with said sequence of phonemes.
  - 46. The client device of claim 44 wherein said intermediate representation of said input text further comprises a set of corresponding pitch levels associated with said sequence of phonemes.
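
Claims 37 and 44 to 46 have the client receive an intermediate representation made up of a phoneme sequence with corresponding time durations and pitch levels. One way such a message might be encoded for the communications channel is sketched below. The JSON wire format, the field names, and both function names are assumptions; the patent does not specify an encoding.

```python
import json
from typing import Dict, List


def encode_intermediate(
    phonemes: List[str],
    durations_ms: List[int],
    pitch_hz: List[float],
) -> bytes:
    """Server side: pack the intermediate representation into one message."""
    if not (len(phonemes) == len(durations_ms) == len(pitch_hz)):
        raise ValueError("each phoneme needs exactly one duration and one pitch")
    payload = {
        "phonemes": phonemes,
        "durations_ms": durations_ms,
        "pitch_hz": pitch_hz,
    }
    # Compact separators keep the message small for a low-bandwidth channel.
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")


def decode_intermediate(message: bytes) -> Dict[str, list]:
    """Client side: recover the representation from the received bytes."""
    return json.loads(message.decode("utf-8"))


message = encode_intermediate(["HH", "AH"], [70, 90], [118.0, 125.5])
received = decode_intermediate(message)
```

The phoneme-level message is typically much smaller than the audio it describes, which is what makes it practical to send over a wireless channel to a cell phone.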


Current Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Inventors: Kochanski, Gregory P.; Shih, Chi-Lin; Olive, Joseph Philip
Primary Examiner(s): ABEBE, DANIEL DEMELASH
Application Number: US09/772,300
Publication Number: US 20020103646A1
Time in Patent Office: 967 Days
Field of Search: 704/260, 704/207, 704/258, 704/261, 704/270, 379/67.1, 379/88.16, 379/88.17, 455/413
US Class Current: 704/260
CPC Class Codes:
G10L 13/047 Architecture of speech synt...
G10L 13/08 Text analysis or generation...
