Method and apparatus for performing text-to-speech conversion in a client/server environment

US 20020103646A1
Filed: 01/29/2001
Published: 08/01/2002
Est. Priority Date: 01/29/2001
Status: Active Grant

Abstract

A method and apparatus for performing text-to-speech conversion in a client/server environment partitions an otherwise conventional text-to-speech conversion algorithm into two portions: a first “text analysis” portion, which generates from an original input text an intermediate representation thereof; and a second “speech synthesis” portion, which synthesizes speech waveforms from the intermediate representation generated by the first portion (i.e., the text analysis portion). The text analysis portion of the algorithm is executed exclusively on a server while the speech synthesis portion is executed exclusively on a client which may be associated therewith. The client may comprise a hand-held device such as, for example, a cell phone, and the intermediate representation of the input text advantageously comprises at least a sequence of phonemes representative of the input text. Certain audio segment information which is to be used by the speech synthesis portion of the text-to-speech process may be advantageously transmitted by the server to the client, and a cache of such audio segments may then be advantageously maintained at the client (e.g., in the cell phone) for use by the speech synthesis process in order to obtain improved quality of the synthesized speech.
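
The partition described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the toy lexicon, the fixed duration and pitch values, the JSON wire format, and the names `analyze_text`, `synthesize`, and `PhonemeSpec` are all assumptions introduced here, and the sine-tone "synthesis" is only a placeholder for a real speech synthesis back end.

```python
import json
import math
from dataclasses import dataclass, asdict
from typing import List

# Hypothetical toy lexicon; a real text analysis front end would use
# text normalization, a pronunciation dictionary, and prosody models.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


@dataclass
class PhonemeSpec:
    phoneme: str       # phoneme symbol
    duration_ms: int   # time duration associated with the phoneme
    pitch_hz: float    # pitch level associated with the phoneme


def analyze_text(text: str) -> str:
    """Server side: input text -> serialized intermediate representation."""
    specs = []
    for word in text.lower().split():
        for ph in TOY_LEXICON.get(word, []):
            # Fixed duration and pitch are placeholders for a prosody model.
            specs.append(PhonemeSpec(ph, 80, 120.0))
    return json.dumps([asdict(s) for s in specs])


def synthesize(payload: str, sample_rate: int = 8000) -> List[float]:
    """Client side: intermediate representation -> waveform samples.

    Placeholder only: emits one sine tone per phoneme instead of speech.
    """
    samples = []
    for item in json.loads(payload):
        spec = PhonemeSpec(**item)
        n = sample_rate * spec.duration_ms // 1000
        for i in range(n):
            samples.append(math.sin(2 * math.pi * spec.pitch_hz * i / sample_rate))
    return samples


payload = analyze_text("hello world")  # would run on the server
waveform = synthesize(payload)         # would run on the client
```

Only `payload` crosses the communications channel in this sketch, so the client never sees the original text and the server never produces audio.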

74 Claims

1. A method for performing text-to-speech conversion comprising the steps of:
- analyzing input text and producing therefrom an intermediate representation thereof; and
- synthesizing speech output based upon said intermediate representation of said input text, wherein said analyzing and producing step is performed on a server within a client/server environment, and wherein said synthesizing step is performed on a client device which is associated with but distinct from said server.
- Dependent claims:
  - 2. The method of claim 1 further comprising the step of transmitting said intermediate representation of said input text across a communications channel from said server to said client device.
  - 3. The method of claim 2 wherein said communications channel comprises a wireless communications channel and wherein said client device comprises a wireless communications device.
  - 4. The method of claim 3 wherein said client device comprises a cell phone.
  - 5. The method of claim 2 wherein said synthesizing step produces said speech output based upon a set of acoustic units, one or more of said acoustic units having been stored in a cache memory within said client device, the method further comprising the steps of transmitting one or more of said acoustic units across said communications channel from said server to said client device and storing said one or more acoustic units in said cache memory.
  - 6. The method of claim 5 wherein said one or more of said acoustic units which are transmitted from said server system to said client system are determined based on said input text and on a model of said cache memory of said client device which is maintained on said server.
  - 7. The method of claim 1 further comprising the step of storing said intermediate representation of said input text on a storage device and wherein said synthesizing step retrieves said intermediate representation of said input text from said storage device.
  - 8. The method of claim 7 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 9. The method of claim 8 wherein said intermediate representation further comprises one or more acoustic units.
  - 10. The method of claim 1 wherein said input text comprises e-mail and wherein said synthesizing step is performed upon access of said e-mail by an intended recipient thereof.
  - 11. The method of claim 1 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 12. The method of claim 11 wherein said intermediate representation of said input text further comprises a set of corresponding time durations associated with said sequence of phonemes.
  - 13. The method of claim 11 wherein said intermediate representation of said input text further comprises a set of corresponding pitch levels associated with said sequence of phonemes.

14. A method for performing a first portion of a text-to-speech conversion process, the method executed on a server within a client/server environment and comprising the steps of:
- analyzing input text and producing therefrom an intermediate representation thereof; and
- providing said intermediate representation of said input text for use by a second portion of said text-to-speech conversion process which is to be executed on a client device associated with but distinct from said server, said method not comprising any synthesis of speech output.
- Dependent claims:
  - 15. The method of claim 14 wherein the providing step comprises transmitting said intermediate representation of said input text across a communications channel from said server to said client device.
  - 16. The method of claim 15 wherein said communications channel comprises a wireless communications channel and wherein said client device comprises a wireless communications device.
  - 17. The method of claim 15 wherein said second portion of said text-to-speech conversion process employs a set of acoustic units, the method further comprising the step of transmitting one or more of said acoustic units across said communications channel from said server to said client device for use thereby.
  - 18. The method of claim 17 wherein said one or more of said acoustic units which are transmitted from said server system to said client system are determined based on said input text and on a model of a cache memory of said client device which is maintained on said server.
  - 19. The method of claim 14 further comprising the step of storing said intermediate representation of said input text on a storage device.
  - 20. The method of claim 19 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 21. The method of claim 20 wherein said intermediate representation further comprises one or more acoustic units.
  - 22. The method of claim 14 wherein said input text comprises e-mail and wherein said second portion of said text-to-speech conversion process is to be performed upon access of said e-mail by an intended recipient thereof.
  - 23. The method of claim 14 wherein said intermediate representation of said input text comprises a representation of at least a sequence of phonemes representative of said input text.
  - 24. The method of claim 23 wherein said intermediate representation of said input text further comprises a set of corresponding time durations associated with said sequence of phonemes.
  - 25. The method of claim 23 wherein said intermediate representation of said input text further comprises a set of corresponding pitch levels associated with said sequence of phonemes.

26. A method for performing a second portion of a text-to-speech conversion process, the method executed on a client device within a client/server environment and comprising the step of synthesizing speech output based upon an intermediate representation of input text, said intermediate representation of said input text having been produced by a first portion of said text-to-speech conversion process executed on a server which is associated with but distinct from said client device.
- Dependent claims:
  - 27. The method of claim 26 further comprising the step of receiving said intermediate representation of said input text across a communications channel, said intermediate representation of said input text having been transmitted from said server to said client device.
  - 28. The method of claim 27 wherein said communications channel comprises a wireless communications channel and wherein said client device comprises a wireless communications device.
  - 29. The method of claim 28 wherein said client device comprises a cell phone.
  - 30. The method of claim 27 wherein said synthesizing step produces said speech output based upon a set of acoustic units, one or more of said acoustic units having been stored in a cache memory within said client device, the method further comprising the steps of receiving one or more of said acoustic units which have been transmitted across said communications channel from said server to said client device and storing said one or more acoustic units in said cache memory.
  - 31. The method of claim 26 wherein said intermediate representation of said input text has been stored on a storage device, and wherein said synthesizing step retrieves said intermediate representation of said input text from said storage device.
  - 32. The method of claim 31 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 33. The method of claim 32 wherein said intermediate representation further comprises one or more acoustic units.
  - 34. The method of claim 26 wherein said input text comprises e-mail and wherein said synthesizing step is performed upon access of said e-mail by an intended recipient thereof.
  - 35. The method of claim 26 wherein said intermediate representation of said input text comprises a representation of at least a sequence of phonemes representative of said input text.
  - 36. The method of claim 35 wherein said intermediate representation of said input text further comprises a set of corresponding time durations associated with said sequence of phonemes.
  - 37. The method of claim 35 wherein said intermediate representation of said input text further comprises a set of corresponding pitch levels associated with said sequence of phonemes.

38. A system for performing text-to-speech conversion comprising:
- a text analysis module which analyzes input text and produces therefrom an intermediate representation thereof; and
- a speech synthesis module which synthesizes speech output based upon said intermediate representation of said input text, wherein said text analysis module resides on a server within a client/server environment, and wherein said speech synthesis module resides on a client device which is associated with but distinct from said server.
- Dependent claims:
  - 39. The system of claim 38 further comprising means for transmitting said intermediate representation of said input text across a communications channel from said server to said client device.
  - 40. The system of claim 39 wherein said communications channel comprises a wireless communications channel and wherein said client device comprises a wireless communications device.
  - 41. The system of claim 40 wherein said client device comprises a cell phone.
  - 42. The system of claim 39 wherein said speech synthesis module produces said speech output based upon a set of acoustic units, one or more of said acoustic units having been stored in a cache memory within said client device, the system further comprising means for transmitting one or more of said acoustic units across said communications channel from said server to said client device and means for storing said one or more acoustic units in said cache memory.
  - 43. The system of claim 42 wherein said one or more of said acoustic units which are transmitted from said server system to said client system are determined based on said input text and on a model of said cache memory of said client device which is maintained on said server.
  - 44. The system of claim 38 further comprising means for storing said intermediate representation of said input text on a storage device and wherein said speech synthesis module retrieves said intermediate representation of said input text from said storage device.
  - 45. The system of claim 44 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 46. The system of claim 45 wherein said intermediate representation further comprises one or more acoustic units.
  - 47. The system of claim 38 wherein said input text comprises e-mail and wherein said speech synthesis module executes upon access of said e-mail by an intended recipient thereof.
  - 48. The system of claim 38 wherein said intermediate representation of said input text comprises a representation of at least a sequence of phonemes representative of said input text.
  - 49. The system of claim 48 wherein said intermediate representation of said input text further comprises a set of corresponding time durations associated with said sequence of phonemes.
  - 50. The system of claim 48 wherein said intermediate representation of said input text further comprises a set of corresponding pitch levels associated with said sequence of phonemes.

51. A server within a client/server environment which performs a first portion of a text-to-speech conversion process, the server comprising:
- a text analysis module which analyzes input text and produces therefrom an intermediate representation thereof; and
- means for providing said intermediate representation of said input text for use by a second portion of said text-to-speech conversion process which is to be executed on a client device associated with but distinct from said server, said server not performing any synthesis of speech output.
- Dependent claims:
  - 52. The server of claim 51 wherein the means for providing comprises means for transmitting said intermediate representation of said input text across a communications channel from said server to said client device.
  - 53. The server of claim 52 wherein said communications channel comprises a wireless communications channel and wherein said client device comprises a wireless communications device.
  - 54. The server of claim 52 wherein said second portion of said text-to-speech conversion process employs a set of acoustic units, the server further comprising means for transmitting one or more of said acoustic units across said communications channel from said server to said client device for use thereby.
  - 55. The server of claim 54 wherein said one or more of said acoustic units which are to be transmitted from said server system to said client system are determined based on said input text and on a model of a cache memory of said client device which is maintained on said server.
  - 56. The server of claim 51 further comprising means for storing said intermediate representation of said input text on a storage device.
  - 57. The server of claim 56 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 58. The server of claim 57 wherein said intermediate representation further comprises one or more acoustic units.
  - 59. The server of claim 51 wherein said input text comprises e-mail and wherein said second portion of said text-to-speech conversion process is to be performed upon access of said e-mail by an intended recipient thereof.
  - 60. The server of claim 51 wherein said intermediate representation of said input text comprises a representation of at least a sequence of phonemes representative of said input text.
  - 61. The server of claim 60 wherein said intermediate representation of said input text further comprises a set of corresponding time durations associated with said sequence of phonemes.
  - 62. The server of claim 60 wherein said intermediate representation of said input text further comprises a set of corresponding pitch levels associated with said sequence of phonemes.
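
The server-side selection of acoustic units based on "a model of a cache memory of said client device" might look something like the sketch below. It is only an illustration under stated assumptions: the claims do not say how the cache model works, so the least-recently-used eviction policy, the `ClientCacheModel` class, the `units_to_transmit` function, and the unit identifiers are all hypothetical.

```python
from collections import OrderedDict
from typing import Dict, List


class ClientCacheModel:
    """Server-side mirror of the client's acoustic-unit cache.

    Assumption: the client evicts least-recently-used units, so the
    server can predict the cache contents without querying the client.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._units: "OrderedDict[str, None]" = OrderedDict()

    def touch(self, unit_id: str) -> bool:
        """Record a use of unit_id; return True if it must be transmitted."""
        if unit_id in self._units:
            self._units.move_to_end(unit_id)  # mark as most recently used
            return False
        self._units[unit_id] = None
        if len(self._units) > self.capacity:
            self._units.popitem(last=False)   # evict least recently used
        return True


def units_to_transmit(needed: List[str], model: ClientCacheModel,
                      inventory: Dict[str, bytes]) -> Dict[str, bytes]:
    """Pick the units the client is predicted to be missing."""
    to_send: Dict[str, bytes] = {}
    for unit_id in needed:
        if model.touch(unit_id):
            to_send[unit_id] = inventory[unit_id]
    return to_send


# Hypothetical unit inventory held on the server.
inventory = {"h-e": b"\x01", "e-l": b"\x02", "l-o": b"\x03"}
model = ClientCacheModel(capacity=2)
first = units_to_transmit(["h-e", "e-l"], model, inventory)
second = units_to_transmit(["e-l", "l-o"], model, inventory)
```

In this run the first request sends both units, while the second sends only `l-o`, because the model predicts that `e-l` is still cached on the client.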

63. A client device within a client/server environment which performs a second portion of a text-to-speech conversion process, the client device comprising a speech synthesis module which synthesizes speech output based upon an intermediate representation of input text, said intermediate representation of said input text having been produced by a first portion of said text-to-speech conversion process executed on a server which is associated with but distinct from said client device.
- Dependent claims:
  - 64. The client device of claim 63 further comprising means for receiving said intermediate representation of said input text across a communications channel, said intermediate representation of said input text having been transmitted from said server to said client device.
  - 65. The client device of claim 64 wherein said communications channel comprises a wireless communications channel and wherein said client device comprises a wireless communications device.
  - 66. The client device of claim 65 wherein said client device comprises a cell phone.
  - 67. The client device of claim 64 wherein said speech synthesis module produces said speech output based upon a set of acoustic units, one or more of said acoustic units having been stored in a cache memory within said client device, the client device further comprising means for receiving one or more of said acoustic units which have been transmitted across said communications channel from said server to said client device and means for storing said one or more acoustic units in said cache memory.
  - 68. The client device of claim 63 wherein said intermediate representation of said input text has been stored on a storage device, and wherein said speech synthesis module retrieves said intermediate representation of said input text from said storage device.
  - 69. The client device of claim 68 wherein said intermediate representation of said input text comprises at least a representation of a sequence of phonemes representative of said input text.
  - 70. The client device of claim 69 wherein said intermediate representation further comprises one or more acoustic units.
  - 71. The client device of claim 63 wherein said input text comprises e-mail and wherein said speech synthesis module is executed upon access of said e-mail by an intended recipient thereof.
  - 72. The client device of claim 63 wherein said intermediate representation of said input text comprises a representation of at least a sequence of phonemes representative of said input text.
  - 73. The client device of claim 72 wherein said intermediate representation of said input text further comprises a set of corresponding time durations associated with said sequence of phonemes.
  - 74. The client device of claim 72 wherein said intermediate representation of said input text further comprises a set of corresponding pitch levels associated with said sequence of phonemes.
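
On the client side, receiving acoustic units, storing them in a cache, and synthesizing from the cached units could be sketched as below. This is a hedged illustration only: the `ClientSynthesizer` class, its method names, and the unit identifiers are invented here, and plain byte concatenation stands in for whatever synthesis method the client actually uses.

```python
from typing import Dict, Iterable


class ClientSynthesizer:
    """Toy client that keeps received acoustic units in a local cache."""

    def __init__(self) -> None:
        self.cache: Dict[str, bytes] = {}

    def receive_units(self, units: Dict[str, bytes]) -> None:
        """Store acoustic units transmitted by the server."""
        self.cache.update(units)

    def synthesize(self, unit_ids: Iterable[str]) -> bytes:
        """Build output audio from cached units.

        Assumption: units are raw audio bytes that can simply be joined;
        a real synthesizer would also smooth the unit boundaries.
        """
        ids = list(unit_ids)
        missing = [u for u in ids if u not in self.cache]
        if missing:
            raise KeyError(f"acoustic units not cached: {missing}")
        return b"".join(self.cache[u] for u in ids)


client = ClientSynthesizer()
client.receive_units({"h-e": b"\x01\x02", "e-l": b"\x03"})
audio = client.synthesize(["h-e", "e-l"])
```

Raising an error on a missing unit is a design choice for the sketch; a real client might instead request the unit from the server or fall back to a lower-quality built-in unit.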

Current Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Inventors: Olive, Joseph Philip; Kochanski, Gregory P.; Shih, Chi-Lin

Granted Patent: US 6,625,576 B2
US Class Current: 704/260
CPC Class Codes:
- G10L 13/047 Architecture of speech synt...
- G10L 13/08 Text analysis or generation...
