Client/server architecture for text-to-speech synthesis
Abstract
A client/server text-to-speech synthesis system and method divides the method optimally between client and server. The server stores large databases for pronunciation analysis, prosody generation, and acoustic unit selection corresponding to a normalized text, while the client performs computationally intensive decompression and concatenation of selected acoustic units to generate speech. The units are transmitted from the server to the client in a highly compressed format, with a compression method selected based on the predetermined set of potential acoustic units. This compression method allows for very high-quality and natural-sounding speech to be output at the client machine.
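The division of work summarized in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the unit names, the toy sample values, the per-unit gain used as prosody data, and the use of zlib as a stand-in for the optimized compression method are all assumptions made for illustration.

```python
import zlib
from array import array

# Hypothetical unit inventory: unit name -> 16-bit PCM samples (toy values).
UNITS = {
    "h-e": array("h", [0, 120, 240, 120]),
    "e-l": array("h", [0, -80, -160, -80]),
    "l-o": array("h", [0, 60, 120, 60]),
}

# Server side: units are compressed once, ahead of time, and stored.
# zlib is only a stand-in for the optimized compression method.
UNIT_DB = {name: zlib.compress(pcm.tobytes()) for name, pcm in UNITS.items()}


def server_synthesize(unit_names):
    """Select stored compressed units and produce toy prosody data."""
    selected = [UNIT_DB[name] for name in unit_names]
    # Assumed prosody format: one gain factor per selected unit.
    prosody = [1.0 for _ in unit_names]
    return prosody, selected


def client_render(prosody, compressed_units):
    """Decompress each unit and concatenate according to the prosody data."""
    out = array("h")
    for gain, blob in zip(prosody, compressed_units):
        pcm = array("h")
        pcm.frombytes(zlib.decompress(blob))
        out.extend(int(sample * gain) for sample in pcm)
    return out


# Only the prosody data and the compressed units cross the network.
prosody, payload = server_synthesize(["h-e", "e-l", "l-o"])
speech = client_render(prosody, payload)
```

In this sketch the server never decompresses anything at request time; it only looks up units that were compressed in advance, which mirrors the storage and selection steps of the claims.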
32 Claims
-
1. In a computer system comprising a server machine and a client machine, a text-to-speech synthesis method comprising:
-
describing a finite number of possible acoustic units;
optimizing a compression method selected in dependence of said finite number of possible acoustic units, wherein said optimizing step further comprises selecting parameters of said compression method utilizing a directed optimized search to minimize the amount of data transmitted between said server machine and said client machine;
compressing said finite number of possible acoustic units via said optimized compression method;
storing said finite number of possible acoustic units as compressed acoustic units in an acoustic unit database accessible to said server machine;
in said server machine, obtaining a normalized text and generating prosody data thereof;
selecting from said acoustic unit database compressed acoustic units that correspond to said normalized text;
transmitting said prosody data and said selected compressed acoustic units from said server machine to said client machine; and
in said client machine, decompressing said transmitted acoustic units and concatenating said decompressed acoustic units in accordance with said prosody data.
-
2. The method of claim 1, wherein said decompressing step and said concatenating step begin before all of said selected compressed acoustic units and said prosody data are received in said client machine.
-
3. The method of claim 1, further comprising:
-
caching a number of frequently used uncompressed acoustic units in a cache memory of said client machine; and
concatenating said decompressed acoustic units with at least one of said uncompressed acoustic units.
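The client-side caching in this claim might look like the following sketch. The class name `ClientUnitCache`, the unit identifiers, and the use of zlib in place of the optimized compression method are hypothetical choices for illustration only.

```python
import zlib


class ClientUnitCache:
    """Keeps a fixed set of frequently used units in uncompressed form."""

    def __init__(self, frequent_units):
        # frequent_units: dict of unit id -> raw (uncompressed) audio bytes.
        self._cache = dict(frequent_units)

    def resolve(self, unit_id, compressed=None):
        """Return raw audio for a unit, preferring the local cache."""
        if unit_id in self._cache:
            return self._cache[unit_id]
        if compressed is None:
            raise KeyError(f"unit {unit_id!r} not cached and not transmitted")
        return zlib.decompress(compressed)


def concatenate(cache, sequence):
    """sequence: list of (unit_id, compressed_bytes_or_None) in playback order."""
    return b"".join(cache.resolve(uid, blob) for uid, blob in sequence)


# The cached unit "the" needs no transmission; "cat" arrives compressed.
cache = ClientUnitCache({"the": b"\x01\x02"})
audio = concatenate(cache, [("the", None), ("cat", zlib.compress(b"\x03\x04"))])
```

Cached units are joined with freshly decompressed ones in a single pass, so a cached unit costs neither bandwidth nor decompression time.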
-
-
4. The method of claim 1, further comprising normalizing a standard text to obtain said normalized text.
-
5. The method of claim 1, further comprising:
-
sending a standard text to said server machine;
in said server machine, normalizing said standard text to obtain said normalized text.
-
-
6. The method of claim 1, wherein said optimized search is directed by an acoustic metric that measures quality.
-
7. The method of claim 1, wherein said describing step further comprises:
-
dividing each of said possible acoustic units into sequences of chunks of equal duration; and
describing frequency composition of each chunk with a set of parameters.
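One possible reading of this describing step is sketched below. It assumes a fixed sample rate, so chunks of equal duration are chunks of equal sample count, and it uses the magnitudes of the first few discrete Fourier transform bins as the parameter set. The claim does not name a particular transform, and the function names are hypothetical.

```python
import cmath
import math


def split_into_chunks(samples, chunk_len):
    """Split samples into consecutive chunks of equal length (zero-padded)."""
    padded = list(samples) + [0.0] * (-len(samples) % chunk_len)
    return [padded[i:i + chunk_len] for i in range(0, len(padded), chunk_len)]


def describe_chunk(chunk, n_params):
    """Describe a chunk by the magnitudes of its first n_params DFT bins."""
    n = len(chunk)
    params = []
    for k in range(n_params):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(chunk))
        params.append(abs(coeff) / n)
    return params


# Toy unit: 16 samples of a sinusoid with 2 cycles per 8-sample chunk.
unit = [math.sin(2 * math.pi * 2 * t / 8) for t in range(16)]
chunks = split_into_chunks(unit, chunk_len=8)
description = [describe_chunk(c, n_params=4) for c in chunks]
```

For the toy sinusoid, almost all of the energy in each chunk lands in bin 2, so a handful of parameters per chunk is enough to describe it.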
-
-
8. A text-to-speech synthesis system programmed to perform the method of claim 1, said text-to-speech synthesis system comprising:
-
said acoustic unit database;
said server machine in communication with said acoustic unit database; and
said client machine in communication with said server machine.
-
-
9. A computer-readable program storage device tangibly embodying a computer-executable program implementing the text-to-speech synthesis method of claim 1.
-
-
10. In a computer system comprising a server machine and a client machine, a text-to-speech synthesis method comprising:
-
in said server machine, obtaining a normalized text;
selecting compressed acoustic units corresponding to said normalized text from a database storing a predetermined number of possible acoustic units that have been optimally compressed;
transmitting said selected compressed acoustic units to said client machine;
generating prosody data corresponding to said normalized text and transmitting said prosody data to said client machine;
in said client machine, decompressing said transmitted acoustic units; and
concatenating said decompressed acoustic units.
-
determining a compression method in dependence of said predetermined number of possible acoustic units; and
selecting parameters of said compression method utilizing an optimized search directed by an acoustic metric that measures quality to minimize the amount of data transmitted to said client machine while maintaining a minimum acoustic quality for each of said possible acoustic units.
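The parameter selection described above can be sketched as a search for the cheapest setting that still meets a quality floor. Everything specific here is an assumption: the single tunable parameter (quantization bit depth), the signal-to-noise ratio standing in for the acoustic metric, and the binary search standing in for the directed optimized search.

```python
import math


def quantize(samples, bits):
    """Uniformly quantize samples in [-1, 1] to the given bit depth."""
    levels = 2 ** (bits - 1)
    return [round(s * levels) / levels for s in samples]


def snr_db(original, decoded):
    """Signal-to-noise ratio in dB; a stand-in for an acoustic quality metric."""
    signal = sum(s * s for s in original)
    noise = sum((s - d) ** 2 for s, d in zip(original, decoded))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)


def smallest_bit_depth(samples, min_snr_db, lo=2, hi=16):
    """Binary search for a low bit depth that still meets the quality floor.

    Assumes quality does not get worse as the bit depth increases.
    """
    if snr_db(samples, quantize(samples, hi)) < min_snr_db:
        raise ValueError("quality floor not reachable within bit range")
    while lo < hi:
        mid = (lo + hi) // 2
        if snr_db(samples, quantize(samples, mid)) >= min_snr_db:
            hi = mid  # quality is sufficient, try fewer bits
        else:
            lo = mid + 1  # quality too low, need more bits
    return lo


# Run once per possible unit, offline, before the database is populated.
unit = [math.sin(2 * math.pi * t / 32) for t in range(64)]
bits = smallest_bit_depth(unit, min_snr_db=30.0)
```

Because the set of possible units is finite and known in advance, a search like this can be run over every unit before storage, so the quality floor holds for each unit individually.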
-
-
14. The method of claim 10, further comprising:
-
caching a number of frequently used uncompressed acoustic units in a cache memory of said client machine; and
concatenating said decompressed acoustic units with at least one of said uncompressed acoustic units.
-
-
15. A text-to-speech synthesis system programmed to perform the method of claim 10, said text-to-speech synthesis system comprising:
-
said acoustic unit database;
said server machine;
said client machine; and
means for enabling data transmission and communication among said acoustic unit database, said server machine, and said client machine.
-
-
16. A computer-readable medium storing a computer-executable program implementing the text-to-speech synthesis method of claim 10.
-
17. In a client machine, a text-to-speech synthesis method comprising:
-
a) receiving compressed acoustic units corresponding to a normalized text from a server machine, said compressed acoustic units being selected from a predetermined number of possible acoustic units and compressed using a compression method selected in dependence on said predetermined number of possible acoustic units;
b) decompressing said compressed acoustic units to obtain decompressed acoustic units;
c) receiving prosody data corresponding to said normalized text from said server machine; and
d) concatenating said decompressed acoustic units in dependence of said prosody data.
-
selecting parameters of said compression method to minimize the amount of data transmitted to said client machine while maintaining a minimum acoustic quality for each of said possible acoustic units.
-
-
23. The method of claim 22, further comprising:
utilizing an optimized search directed by an acoustic metric that measures said minimum acoustic quality.
-
24. A computer-readable medium storing a computer-executable program implementing the text-to-speech synthesis method of claim 23.
-
25. A computer-readable medium storing a computer-executable program implementing the text-to-speech synthesis method of claim 22.
-
26. The method of claim 17 wherein steps (b), (c), and (d) occur before step (a) is completed.
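The overlap of receiving with decompressing and concatenating can be illustrated with Python generators. This is a sketch under stated assumptions: the in-memory packet list stands in for a network stream, zlib stands in for the optimized compression method, and the prosody data is reduced to a hypothetical count of silent bytes appended after each unit.

```python
import zlib

events = []  # records the order of operations for inspection


def receive_units(packets):
    """Stand-in for a network stream: yields compressed units as they arrive."""
    for i, packet in enumerate(packets):
        events.append(f"recv {i}")
        yield packet


def stream_synthesis(prosody, incoming):
    """Decompress and emit each unit as soon as it has been received."""
    for i, (pause, blob) in enumerate(zip(prosody, incoming)):
        raw = zlib.decompress(blob)
        events.append(f"play {i}")
        # Assumed prosody format: number of silent bytes after the unit.
        yield raw + b"\x00" * pause


packets = [zlib.compress(b"ab"), zlib.compress(b"cd")]
audio = b"".join(stream_synthesis([1, 0], receive_units(packets)))
```

The recorded events alternate between receiving and playing, which shows that output for the first unit is produced before the second unit has been received.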
-
27. A text-to-speech synthesis system programmed to perform the method of claim 17, said text-to-speech synthesis system comprising:
-
an acoustic unit database for storing said predetermined number of possible acoustic units;
said server machine in communication with said acoustic unit database;
said client machine in communication with said server machine; and
means for enabling data transmission and communication among said acoustic unit database, said server machine, and said client machine.
-
-
28. The system of claim 27, wherein said client machine further comprises:
-
means for normalizing a standard text to obtain said normalized text; and
means for transmitting said normalized text to said server machine.
-
-
29. The system of claim 27, wherein said client machine further comprises:
-
means for receiving said compressed acoustic units;
means for decompressing said compressed acoustic units; and
means for concatenating said decompressed acoustic units.
-
-
30. The system of claim 27, wherein said client machine further comprises:
a cache memory for caching at least one uncompressed acoustic unit.
-
31. The system of claim 27, wherein said server machine further comprises:
means for normalizing a standard text to obtain said normalized text, wherein said standard text is received from said client machine or a different source, or is generated by said server machine.
-
32. A computer-readable medium storing a computer-executable program implementing the text-to-speech synthesis method of claim 17.
Specification