Client/server architecture for text-to-speech synthesis

US 6,810,379 B1
Filed: 04/24/2001
Issued: 10/26/2004
Est. Priority Date: 04/24/2000
Status: Expired due to Fees

Abstract

A client/server text-to-speech synthesis system and method divides the method optimally between client and server. The server stores large databases for pronunciation analysis, prosody generation, and acoustic unit selection corresponding to a normalized text, while the client performs computationally intensive decompression and concatenation of selected acoustic units to generate speech. The units are transmitted from the server to the client in a highly compressed format, with a compression method selected based on the predetermined set of potential acoustic units. This compression method allows for very high-quality and natural-sounding speech to be output at the client machine.
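
The division of work described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the unit names, the placeholder sample data, the `length` field used as prosody data, and the use of `zlib` as a stand-in for the tuned compression method. The patent does not specify any of these.

```python
import zlib

# Hypothetical unit inventory: unit name -> raw samples (placeholder bytes).
RAW_UNITS = {
    "h-e": bytes([1, 2, 3, 4] * 50),
    "e-l": bytes([5, 6, 7, 8] * 50),
    "l-o": bytes([9, 10, 11, 12] * 50),
}

# Offline step: compress every unit once and keep it in a server-side database.
# zlib is only a stand-in for a compression method tuned to the unit set.
UNIT_DB = {name: zlib.compress(raw) for name, raw in RAW_UNITS.items()}


def server_synthesize(normalized_text):
    """Server side: select compressed units and generate prosody data."""
    # Toy unit selection: one unit per adjacent character pair.
    names = [f"{a}-{b}" for a, b in zip(normalized_text, normalized_text[1:])]
    payload = [UNIT_DB[name] for name in names]
    # Toy prosody: a target length (in samples) for each unit.
    prosody = [{"unit": name, "length": 100} for name in names]
    return prosody, payload


def client_render(prosody, payload):
    """Client side: decompress each unit and concatenate per the prosody data."""
    out = bytearray()
    for mark, blob in zip(prosody, payload):
        samples = zlib.decompress(blob)
        out += samples[: mark["length"]]
    return bytes(out)


prosody, payload = server_synthesize("helo")
audio = client_render(prosody, payload)
```

Only `prosody` and `payload` cross the network in this sketch; the unit database stays on the server and the decompression and concatenation work stays on the client.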

32 Claims

1. In a computer system comprising a server machine and a client machine, a text-to-speech synthesis method comprising:
- describing a finite number of possible acoustic units;
  
  optimizing a compression method selected in dependence of said finite number of possible acoustic units, wherein said optimizing step further comprises selecting parameters of said compression method utilizing a directed optimized search to minimize the amount of data transmitted between said server machine and said client machine;
  
  compressing said finite number of possible acoustic units via said optimized compression method;
  
  storing said finite number of possible acoustic units as compressed acoustic units in an acoustic unit database accessible to said server machine;
  
  in said server machine, obtaining a normalized text and generating prosody data thereof;
  
  selecting from said acoustic unit database compressed acoustic units that correspond to said normalized text;
  
  transmitting said prosody data and said selected compressed acoustic units from said server machine to said client machine; and
  
  in said client machine, decompressing said transmitted acoustic units and concatenating said decompressed acoustic units in accordance with said prosody data.
- 2. The method of claim 1, wherein
- 3. The method of claim 1, further comprising:
  - caching a number of frequently used uncompressed acoustic units in a cache memory of said client machine; and
    
    concatenating said decompressed acoustic units with at least one of said uncompressed acoustic units.
- 4. The method of claim 1, further comprising normalizing a standard text to obtain said normalized text.
- 5. The method of claim 1, further comprising:
  - sending a standard text to said server machine;
    
    in said server machine, normalizing said standard text to obtain said normalized text.
- 6. The method of claim 1, wherein said optimized search is directed by an acoustic metric that measures quality.
- 7. The method of claim 1, wherein said describing step further comprises:
  - dividing each of said possible acoustic units into sequences of chunks of equal duration; and
    
    describing frequency composition of each chunk with a set of parameters.
- 8. A text-to-speech synthesis system programmed to perform the method of claim 1, said text-to-speech synthesis system comprising:
  - said acoustic unit database;
    
    said server machine in communication with said acoustic unit database; and
    
    said client machine in communication with said server machine.
- 9. A computer-readable program storage device tangibly embodying a computer-executable program implementing the text-to-speech synthesis method of claim 1.
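
The optimizing step of claim 1, with the quality-directed search of claim 6, might look roughly like the sketch below. It is an illustration under stated assumptions, not the patented procedure: the only compression parameter is a hypothetical bit depth for uniform quantization, the acoustic metric is a plain signal-to-noise ratio, and the search is a greedy descent. The function names are invented for this example.

```python
import math


def quantize(samples, bits):
    """Uniformly quantize samples in [-1, 1] onto 2**bits - 1 steps."""
    levels = 2 ** bits - 1
    return [round((s + 1) / 2 * levels) / levels * 2 - 1 for s in samples]


def snr_db(original, coded):
    """Signal-to-noise ratio in dB, used here as a stand-in quality metric."""
    signal = sum(s * s for s in original)
    noise = sum((s - c) ** 2 for s, c in zip(original, coded))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)


def search_bits(units, min_snr_db, max_bits=16):
    """Greedy descent: lower the bit depth while every unit meets the quality floor."""
    bits = max_bits
    while bits > 1:
        trial = bits - 1
        if all(snr_db(u, quantize(u, trial)) >= min_snr_db for u in units):
            bits = trial  # fewer bits per sample means less data to transmit
        else:
            break
    return bits


# Hypothetical finite unit set: two short sine tones sampled at 8 kHz.
units = [
    [math.sin(2 * math.pi * freq * n / 8000) for n in range(400)]
    for freq in (200, 440)
]
best_bits = search_bits(units, min_snr_db=30.0)
```

Because the unit set is finite and known in advance, the search can evaluate every unit at each trial setting, which is what lets the parameters be fixed before any text is synthesized.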

10. In a computer system comprising a server machine and a client machine, a text-to-speech synthesis method comprising:
- in said server machine, obtaining a normalized text;
  
  selecting compressed acoustic units corresponding to said normalized text from a database storing a predetermined number of possible acoustic units that have been optimally compressed;
  
  transmitting said selected compressed acoustic units to said client machine;
  
  generating prosody data corresponding to said normalized text and transmitting said prosody data to said client machine;
  
  in said client machine, decompressing said transmitted acoustic units; and
  
  concatenating said decompressed acoustic units.
- 11. The method of claim 10, further comprising normalizing a standard text to obtain said normalized text.
- 12. The method of claim 10, wherein said decompressing step and said concatenating step begin before all of said selected compressed acoustic units are received in said client machine.
- 13. The method of claim 10, further comprising:
- 14. The method of claim 10, further comprising:
  - caching a number of frequently used uncompressed acoustic units in a cache memory of said client machine; and
    
    concatenating said decompressed acoustic units with at least one of said uncompressed acoustic units.
- 15. A text-to-speech synthesis system programmed to perform the method of claim 10, said text-to-speech synthesis system comprising:
  - said acoustic unit database;
    
    said server machine;
    
    said client machine; and
    
    means for enabling data transmission and communication among said acoustic unit database, said server machine, and said client machine.
- 16. A computer-readable medium storing a computer-executable program implementing the text-to-speech synthesis method of claim 10.
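
Claim 12 has decompression and concatenation start before every unit has arrived. One way to picture that overlap is with a Python generator standing in for the network stream, as in the sketch below. The event log, the function names, and the use of `zlib` are assumptions for illustration only.

```python
import zlib


def server_stream(names, unit_db, log):
    """Yield compressed units one at a time, as a network stream might deliver them."""
    for name in names:
        log.append(("sent", name))
        yield name, unit_db[name]


def client_play(stream, log):
    """Decompress and append each unit as it arrives, without waiting for the rest."""
    audio = bytearray()
    for name, blob in stream:
        audio += zlib.decompress(blob)
        log.append(("played", name))
    return bytes(audio)


# Hypothetical two-unit database with placeholder sample data.
unit_db = {"a": zlib.compress(b"AAAA"), "b": zlib.compress(b"BBBB")}
log = []
audio = client_play(server_stream(["a", "b"], unit_db, log), log)
```

In the resulting `log`, the first unit is played before the second is sent, which is the interleaving the claim describes.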

17. In a client machine, a text-to-speech synthesis method comprising:
- a) receiving compressed acoustic units corresponding to a normalized text from a server machine, said compressed acoustic units being selected from a predetermined number of possible acoustic units and compressed using a compression method selected in dependence on said predetermined number of possible acoustic units;
  
  b) decompressing said compressed acoustic units to obtain decompressed acoustic units;
  
  c) receiving prosody data corresponding to said normalized text from said server machine; and
  
  d) concatenating said decompressed acoustic units in dependence of said prosody data.
- 18. The method of claim 17 wherein step (d) further comprises concatenating said decompressed acoustic units with at least one cached acoustic unit.
- 19. The method of claim 17 further comprising, before step (a), transmitting a standard text corresponding to said normalized text to said server machine.
- 20. The method of claim 17 further comprising, before step (a), normalizing a standard text to obtain a normalized text, and transmitting said normalized text to said server machine.
- 21. A computer-readable medium storing a computer-executable program implementing the text-to-speech synthesis method of claim 20.
- 22. The method of claim 17, further comprising:
- 23. The method of claim 22, further comprising:
  - utilizing an optimized search directed by an acoustic metric that measures said minimum acoustic quality.
- 24. A computer-readable medium storing a computer-executable program implementing the text-to-speech synthesis method of claim 23.
- 25. A computer-readable medium storing a computer-executable program implementing the text-to-speech synthesis method of claim 22.
- 26. The method of claim 17 wherein steps (b), (c), and (d) occur before step (a) is completed.
- 27. A text-to-speech synthesis system programmed to perform the method of claim 17, said text-to-speech synthesis system comprising:
  - an acoustic unit database for storing said predetermined number of possible acoustic units;
    
    said server machine in communication with said acoustic unit database;
    
    said client machine in communication with said server machine; and
    
    means for enabling data transmission and communication among said acoustic unit database, said server machine, and said client machine.
- 28. The system of claim 27, wherein said client machine further comprises:
  - means for normalizing a standard text to obtain said normalized text; and
    
    means for transmitting said normalized text to said server machine.
- 29. The system of claim 27, wherein said client machine further comprises:
  - means for receiving said compressed acoustic units;
    
    means for decompressing said compressed acoustic units; and
    
    means for concatenating said decompressed acoustic units.
- 30. The system of claim 27, wherein said client machine further comprises:
  - a cache memory for caching at least one uncompressed acoustic unit.
- 31. The system of claim 27, wherein said server machine further comprises:
  - means for normalizing a standard text to obtain said normalized text, wherein said standard text is received from said client machine or a different source, or is generated by said server machine.
- 32. A computer-readable medium storing a computer-executable program implementing the text-to-speech synthesis method of claim 17.
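
Several claims (3, 14, 18, 30) describe a client-side cache of uncompressed acoustic units that are concatenated alongside freshly decompressed ones. A rough sketch of that idea follows. The `UnitCache` class, its least-recently-used eviction (a simple stand-in for "frequently used"), and the `zlib` compression are all assumptions; the claims do not say how the cache is filled or evicted.

```python
import zlib
from collections import OrderedDict


class UnitCache:
    """Small client-side cache of uncompressed units (hypothetical LRU policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, name):
        if name in self._items:
            self._items.move_to_end(name)  # mark as most recently used
            return self._items[name]
        return None

    def put(self, name, samples):
        self._items[name] = samples
        self._items.move_to_end(name)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used unit


def concatenate(names, received, cache):
    """Join units in order, using cached copies where available."""
    out = bytearray()
    for name in names:
        samples = cache.get(name)
        if samples is None:
            samples = zlib.decompress(received[name])
            cache.put(name, samples)
        out += samples
    return bytes(out)


cache = UnitCache(capacity=2)
received = {"a": zlib.compress(b"AAAA"), "b": zlib.compress(b"BB")}
first = concatenate(["a", "b", "a"], received, cache)
second = concatenate(["a"], {}, cache)  # served from the cache alone
```

The second call succeeds with nothing received, since unit "a" is already held uncompressed on the client.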

Current Assignee: Sensory Incorporated
Original Assignee: Sensory Incorporated
Inventors: Vermeulen, Pieter; Mozer, Todd F.
Primary Examiner(s): Chawan, Vijay
Assistant Examiner(s): Nolan, Daniel A.
Application Number: US09/842,358
Time in Patent Office: 1,281 Days
Field of Search: 704/260, 704/200, 704/258, 704/246, 704/270.1, 704/268, 704/261, 704/264, 704/270, 340/514
US Class Current: 704/260
CPC Class Codes: G10L 13/07 Concatenation rules
