Method and system for text-to-speech caching

US 7,043,432 B2
Filed: 08/29/2001
Issued: 05/09/2006
Est. Priority Date: 08/29/2001
Status: Expired due to Term

Abstract

In a text-to-speech system, a method of converting text-to-speech can include receiving a text input and comparing the received text input to at least one entry in a text-to-speech cache memory. Each entry in the text-to-speech cache memory can specify a corresponding spoken output. If the text input matches one of the entries in the text-to-speech cache memory, the cached speech output specified by the matching entry can be provided.
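The lookup described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `normalize` helper, the `SimpleTtsCache` class, and the `fake_engine` stand-in for a real text-to-speech engine are all hypothetical names, and storing the result on a miss follows the claims rather than the abstract itself.

```python
from typing import Callable, Dict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed stand-in for real TTS normalization)."""
    return " ".join(text.lower().split())


class SimpleTtsCache:
    """Maps normalized text to previously generated spoken output."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize          # hypothetical engine callback
        self._entries: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def speak(self, text: str) -> bytes:
        key = normalize(text)
        cached = self._entries.get(key)
        if cached is not None:
            self.hits += 1                     # match: return the cached output
            return cached
        self.misses += 1                       # no match: run the engine and store the result
        audio = self._synthesize(key)
        self._entries[key] = audio
        return audio


def fake_engine(text: str) -> bytes:
    """Placeholder engine that returns the text bytes instead of audio."""
    return text.encode("utf-8")


cache = SimpleTtsCache(fake_engine)
first = cache.speak("Hello   World")   # miss: the engine is called
second = cache.speak("hello world")    # hit: same normalized key
```

Because both inputs normalize to the same key, the second call is served from the cache without invoking the engine.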

29 Claims

1. In a text-to-speech system, a method of converting text-to-speech comprising:
- receiving a text input and a plurality of attributes associated with said text input, wherein said attributes specify stress, gender, grammar, speed, and volume for an audio rendering of said text input;
  
  generating processed input by parsing and normalizing said text input;
  
  comparing said processed input to at least one entry in a text-to-speech cache memory, wherein said entry in said text-to-speech cache memory specifies a corresponding spoken output, wherein said text-to-speech cache memory contains a plurality of entries that specify spoken outputs, attributes for rendering spoken output, and callback information, and wherein each spoken output has an assigned score;
  
  if said processed input matches one of said entries in said text-to-speech cache memory, providing said spoken output specified by said matching entry and rendering said spoken output according to said plurality of attributes associated with said text input;
  
  if said processed input fails to match one of said entries, generating an additional spoken output with a text-to-speech engine, generating an entry that specifies said additional spoken output, assigning a score to said additional spoken output, storing said additional spoken output and assigned score in said cache memory, and rendering said spoken output with the text-to-speech engine according to said plurality of attributes associated with said text input, wherein each assigned score is an updatable score computed by multiplying a previous score times a constant between zero and one and adding a number equal to the number of times a corresponding entry has been accessed since a last updating of the score;
  
  if the cache memory is full when said additional spoken output is generated, deleting from said cache memory a spoken output having a lower score; and
  
  generating a display of said text input wherein each word of said display is successively highlighted in coordination with an audible rendering of a word of corresponding spoken output, coordination of said display and spoken output being based on call information stored in said cache memory.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, wherein said text-to-speech cache entries include an intermediate output which is not a digitally encoded audio file; and wherein said text-to-speech engine converts said intermediate output to said spoken output.
  - 3. The method of claim 1, wherein said text-to-speech cache is shared across multiple text-to-speech processes, wherein said text-to-speech processes are performed by a plurality of different text-to-speech engines, each engine utilizing said text-to-speech cache.
  - 4. The method of claim 1, further comprising logging each said match of said text input with a text-to-speech cache entry.
  - 5. The method of claim 1, further comprising periodically updating each said score.
  - 6. The method of claim 1, further comprising comparing said attributes of said received text input with attributes of said entries in said text-to-speech cache memory.
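The scoring and eviction steps of claim 1 might look like the sketch below. It rests on several assumptions that the claim does not fix: the decay constant `DECAY = 0.5`, a starting score of zero for new entries, and eviction of the single lowest-scoring entry (the claim only says "a lower score"). The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

DECAY = 0.5  # assumed value; the claim only requires a constant between zero and one


@dataclass
class Entry:
    audio: bytes
    score: float = 0.0               # assumed initial score for a new entry
    accesses_since_update: int = 0


class ScoredTtsCache:
    def __init__(self, capacity: int, synthesize: Callable[[str], bytes]) -> None:
        self.capacity = capacity
        self._synthesize = synthesize  # hypothetical engine callback
        self.entries: Dict[str, Entry] = {}

    def update_scores(self) -> None:
        """new score = previous score * DECAY + accesses since the last update."""
        for entry in self.entries.values():
            entry.score = entry.score * DECAY + entry.accesses_since_update
            entry.accesses_since_update = 0

    def speak(self, processed_text: str) -> bytes:
        entry = self.entries.get(processed_text)
        if entry is None:
            if len(self.entries) >= self.capacity:
                # Cache full: drop the entry with the lowest score (assumed policy).
                victim = min(self.entries, key=lambda k: self.entries[k].score)
                del self.entries[victim]
            entry = Entry(audio=self._synthesize(processed_text))
            self.entries[processed_text] = entry
        entry.accesses_since_update += 1
        return entry.audio


cache = ScoredTtsCache(capacity=2, synthesize=lambda text: text.encode("utf-8"))
for phrase in ["welcome back", "welcome back", "goodbye"]:
    cache.speak(phrase)
cache.update_scores()        # "welcome back" scores 2.0, "goodbye" scores 1.0
cache.speak("please hold")   # cache is full, so "goodbye" is evicted
remaining = sorted(cache.entries)
```

Keeping a per-entry access counter and folding it into the score only at update time matches the claim's wording that the added term counts accesses "since a last updating of the score".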

7. A method of converting text-to-speech using a text-to-speech cache memory having a plurality of entries, wherein said entries comprise a processed form specifying a spoken output, wherein said processed form specifying spoken output does not comprise a digitally encoded audio file, said method comprising:
- receiving a text input and a plurality of attributes associated with said text input, wherein said attributes specify stress, gender, grammar, speed, and volume for an audio rendering of said text input;
  
  processing said text input to determine a form specifying a spoken output for said received text;
  
  comparing said determined form of said text input with said entries in said text-to-speech cache memory;
  
  assigning a score to each of said entries, wherein each score is an updatable score computed by multiplying a previous score times a constant between zero and one and adding a number equal to the number of times a corresponding entry has been accessed since a last updating of the score;
  
  if said text input matches one of said entries in said text-to-speech cache memory, providing said processed form specified by said matching entry to a text-to-speech engine;
  
  said text-to-speech engine converting said processed form to said spoken output and rendering said spoken output according to said plurality of attributes associated with said text input; and
  
  generating a display of said text input wherein each word of said display is successively highlighted in coordination with an audible rendering of a word of said spoken output, coordination of said display and spoken output being based on call information stored in said cache memory.
- Dependent claims (8, 9, 10):
  - 8. The method of claim 7, wherein the determined form of said text input comprises at least one of normalized text that represents a standardized version of the text input and an intermediate format used by the text-to-speech engine.
  - 9. The method of claim 7, wherein said text-to-speech cache is shared across multiple text-to-speech processes, wherein said text-to-speech processes are performed by a plurality of different text-to-speech engines, each engine utilizing said text-to-speech cache.
  - 10. The method of claim 7, further comprising logging each said match of said text input with a text-to-speech cache entry.
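The word-by-word highlighting step in claim 7 could be driven by timing data kept alongside each cache entry. The claims do not say what that stored information looks like, so the sketch below assumes it is a sorted list of word start offsets in milliseconds; `word_at`, `render`, and the sample timings are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def word_at(start_times_ms: Sequence[int], elapsed_ms: int) -> Optional[int]:
    """Return the index of the word being spoken at elapsed_ms, or None before the first word."""
    index = bisect_right(start_times_ms, elapsed_ms) - 1
    return index if index >= 0 else None


def render(words: List[str], highlighted: Optional[int]) -> str:
    """Mark the highlighted word with brackets; a real display would restyle it instead."""
    return " ".join(f"[{w}]" if i == highlighted else w for i, w in enumerate(words))


words = ["good", "morning", "everyone"]
start_times_ms = [0, 400, 950]  # assumed per-word offsets stored with the cache entry
frame = render(words, word_at(start_times_ms, 500))
```

Storing the offsets with the entry means a cache hit can replay both the audio and the highlighting without consulting the engine again.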

11. A method of converting text-to-speech comprising:
- storing a plurality of entries in a text-to-speech cache memory, wherein the text-to-speech cache memory is directly and locally coupled to at least one text-to-speech engine, wherein each said entry comprises a processed form specifying a spoken output, and wherein said text-to-speech cache memory contains a plurality of entries that specify spoken outputs, attributes for rendering spoken output, and callback information;
  
  assigning a score to each one of said plurality of entries;
  
  receiving a text input;
  
  processing said text input to determine a form specifying a spoken output for said received text;
  
  comparing said determined form of said text input with said entries in said text-to-speech cache memory;
  
  when at least one of the plurality of entries in said text-to-speech cache memory is matched to said determined form, retrieving the processed form for the matching entry from the text-to-speech cache memory, and using the processed form to generate said spoken output based on said attributes;
  
  when at least one of the plurality of entries in said text-to-speech cache memory is not matched to said determined form, using the at least one text-to-speech engine to generate said spoken output;
  
  logging when one of said plurality of entries in said text-to-speech cache memory is matched to said received text input;
  
  generating a display of said text input wherein each word of said display is successively highlighted in coordination with an audible rendering of a word of said spoken output, coordination of said display and spoken output being based on call information stored in said cache memory; and
  
  periodically updating said score for each one of said plurality of entries of said text-to-speech cache memory, wherein an updated score is computed by multiplying a previous score times a constant between zero and one and adding a number equal to the number of times a corresponding entry has been accessed since a last updating of the score.
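
The periodic score update recited above can be written as a recurrence. The symbols below are notation introduced here for illustration, not taken from the patent: $s_t$ is an entry's score after update $t$, $c$ is the constant between zero and one, and $n_t$ is the number of accesses since the previous update. The numeric values are assumed examples.

```latex
s_{t+1} = c \, s_t + n_t, \qquad 0 < c < 1

% Worked example with assumed values c = 0.5, s_t = 4, n_t = 3:
s_{t+1} = 0.5 \times 4 + 3 = 5

% With no further accesses the score decays geometrically:
5 \;\to\; 2.5 \;\to\; 1.25 \;\to\; \dots

% With a steady n accesses per update period the score converges to
s^{*} = \frac{n}{1 - c}, \qquad \text{e.g. } n = 3,\; c = 0.5 \;\Rightarrow\; s^{*} = 6
```

Under these assumptions, recent accesses weigh more than old ones, so an entry that was popular long ago gradually loses its score.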

12. A text-to-speech system comprising:
- a text-to-speech engine for receiving text inputs and a plurality of attributes associated with said text and for producing a spoken output representative of said received text, wherein said attributes specify stress, gender, grammar, speed, and volume for an audio rendering of said text input; and
  
  a text-to-speech cache memory for storing selected entries corresponding to received text inputs and a score assigned to each entry wherein said entries specify spoken outputs corresponding to said selected received text inputs, wherein at least one processing interaction occurs between the speech-to-text engine and the text-to-speech cache memory when the text-to-speech engine uses the text-to-speech memory cache to generate the spoken output responsive to receiving text, said processing interactions comprising at least one interaction selected from the group consisting of a pre-processing interaction where the received text is processed into an intermediate form before being compared to entries of the text-to-speech cache that are stored in said intermediate form and a post-matching interaction where the specified spoken outputs retrieved from the text-to-speech cache memory are processed by the text-to-speech engine to generate the spoken output according to the associated attributes, and wherein each score is an updatable score computed by multiplying a previous score times a constant between zero and one and adding a number equal to the number of times a corresponding entry has been accessed since a last updating of the score.
- Dependent claims (13, 14):
  - 13. The text-to-speech system of claim 12, wherein said text-to-speech cache entries include said spoken output, and wherein the processing interaction is a pre-processing interaction, and wherein the intermediate form comprises normalized text that represents a standardized version of the text input.
  - 14. The text-to-speech system of claim 12, wherein said text-to-speech cache is shared across multiple text-to-speech processes, wherein said text-to-speech processes are performed by a plurality of different text-to-speech engines, each engine utilizing said text-to-speech cache.

15. A machine-readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform the steps of:
- receiving a text input and a plurality of attributes associated with said text input, wherein said attributes specify stress, gender, grammar, speed, and volume for an audio rendering of said text input;
  
  generating processed input by parsing and normalizing said text input;
  
  comparing said processed input to at least one entry in a text-to-speech cache memory, wherein said entry in said text-to-speech cache memory specifies a corresponding spoken output, wherein said text-to-speech cache memory contains a plurality of entries that specify spoken outputs, attributes for rendering spoken output, and a score corresponding to each entry, wherein each spoken output has an ordinal ranking and wherein each score is an updatable score computed by multiplying a previous score times a constant between zero and one and adding a number equal to the number of times a corresponding entry has been accessed since a last updating of the score;
  
  if said processed input matches one of said entries in said text-to-speech cache memory, providing said spoken output specified by said matching entry and rendering said spoken output according to said plurality of attributes associated with said text input;
  
  if said processed input fails to match one of said entries, generating an additional spoken output with a text-to-speech engine, generating an entry that specifies said additional spoken output, assigning an ordinal ranking to said additional spoken output, storing said additional spoken output and assigned ordinal ranking in said cache memory, and rendering said spoken output with the text-to-speech engine according to said plurality of attributes associated with said text input;
  
  if the cache memory is full when said additional spoken output is generated, deleting from said cache memory a spoken output having a lower ordinal ranking; and
  
  generating a display of said text input wherein each word of said display is successively highlighted in coordination with an audible rendering of a word of corresponding spoken output, coordination of said display and spoken output being based on call information stored in said cache memory.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The machine-readable storage of claim 15, wherein said text-to-speech cache entries include an intermediate output which is not a digitally encoded audio file; and wherein said text-to-speech engine converts said intermediate output to said spoken output.
  - 17. The machine-readable storage of claim 15, wherein said text-to-speech cache is shared across multiple text-to-speech processes, wherein said text-to-speech processes are performed by a plurality of different text-to-speech engines, each engine utilizing said text-to-speech cache.
  - 18. The machine-readable storage of claim 15, further comprising logging each said match of said text input with a text-to-speech cache entry.
  - 19. The machine-readable storage of claim 15, further comprising removing one of said entries in said text-to-speech cache memory.
  - 20. The machine-readable storage of claim 15, wherein each said entry in said text-to-speech cache memory has a score, said machine-readable storage further comprising periodically updating each said score.

21. A machine-readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform the steps of:
- storing a plurality of entries in a text-to-speech cache memory, wherein each one of said entries comprises a processed form specifying a spoken output wherein said processed form specifying spoken output does not comprise a digitally encoded audio file;
  
  assigning a score to each one of said plurality of entries, each score being an updatable score computed by multiplying a previous score times a constant between zero and one and adding a number equal to the number of times a corresponding entry has been accessed since a last updating of the score;
  
  receiving a text input and a plurality of attributes associated with said text input, wherein said attributes specify stress, gender, grammar, speed, and volume for an audio rendering of said text input;
  
  processing said text input to determine a form specifying a spoken output for said received text;
  
  comparing said determined form of said text input with said entries in said text-to-speech cache memory;
  
  if said text input matches one of said entries in said text-to-speech cache memory, providing said processed form specified by said matching entry to a text-to-speech engine;
  
  said text-to-speech engine converting said processed form to said spoken output and rendering said spoken output according to said plurality of attributes associated with said text input; and
  
  generating a display of said text input wherein each word of said display is successively highlighted in coordination with an audible rendering of a word of said spoken output, coordination of said display and spoken output being based on call information stored in said cache memory.
- Dependent claims (22, 23, 24, 25, 26, 27, 28):
  - 22. The machine-readable storage of claim 21, wherein the determined form of said text input comprises at least one of normalized text that represents a standardized version of the text input and an intermediate format used by the text-to-speech engine.
  - 23. The machine-readable storage of claim 21, wherein said text-to-speech cache is shared across multiple text-to-speech processes, wherein said text-to-speech processes are performed by a plurality of different text-to-speech engines, each engine utilizing said text-to-speech cache.
  - 24. The machine-readable storage of claim 21, further comprising logging each said match of said text input with a text-to-speech cache entry.
  - 25. The machine-readable storage of claim 21, wherein said text input does not match an entry in said text-to-speech cache memory, said method further comprising:
    - determining a spoken output corresponding to said text input by using the text-to-speech engine to text-to-speech convert the text input; and
    - storing an entry in said text-to-speech cache memory corresponding to said text input, wherein said entry specifies said determined spoken output.
  - 26. The machine-readable storage of claim 21, further comprising removing one of said entries in said text-to-speech cache memory.
  - 27. The machine-readable storage of claim 21, wherein each said entry in said text-to-speech cache memory has a score, said machine-readable storage further comprising periodically updating each said score.
  - 28. The machine-readable storage of claim 27, further comprising removing one of said entries in said text-to-speech cache memory having a lowest score.

29. A machine-readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform the steps of:
- storing a plurality of entries in a text-to-speech cache memory, wherein the text-to-speech cache memory is directly and locally coupled to at least one text-to-speech engine, wherein each said entry comprises a processed form specifying a spoken output, and wherein said text-to-speech cache memory contains a plurality of entries that specify spoken outputs, attributes for rendering spoken output, and callback information;
  
  assigning a score to each one of said plurality of entries;
  
  receiving a text input;
  
  processing said text input to determine a form specifying a spoken output for said received text;
  
  comparing said determined form of said text input with said entries in said text-to-speech cache memory;
  
  when at least one of the plurality of entries in said text-to-speech cache memory is matched to said determined form, retrieving the processed form for the matching entry from the text-to-speech cache memory, and using the processed form to generate said spoken output based on said attributes;
  
  when at least one of the plurality of entries in said text-to-speech cache memory is not matched to said determined form, using the at least one text-to-speech engine to generate said spoken output;
  
  logging when one of said plurality of entries in said text-to-speech cache memory is matched to said received text input;
  
  generating a display of said text input wherein each word of said display is successively highlighted in coordination with an audible rendering of a word of said spoken output, coordination of said display and spoken output being based on call information stored in said cache memory; and
  
  periodically updating said score for each one of said plurality of entries of said text-to-speech cache memory, wherein an updated score is computed by multiplying a previous score times a constant between zero and one and adding a number equal to the number of times a corresponding entry has been accessed since a last updating of the score.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Bakis, Raimo; Picheny, Michael A.; Chittaluru, Hari; Ittycheriah, Abraham; Smith, Maria E.; Epstein, Edward A.; Lawrence, Stephen G.; Friedland, Steven J.; Rutherfoord, Charles
Primary Examiner(s): Young, W. R.
Assistant Examiner(s): Wozniak, James S.
Application Number: US09/941,301
Publication Number: US 20030046077A1
Time in Patent Office: 1,714 Days
Field of Search: 704/260, 704/258, 704/261, 711/118, 711/130, 711/136
US Class Current: 704/260
CPC Class Codes: G10L 13/047 Architecture of speech synt...
