System and method for low-latency web-based text-to-speech without plugins

US 9,799,323 B2
Filed: 12/14/2015
Issued: 10/24/2017
Est. Priority Date: 12/01/2011
Status: Active Grant

Abstract

Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server.
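
The caching behavior described in the abstract can be illustrated with a minimal sketch. The abstract states only that cached audio is indexed by a unique identifier; the hashing scheme, the class and function names, and the placeholder synthesizer below are all assumptions made for illustration, not the patented implementation.

```python
import hashlib
from typing import Callable, Dict


def phrase_key(text: str, voice: str) -> str:
    """Derive a cache identifier from the phrase text and voice (assumed scheme)."""
    normalized = " ".join(text.split())  # collapse whitespace only
    return hashlib.sha256(f"{voice}\x00{normalized}".encode("utf-8")).hexdigest()


class PhraseAudioCache:
    """In-memory cache mapping an identifier to synthesized audio bytes."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}
        self.misses = 0  # number of times the synthesizer was actually called

    def get_audio(self, text: str, voice: str = "voice-a") -> bytes:
        key = phrase_key(text, voice)
        if key not in self._store:
            # Only text not seen before reaches the (hypothetical) TTS engine.
            self.misses += 1
            self._store[key] = self._synthesize(text, voice)
        return self._store[key]


def fake_synthesize(text: str, voice: str) -> bytes:
    """Stand-in for a TTS engine: returns placeholder bytes, not real audio."""
    return f"<audio voice={voice}>{text}</audio>".encode("utf-8")
```

Calling `get_audio` twice with identical text invokes the synthesizer once; the second call is served from the cache.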

20 Claims

1. A method comprising:
- receiving, at a computing device and from a client and over a network, text associated with a request for text-to-speech synthesis;
- determining a network latency caused by the network for the text received from the client to yield a determination, wherein the network latency indicates a delay in receiving the text over the network;
- performing, via a processor of the computing device, an analysis of an amount of the text to identify a plurality of intonational phrases in the text, wherein the amount of the text being analyzed and received from the client is chosen based on the network latency;
- generating, via the processor, a first file containing text-to-speech data for a first intonational phrase of the plurality of intonational phrases using a first text-to-speech voice;
- transmitting the first file to the client in response to the request; and
- while the client plays the first file, generating, via the processor, a second file containing the text-to-speech data for a second intonational phrase of the plurality of intonational phrases using a second text-to-speech voice.

2. The method of claim 1, wherein an intonational phrase is a phrase in which intonation within the phrase only depends on text inside the phrase.

3. The method of claim 1, wherein the first intonational phrase is indexed by a first identifier, wherein the second intonational phrase is indexed by a second identifier, and wherein one of the first identifier is a first unique identifier and the second identifier is a second unique identifier.

4. The method of claim 1, wherein the first file contains notification information.

5. The method of claim 4, wherein the notification information comprises synchronization data.

6. The method of claim 3, wherein the first unique identifier and the second unique identifier each comprises a text identifier and an offset index.

7. The method of claim 1, wherein the second file contains additional notification information.

8. The method of claim 1, wherein generating the second file occurs while an application plays the text-to-speech data in the first file.

9. The method of claim 1, wherein the receiving and the transmitting occur on a web server, wherein the web server deletes items saved in a cache within an expiration threshold.

10. The method of claim 1, further comprising transmitting one of the first file and the second file to an application in response to an additional request.

11. The method of claim 1, wherein boundaries between intonational phrases comprise silence.

12. The method of claim 1, further comprising:
- receiving text-to-speech settings from the client; and
- generating the first file and the second file according to the text-to-speech settings.

13. The method of claim 1, further comprising:
- generating parallel versions of the first file and the second file using different text-to-speech voices.
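
The overlap recited in claims 1 and 8, where the second file is generated while the first is being played, can be sketched as a one-step-ahead pipeline. This is an illustration under stated assumptions: `split_phrases` approximates intonational phrases by splitting at punctuation, `synthesize` is a placeholder rather than a real TTS call, and alternating between the supplied voices is a choice made here only to show a first and a second voice.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_phrases(text: str) -> List[str]:
    """Rough stand-in for intonational phrasing: split after . , ; : ! ?"""
    parts = re.split(r"(?<=[.,;:!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize(phrase: str, voice: str) -> bytes:
    """Placeholder for a TTS engine call; returns tagged bytes, not audio."""
    return f"[{voice}] {phrase}".encode("utf-8")


def stream_files(text: str, voices: List[str], play: Callable[[bytes], None]) -> int:
    """Hand phrase i to `play` while phrase i+1 is synthesized in the background."""
    phrases = split_phrases(text)
    if not phrases:
        return 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(synthesize, phrases[0], voices[0])
        for i in range(len(phrases)):
            current = pending.result()  # wait for the file that is due now
            if i + 1 < len(phrases):
                next_voice = voices[(i + 1) % len(voices)]
                # Start the next file before playback of the current one begins.
                pending = pool.submit(synthesize, phrases[i + 1], next_voice)
            play(current)  # stands in for transmitting to and playing on the client
    return len(phrases)
```

With this arrangement the listener waits only for the first phrase; each later file is normally ready by the time the previous one finishes playing.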

14. A system comprising:
- a processor; and
- a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
  - receiving, from a client and over a network, text associated with a request for text-to-speech synthesis;
  - determining a network latency caused by the network for the text received from the client to yield a determination, wherein the network latency indicates a delay in receiving the text over the network;
  - performing an analysis of an amount of the text to identify a plurality of intonational phrases in the text, wherein the amount of the text being analyzed and received from the client is chosen based on the network latency;
  - generating a first file containing text-to-speech data for a first intonational phrase of the plurality of intonational phrases using a first text-to-speech voice;
  - transmitting the first file to the client in response to the request; and
  - while the client plays the first file, generating a second file containing the text-to-speech data for a second intonational phrase of the plurality of intonational phrases using a second text-to-speech voice.

15. The system of claim 14, wherein an intonational phrase is a phrase in which intonation within the phrase only depends on text inside the phrase.

16. The system of claim 14, wherein the first intonational phrase is indexed by a first identifier, wherein the second intonational phrase is indexed by a second identifier, and wherein one of the first identifier is a first unique identifier and the second identifier is a second unique identifier.

17. The system of claim 14, wherein the first file contains notification information.

18. The system of claim 17, wherein the notification information comprises synchronization data.

19. The system of claim 16, wherein the first unique identifier and the second unique identifier each comprises a text identifier and an offset index.
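
Claims 16 through 19 describe phrase identifiers built from a text identifier plus an offset index, and files that carry synchronization data. One possible data layout is sketched below. The claims do not define these structures, so the names `PhraseId`, `PhraseFile`, and `index_phrases`, the use of a character offset as the offset index, and the word-timing shape of the synchronization data are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class PhraseId:
    text_id: str  # identifies the source text (hypothetical format)
    offset: int   # assumed: index of the phrase's first character in that text

    def __str__(self) -> str:
        return f"{self.text_id}:{self.offset}"


@dataclass
class PhraseFile:
    phrase_id: PhraseId
    audio: bytes
    # Assumed shape of synchronization data: (word, start time in milliseconds)
    sync: List[Tuple[str, int]] = field(default_factory=list)


def index_phrases(text_id: str, text: str, phrases: List[str]) -> List[PhraseId]:
    """Assign each phrase an identifier from its character offset in `text`."""
    ids: List[PhraseId] = []
    cursor = 0
    for phrase in phrases:
        # Search from the cursor so repeated phrases get distinct offsets.
        offset = text.index(phrase, cursor)
        ids.append(PhraseId(text_id, offset))
        cursor = offset + len(phrase)
    return ids
```

Because the offset differs for each occurrence, two phrases with the same wording in one text still receive distinct identifiers.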

20. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising:
- receiving, at the computing device and from a client and over a network, text associated with a request for text-to-speech synthesis;
- determining a network latency caused by the network for the text received from the client to yield a determination, wherein the network latency indicates a delay in receiving the text over the network;
- performing an analysis of an amount of the text to identify a plurality of intonational phrases in the text, wherein the amount of the text being analyzed and received from the client is chosen based on the network latency;
- generating a first file containing text-to-speech data for a first intonational phrase of the plurality of intonational phrases using a first text-to-speech voice;
- transmitting the first file to the client in response to the request; and
- while the client plays the first file, generating a second file containing the text-to-speech data for a second intonational phrase of the plurality of intonational phrases using a second text-to-speech voice.
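
All three independent claims choose the amount of text to analyze based on the measured network latency. The claims do not say how latency maps to an amount, so the sketch below assumes one plausible policy: a larger character budget when latency is higher, so that fewer round trips are needed. The function names, the linear mapping, and every default value are hypothetical.

```python
def choose_amount(latency_ms: float, min_chars: int = 80, max_chars: int = 800,
                  chars_per_ms: float = 2.0) -> int:
    """Map a measured latency to a character budget, clamped to [min_chars, max_chars]."""
    budget = int(min_chars + chars_per_ms * max(latency_ms, 0.0))
    return max(min_chars, min(budget, max_chars))


def take_amount(text: str, amount: int) -> str:
    """Return up to `amount` characters, cut back to the last space when one exists."""
    if len(text) <= amount:
        return text
    cut = text.rfind(" ", 0, amount + 1)
    # Fall back to a hard cut when the window contains no space.
    return text[:cut] if cut > 0 else text[:amount]
```

The selected portion would then be passed to phrase analysis; a real system might instead shrink the amount under high latency to get the first audio out sooner, which this sketch does not model.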

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Conkie, Alistair D.; Beutnagel, Mark Charles; Mishra, Taniya
Primary Examiner(s): ADESANYA, OLUJIMI A
Application Number: US14/967,740
Publication Number: US 20160098985A1
Time in Patent Office: 680 Days
Field of Search: 704258, 704260
CPC Class Codes:
G10L 13/04 Details of speech synthesis...
G10L 13/10 Prosody rules derived from ...
