System and method for low-latency web-based text-to-speech without plugins
Abstract
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server.
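The caching step described in the abstract can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class name `PhraseAudioCache`, the use of a SHA-256 digest over voice and text as the "unique identifier", and the `fake_synthesize` stand-in for a TTS server are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict


class PhraseAudioCache:
    """Stores synthesized audio indexed by a unique identifier (assumed: a hash)."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize      # hypothetical TTS backend call
        self._store: Dict[str, bytes] = {}
        self.misses = 0                    # number of times synthesis actually ran

    @staticmethod
    def key(text: str, voice: str) -> str:
        # Assumption: identical text plus identical voice maps to one identifier.
        return hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()

    def get_audio(self, text: str, voice: str) -> bytes:
        k = self.key(text, voice)
        if k not in self._store:
            # Cache miss: synthesize once and remember the result.
            self.misses += 1
            self._store[k] = self._synthesize(text, voice)
        return self._store[k]


def fake_synthesize(text: str, voice: str) -> bytes:
    """Stand-in for a TTS server; returns placeholder bytes, not real audio."""
    return f"[{voice}] {text}".encode("utf-8")


cache = PhraseAudioCache(fake_synthesize)
first = cache.get_audio("Hello there,", "voice-a")
again = cache.get_audio("Hello there,", "voice-a")  # served from the cache
```

Including the voice in the key is a design choice for this sketch, so that the same phrase rendered in two voices is not confused; the abstract itself only says the file is indexed by a unique identifier.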
20 Claims
1. A method comprising:
receiving, at a computing device and from a client and over a network, text associated with a request for text-to-speech synthesis;
determining a network latency caused by the network for the text received from the client to yield a determination, wherein the network latency indicates a delay in receiving the text over the network;
performing, via a processor of the computing device, an analysis of an amount of the text to identify a plurality of intonational phrases in the text, wherein the amount of the text being analyzed and received from the client is chosen based on the network latency;
generating, via the processor, a first file containing text-to-speech data for a first intonational phrase of the plurality of intonational phrases using a first text-to-speech voice;
transmitting the first file to the client in response to the request; and
while the client plays the first file, generating, via the processor, a second file containing the text-to-speech data for a second intonational phrase of the plurality of intonational phrases using a second text-to-speech voice.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
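As an illustration of the analysis step in claim 1, the sketch below picks an amount of text from a measured latency and splits that text into phrase-like units. Everything specific here is assumed for the example: the linear policy in `chars_to_analyze` (the claim only says the amount is chosen based on the latency, not in which direction), the default constants, and the punctuation heuristic standing in for real intonational-phrase detection.

```python
import re
from typing import List


def chars_to_analyze(latency_ms: float, base_chars: int = 200,
                     chars_per_ms: float = 2.0, max_chars: int = 2000) -> int:
    """Assumed policy: slower networks get a larger amount of text per round trip."""
    return min(max_chars, base_chars + int(latency_ms * chars_per_ms))


# Heuristic only: treat whitespace after punctuation as a phrase boundary.
_BOUNDARY = re.compile(r"(?<=[.!?;:,])\s+")


def intonational_phrases(text: str) -> List[str]:
    """Split text at punctuation boundaries, dropping empty pieces."""
    return [p.strip() for p in _BOUNDARY.split(text) if p.strip()]


def phrases_for_request(text: str, latency_ms: float) -> List[str]:
    """Analyze only the latency-dependent amount of text."""
    budget = chars_to_analyze(latency_ms)
    window = text[:budget]
    if budget < len(text):
        # Avoid ending the window in the middle of a word.
        cut = window.rfind(" ")
        if cut > 0:
            window = window[:cut]
    return intonational_phrases(window)


phrases = phrases_for_request("Hello, world. How are you?", latency_ms=50.0)
```

A production system would more plausibly rely on a prosodic model than on punctuation alone; the heuristic is used here only to keep the example self-contained.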
14. A system comprising:
a processor; and
a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
receiving, from a client and over a network, text associated with a request for text-to-speech synthesis;
determining a network latency caused by the network for the text received from the client to yield a determination, wherein the network latency indicates a delay in receiving the text over the network;
performing an analysis of an amount of the text to identify a plurality of intonational phrases in the text, wherein the amount of the text being analyzed and received from the client is chosen based on the network latency;
generating a first file containing text-to-speech data for a first intonational phrase of the plurality of intonational phrases using a first text-to-speech voice;
transmitting the first file to the client in response to the request; and
while the client plays the first file, generating a second file containing the text-to-speech data for a second intonational phrase of the plurality of intonational phrases using a second text-to-speech voice.
Dependent claims: 15, 16, 17, 18, 19
20. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising:
receiving, at the computing device and from a client and over a network, text associated with a request for text-to-speech synthesis;
determining a network latency caused by the network for the text received from the client to yield a determination, wherein the network latency indicates a delay in receiving the text over the network;
performing an analysis of an amount of the text to identify a plurality of intonational phrases in the text, wherein the amount of the text being analyzed and received from the client is chosen based on the network latency;
generating a first file containing text-to-speech data for a first intonational phrase of the plurality of intonational phrases using a first text-to-speech voice;
transmitting the first file to the client in response to the request; and
while the client plays the first file, generating a second file containing the text-to-speech data for a second intonational phrase of the plurality of intonational phrases using a second text-to-speech voice.
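The overlap between playback of one file and generation of the next, recited in the final step of each independent claim, might be sketched as below. This is an assumed arrangement, not the patented implementation: `stream_audio_files`, the single-worker thread pool, and the `fake_synthesize` stub are hypothetical, and yielding a file to the caller stands in for transmitting it to the client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Tuple


def stream_audio_files(
    phrases: List[Tuple[str, str]],
    synthesize: Callable[[str, str], bytes],
) -> Iterator[bytes]:
    """Yield one audio file per (phrase, voice) pair.

    Synthesis of the next phrase is submitted before the current file is
    yielded, so it can run while the caller transmits or plays that file.
    """
    if not phrases:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(synthesize, *phrases[0])
        for nxt in phrases[1:]:
            current = pending.result()
            pending = pool.submit(synthesize, *nxt)  # runs during playback
            yield current
        yield pending.result()


def fake_synthesize(text: str, voice: str) -> bytes:
    """Stand-in for a TTS engine; returns placeholder bytes, not real audio."""
    return f"{voice}:{text}".encode("utf-8")


files = list(stream_audio_files(
    [("Hello,", "voice-a"), ("how are you?", "voice-b")],
    fake_synthesize,
))
```

Each phrase carries its own voice name so that the first and second files can use different voices, as the claims recite; how voices are actually selected is not specified in this section.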
Specification