Providing personalized voice font for text-to-speech applications
First Claim
1. A method implemented on a computing device having instructions executable by a processor for synthesizing speech from a text, the speech being in a specified voice, the method comprising:
- accessing a text-to-speech application through a browser in communication with a network by a user of a client computer;
- generating a personalized voice font based on the one or more waveforms, wherein the user creates a personalized speech audio data at the client computer by speaking a plurality of predetermined utterances into a microphone connected to the client computer, the personalized speech audio data is encoded into a waveform at the client computer, and the waveform is transmitted to a voice font generator of the text-to-speech application over the network, wherein generating the personal voice font after the waveform is transmitted to the voice font generator comprises;
  - associating the personalized speech audio data transmitted to the voice font generator with corresponding basic phonetic units, wherein the plurality of predetermined utterances is parsed into one or more basic phonetic units comprising at least one of phonemes, diphones, semi-syllables, or syllables,
  - identifying the one or more basic phonetic units based on corresponding characteristics of a basic phonetic unit, and
  - associating the one or more basic phonetic units with corresponding segments of the waveform in a data structure, wherein the data structure comprises a table having one column correspond to one or more identifiers of the one or more basic phonetic units, and having another column correspond to the segments of the waveform, wherein each identifier corresponds to one or more segments of the waveform in the table;
- selecting the personalized voice font, wherein a selection is made by the user via the browser of the client computer;
- receiving through the browser of the client computer one or more waveforms characteristic of a voice of a person selected by the user;
- submitting the text from the user's client computer via the browser to the text-to-speech application;
- synthesizing speech in the text-to-speech application based on the selected personalized voice font;
- concatenating the personalized voice font into a chain according to an order of basic phonetic units in the text, the basic phonetic units are parsed into phonemes, diphones, semi-syllables, or syllables and identified by an associated diphone, a triphone, a semi-syllable, or a syllable that is associated with a corresponding segment in a waveform;
- downloading concatenated speech segments from a remote computer to the client computer;
- transmitting synthesized speech back to the user of the client computer through the browser; and
- delivering to the user from the text-to-speech application through the browser of the client computer the personalized voice font, whereby speech can be synthesized from text, the speech being in the voice of the selected person, the speech being synthesized using the personalized voice font.
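The table recited in the claim (one column of basic-phonetic-unit identifiers, another of waveform segments, with each identifier mapping to one or more segments) and the concatenation step can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the function names `build_voice_font` and `concatenate`, the representation of a segment as a `(start, end)` sample range, the diphone labels, and the naive choice of the first stored segment are not taken from the patent.

```python
from collections import defaultdict


def build_voice_font(aligned_units):
    """Build the two-column table: unit identifier -> list of waveform segments.

    `aligned_units` is a hypothetical list of (unit_id, start, end) tuples,
    where start/end are sample offsets into the uploaded waveform.
    """
    table = defaultdict(list)
    for unit_id, start, end in aligned_units:
        table[unit_id].append((start, end))  # one identifier, one or more segments
    return dict(table)


def concatenate(voice_font, waveform, unit_sequence):
    """Chain waveform segments in the order of the units found in the text."""
    chain = []
    for unit_id in unit_sequence:
        start, end = voice_font[unit_id][0]  # assumption: naively take the first segment
        chain.extend(waveform[start:end])
    return chain


# Stand-in data: integers play the role of audio samples.
waveform = list(range(100))
aligned = [("h-e", 0, 10), ("e-l", 10, 25), ("l-o", 25, 40), ("h-e", 60, 70)]

voice_font = build_voice_font(aligned)
speech = concatenate(voice_font, waveform, ["h-e", "e-l", "l-o"])
```

A practical system would presumably choose among the several segments stored for an identifier and smooth the joins between them; the sketch only shows the lookup-and-chain structure.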
Abstract
A method for synthesizing speech from text includes receiving one or more waveforms characteristic of a voice of a person selected by a user, generating a personalized voice font based on the one or more waveforms, and delivering the personalized voice font to the user's computer, whereby speech can be synthesized from text, the speech being in the voice of the selected person, the speech being synthesized using the personalized voice font. A system includes a text-to-speech (TTS) application operable to generate a voice font based on speech waveforms transmitted from a client computer remotely accessing the TTS application.
320 Citations
28 Claims
1. A method implemented on a computing device having instructions executable by a processor for synthesizing speech from a text, the speech being in a specified voice, the method comprising:
- accessing a text-to-speech application through a browser in communication with a network by a user of a client computer;
- generating a personalized voice font based on the one or more waveforms, wherein the user creates a personalized speech audio data at the client computer by speaking a plurality of predetermined utterances into a microphone connected to the client computer, the personalized speech audio data is encoded into a waveform at the client computer, and the waveform is transmitted to a voice font generator of the text-to-speech application over the network, wherein generating the personal voice font after the waveform is transmitted to the voice font generator comprises;
  - associating the personalized speech audio data transmitted to the voice font generator with corresponding basic phonetic units, wherein the plurality of predetermined utterances is parsed into one or more basic phonetic units comprising at least one of phonemes, diphones, semi-syllables, or syllables,
  - identifying the one or more basic phonetic units based on corresponding characteristics of a basic phonetic unit, and
  - associating the one or more basic phonetic units with corresponding segments of the waveform in a data structure, wherein the data structure comprises a table having one column correspond to one or more identifiers of the one or more basic phonetic units, and having another column correspond to the segments of the waveform, wherein each identifier corresponds to one or more segments of the waveform in the table;
- selecting the personalized voice font, wherein a selection is made by the user via the browser of the client computer;
- receiving through the browser of the client computer one or more waveforms characteristic of a voice of a person selected by the user;
- submitting the text from the user's client computer via the browser to the text-to-speech application;
- synthesizing speech in the text-to-speech application based on the selected personalized voice font;
- concatenating the personalized voice font into a chain according to an order of basic phonetic units in the text, the basic phonetic units are parsed into phonemes, diphones, semi-syllables, or syllables and identified by an associated diphone, a triphone, a semi-syllable, or a syllable that is associated with a corresponding segment in a waveform;
- downloading concatenated speech segments from a remote computer to the client computer;
- transmitting synthesized speech back to the user of the client computer through the browser; and
- delivering to the user from the text-to-speech application through the browser of the client computer the personalized voice font, whereby speech can be synthesized from text, the speech being in the voice of the selected person, the speech being synthesized using the personalized voice font.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
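The claim recites parsing utterances into basic phonetic units such as phonemes or diphones. As a rough illustration of what that parsing could look like, the sketch below maps words to phonemes through a toy lexicon and then pairs adjacent phonemes into diphones. The lexicon entries, the `_` silence marker, and the function names `to_phonemes` and `to_diphones` are all hypothetical; the claim does not specify how the parsing is performed.

```python
# Hypothetical toy pronunciation lexicon (illustrative entries only).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def to_phonemes(text):
    """Parse text into a flat phoneme sequence using the toy lexicon."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])  # raises KeyError for unknown words
    return phonemes


def to_diphones(phonemes, boundary="_"):
    """Pair each phoneme with its successor, padding with a silence marker."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


phonemes = to_phonemes("hello world")
diphones = to_diphones(to_phonemes("hello"))
```

The same unit sequence would serve two purposes under this reading: labeling segments of the recorded predetermined utterances when the voice font is generated, and ordering the segments when text is later synthesized.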
13. A computer-readable storage medium for storing computer-executable instructions that, when executed, cause a computer to perform a process comprising:
- receiving via a microphone at a user's computer, audio input corresponding to a voice of a selected speaker, wherein a personalized speech audio data is created by speaking a plurality of predetermined utterances into the microphone of the user's computer;
- encoding the audio input into a waveform;
- generating a personalized voice font based on the waveform;
- accessing a text-to-speech application through a browser on the user's computer, wherein the browser is in communication with a network;
- transmitting the waveform to a voice font generator of a text-to-speech (TTS) engine residing on a remote computer that is in communication with the browser of the user's computer via the network to generate the personalized voice font, wherein generating the personalized voice font after transmitting the waveform to the voice font generator comprises;
  - associating the personalized speech audio data transmitted to the voice font generator with corresponding basic phonetic units, wherein the plurality of predetermined utterances is parsed into one or more basic phonetic units comprising at least one of phonemes, diphones, semi-syllables, or syllables,
  - identifying the one or more basic phonetic units based on corresponding characteristics of a basic phonetic unit, and
  - associating the one or more basic phonetic units with corresponding segments of the waveform in a data structure, wherein the data structure comprises a table having one column correspond to one or more identifiers of the one or more basic phonetic units, and having another column correspond to the segments of the waveform, wherein each identifier corresponds to one or more segments of the waveform in the table;
- transmitting a text from the user's computer to the TTS engine via the network;
- selecting the personalized voice font using a voice font selector, wherein the voice font selector is in communication with the browser of the user's computer via the network;
- instructing the TTS engine to generate synthesized speech based on the text transmitted to the TTS engine;
- concatenating the personalized voice font into a chain according to an order of the basic phonetic units in the text, the basic phonetic units are parsed into phonemes, diphones, semi-syllables, or syllables and identified by an associated diphone, a triphone, a semi-syllable, or a syllable that is associated with a corresponding segment in a waveform;
- downloading concatenated speech segments to the user's computer; and
- receiving to the user's computer via the network synthesized speech from the TTS engine, the synthesized speech corresponding to the text and being synthesized with the personalized voice font representative of the selected speaker's voice.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22
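The step of encoding the audio input into a waveform before transmission can be pictured with a short sketch that packs samples into an in-memory WAV payload using Python's standard `wave` module. The claim does not name a file format, sample rate, or bit depth, so the choice of 16-bit mono PCM at 16 kHz, the function name `encode_wav`, and the synthetic tone standing in for microphone input are all assumptions.

```python
import io
import math
import struct
import wave


def encode_wav(samples, sample_rate=16000):
    """Encode float samples in [-1.0, 1.0] as 16-bit mono PCM WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        # Scale to the int16 range and clamp to avoid overflow.
        pcm = [max(-32768, min(32767, int(s * 32767))) for s in samples]
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()


# Stand-in for microphone input: 0.1 s of a 440 Hz tone at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
payload = encode_wav(tone)  # bytes that a client could upload over the network
```

The resulting `payload` is what a client would send to the remote voice font generator in this reading of the claim; the transport itself is left out of the sketch.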
23. A system for synthesizing speech from a text comprising:
- a server in communication via a network, with a browser on a client computer of a user;
- a text-to-speech (TTS) application, in communication with the client computer of the user, operable to generate a voice font based on speech waveforms, wherein the user creates a personalized speech audio data on the client computer, and the personalized speech audio data is encoded into one or more waveforms at the client computer, wherein the waveforms are transmitted from the client computer remotely accessing a voice font generator of the TTS application via the network, wherein generating the voice font after the waveforms are transmitted comprises;
  - associating the waveforms transmitted to the voice font generator with corresponding basic phonetic units, wherein the plurality of predetermined utterances is parsed into one or more basic phonetic units comprising at least one of phonemes, diphones, semi-syllables, or syllables,
  - identifying the one or more basic phonetic units based on corresponding characteristics of a basic phonetic unit, and
  - associating the one or more basic phonetic units with corresponding segments of the waveforms in a data structure, wherein the data structure comprises a table having one column correspond to one or more identifiers of the one or more basic phonetic units, and having another column correspond to the segments of the waveforms, wherein each identifier corresponds to one or more segments of the waveforms in the table;
- a text to speech engine to concatenate a personalized voice font into a chain according to an order of the basic phonetic units in the text, the basic phonetic units are parsed into phonemes, diphones, semi-syllables, or syllables and identified by an associated diphone, a triphone, a semi-syllable, or a syllable that is associated with a corresponding segment in a waveform;
- the text to speech engine to download concatenated speech segments to the client computer; and
- a TTS web service having a user interface, wherein the user interface is a function selector, a voice font selector and other services configured to allow a user to remotely perform text-to-speech through the network.
Dependent claims: 24, 25, 26, 27, 28
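One way to picture the web service's function selector and voice font selector is as two levels of lookup: the first chooses which operation to run, the second chooses which stored voice font the operation uses. The sketch below is an assumption-laden illustration, not the claimed system: the class name `TTSWebService`, the operation names `create_font` and `synthesize`, and the in-memory storage are hypothetical, and the network and browser layers are omitted.

```python
class TTSWebService:
    """Toy in-memory service with a function selector and a voice font selector."""

    def __init__(self):
        self.voice_fonts = {}  # font name -> {unit identifier: list of samples}
        self.functions = {     # function selector: operation name -> handler
            "create_font": self.create_font,
            "synthesize": self.synthesize,
        }

    def handle(self, function, **params):
        """Dispatch a request to the operation the user selected."""
        if function not in self.functions:
            raise ValueError(f"unknown function: {function}")
        return self.functions[function](**params)

    def create_font(self, name, units):
        """Store a voice font under a user-chosen name."""
        self.voice_fonts[name] = dict(units)
        return name

    def synthesize(self, font, unit_sequence):
        """Look up the selected font and chain its segments in order."""
        table = self.voice_fonts[font]  # voice font selector
        chain = []
        for unit_id in unit_sequence:
            chain.extend(table[unit_id])
        return chain


service = TTSWebService()
service.handle("create_font", name="alice", units={"a-b": [1, 2], "b-c": [3]})
audio = service.handle("synthesize", font="alice", unit_sequence=["a-b", "b-c"])
```

Keeping the two selectors as plain dictionaries makes it easy to register the "other services" the claim mentions by adding further entries to `functions`.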
Specification