System and method for synthesizing human speech using multiple speakers and context
Abstract
A system and method for realistic speech synthesis that converts text into synthetic human speech with qualities appropriate to the context, such as the language and dialect of the speaker, and that expands a speaker's phonetic inventory to produce more natural-sounding speech.
31 Claims
1. A method of synthesizing speech from text, comprising the steps of:
- receiving text from which speech will be synthesized;
- selecting, based on the received text, one or more scenario parameters, wherein the one or more scenario parameters are selected from the group consisting of language, dialect, accent, phonetic reduction, domain, context, and speaker number;
- identifying text metadata within the received text, wherein the text metadata comprises elements other than words within the text;
- parsing the received text, other than the identified text metadata, into a plurality of corresponding phonetic components;
- merging said plurality of phonetic components with breathing and non-speech effects to produce a transcript of phoneme segment strings corresponding to the received text;
- producing prosody contour data from said one or more selected scenario parameters and said transcript of phoneme segment strings;
- producing stitched filter data from said one or more selected scenario parameters and said transcript of phoneme segment strings;
- synthesizing speech from said stitched filter data and said prosody contour data; and
- outputting said synthesized speech from a playback device.
Dependent claims: 2, 3, 4
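The front-end steps recited in claim 1 (identifying text metadata, parsing the remaining words into phonetic components, and merging those components with breathing and non-speech effects) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the two-word `LEXICON`, the `<breath>` and `<pause>` markers, the rule that punctuation is the only metadata, and the choice to place a breath at the start of the utterance and after sentence-final punctuation. The prosody contour, stitched filter, and synthesis steps are not modeled.

```python
import re

# Hypothetical toy lexicon (ARPAbet-style symbols); a real system would
# use a full pronunciation dictionary or a grapheme-to-phoneme model.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

BREATH = "<breath>"  # hypothetical marker for a breathing effect
PAUSE = "<pause>"    # hypothetical marker for a non-speech pause

# A token is either a word (letters and apostrophes) or a single
# non-word, non-space character, treated here as text metadata.
TOKEN_RE = re.compile(r"[A-Za-z']+|[^A-Za-z'\s]")


def split_metadata(text):
    """Tag each token as a word or as metadata (anything other than a word)."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        kind = "word" if (tok[0].isalpha() or tok[0] == "'") else "meta"
        tokens.append((kind, tok))
    return tokens


def build_transcript(text):
    """Parse words into phonetic components and merge in non-speech effects."""
    transcript = [BREATH]  # assumed inhalation before the utterance
    for kind, tok in split_metadata(text):
        if kind == "word":
            # Unknown words map to a placeholder rather than being guessed.
            transcript.extend(LEXICON.get(tok.lower(), ["<unk>"]))
        elif tok in {",", ";"}:
            transcript.append(PAUSE)
        elif tok in {".", "!", "?"}:
            transcript.extend([PAUSE, BREATH])
        # Other metadata is dropped in this sketch.
    return transcript


if __name__ == "__main__":
    print(build_transcript("Hello, world."))
```

In this sketch the metadata is not parsed into phonetic components at all; it only decides where the non-speech markers go, which is one possible reading of "parsing the received text, other than the identified text metadata".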
5. A method for synthesizing speech from text, comprising the steps of:
- providing a computer having a first database and a second database stored in the memory thereof and in which data is stored, said data in said first database representing a set of signal feature candidates representative of a single speaker, and said data in said second database representing a second set of signal feature candidates representative of multiple speakers;
- receiving text from which speech will be synthesized;
- selecting, based on the received text, one or more scenario parameters, wherein the one or more scenario parameters are selected from the group consisting of language, dialect, accent, phonetic reduction, domain, context, and speaker number;
- identifying text metadata within the received text, wherein the text metadata comprises elements other than words within the text;
- parsing the received text, other than the identified text metadata, into a plurality of target phonetic components;
- analyzing said single speaker signal feature candidates from said first database to determine whether a corresponding single speaker signal feature candidate exists for each target phonetic component;
- retrieving from said second database a replacement signal feature candidate from said second set of signal feature candidates for any target phonetic component that does not have a corresponding single speaker signal feature candidate;
- synthesizing speech from at least one of said corresponding single signal feature candidates and said replacement signal feature candidates, the speech comprising prosody contour data and stitched filter data from said one or more selected scenario parameters.
Dependent claims: 6, 7, 8, 9
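The analyzing and retrieving steps of claim 5 amount to a lookup with a fallback: use the single speaker's candidate when one exists, otherwise take a replacement from the multi-speaker database. The sketch below shows that control flow only. The dictionary layout, the feature values, and the function name `select_candidates` are hypothetical, and the claim itself does not say how a replacement is chosen among several multi-speaker candidates.

```python
# Hypothetical databases: phonetic component -> stored signal features.
SINGLE_SPEAKER_DB = {"HH": [0.1, 0.4], "AH": [0.2, 0.5]}
MULTI_SPEAKER_DB = {"HH": [0.9, 0.9], "ZH": [0.3, 0.7]}


def select_candidates(targets, single_db, multi_db):
    """Pick one signal feature candidate per target phonetic component.

    The single-speaker database is consulted first; the multi-speaker
    database supplies a replacement only when the first has no entry.
    """
    selected = []
    for unit in targets:
        if unit in single_db:
            selected.append((unit, "single", single_db[unit]))
        elif unit in multi_db:
            selected.append((unit, "replacement", multi_db[unit]))
        else:
            # Neither database covers this component; the claim does not
            # address this case, so the sketch simply fails loudly.
            raise KeyError(f"no signal feature candidate for {unit!r}")
    return selected


if __name__ == "__main__":
    for row in select_candidates(["HH", "ZH"], SINGLE_SPEAKER_DB, MULTI_SPEAKER_DB):
        print(row)
```

Tagging each result with its origin keeps the later synthesis step free to treat replacement candidates differently, for example when smoothing joins between speakers; that use is an assumption, not something the claim recites.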
10. A method for synthesizing speech from text, comprising the steps of:
- providing a computer having a first database and a second database stored in the memory thereof and in which data is stored, said data in said first database representing a set of signal feature candidates representative of a single speaker, and said data in said second database representing a second set of signal feature candidates;
- receiving text from which speech will be synthesized;
- selecting, based on the received text, one or more scenario parameters, wherein the one or more scenario parameters are selected from the group consisting of language, dialect, accent, phonetic reduction, domain, context, and speaker number;
- identifying text metadata within the received text, wherein the text metadata comprises elements other than words within the text;
- parsing the received text, other than the identified text metadata, into a plurality of target phonetic components;
- analyzing said single speaker signal feature candidates from said first database to determine whether a corresponding single speaker signal feature candidate of sufficient quality exists for each target phonetic component;
- retrieving from said second database a replacement signal feature candidate from said second set of signal feature candidates for any target phonetic component that does not have a corresponding single speaker signal feature candidate of sufficient quality; and
- synthesizing speech from at least one of the corresponding single speaker signal feature candidates and the replacement signal feature candidates, the speech comprising prosody contour data and stitched filter data from said one or more selected scenario parameters.
Dependent claims: 11, 12, 13, 14, 15
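Claim 10 replaces the existence test with a quality test: a single-speaker candidate is used only if it is "of sufficient quality". The claim does not define that measure, so the sketch below assumes a numeric score in the range 0 to 1 and a hypothetical threshold `min_quality`. The `Candidate` class and the `choose` function are likewise illustrative names, not terms from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    unit: str        # target phonetic component this candidate realizes
    features: tuple  # stored signal features (placeholder values here)
    quality: float   # hypothetical quality score in [0, 1]


def best(candidates):
    """Return the highest-quality candidate, or None if the list is empty."""
    return max(candidates, key=lambda c: c.quality, default=None)


def choose(unit, single_db, second_db, min_quality=0.6):
    """Prefer a sufficiently good single-speaker candidate, else a replacement."""
    own = best(single_db.get(unit, []))
    if own is not None and own.quality >= min_quality:
        return own
    # Missing or below threshold: fall back to the second database.
    return best(second_db.get(unit, []))


if __name__ == "__main__":
    single_db = {
        "AA": [Candidate("AA", (1.0,), 0.9)],
        "ZH": [Candidate("ZH", (2.0,), 0.2)],  # present, but poor quality
    }
    second_db = {"ZH": [Candidate("ZH", (3.0,), 0.8)]}
    print(choose("AA", single_db, second_db))
    print(choose("ZH", single_db, second_db))
```

Unlike claim 5, claim 10 does not require the second set of candidates to come from multiple speakers, which is why the fallback store is named `second_db` here.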
16. A non-transitory computer-readable storage medium containing program code comprising:
- program code for receiving text from which speech will be synthesized;
- program code for selecting, based on the received text, one or more scenario parameters;
- program code for identifying text metadata within the received text, wherein the text metadata comprises elements other than words within the text;
- program code for parsing the received text, other than the identified text metadata, into a plurality of corresponding phonetic components;
- program code for merging said plurality of phonetic components with breathing and non-speech effects to produce a transcript of phoneme segment strings corresponding to the received text;
- program code for producing prosody contour data from said one or more selected scenario parameters and said transcript of phoneme segment strings;
- program code for producing stitched filter data from said one or more selected scenario parameters and said transcript of phoneme segment strings;
- program code for synthesizing speech from said stitched filter data and said prosody contour data;
- program code for outputting said synthesized speech from a playback device.
Dependent claims: 17, 18, 19, 20
21. A non-transitory computer-readable storage medium containing program code, comprising:
- program code for receiving text from which speech will be synthesized;
- program code for selecting, based on the received text, one or more scenario parameters, wherein the one or more scenario parameters are selected from the group consisting of language, dialect, accent, phonetic reduction, domain, context, and speaker number;
- program code for identifying text metadata within the received text, wherein the text metadata comprises elements other than words within the text;
- program code for parsing the received text, other than the identified text metadata, into a plurality of target phonetic components;
- program code for analyzing a single speaker's signal feature candidates, said single speaker's signal feature candidates stored in a database, to determine whether a corresponding single speaker signal feature candidate exists for each said target phonetic component;
- program code for retrieving from a second set of signal feature candidates representative of multiple speakers' signal feature candidates, said second set of signal feature candidates stored in database, a replacement signal feature candidate for any target phonetic component that does not have a corresponding single speaker signal feature candidate;
- program code for synthesizing speech from at least one of said corresponding single speaker signal feature candidates and said replacement signal feature candidates, the speech comprising prosody contour data and stitched filter data from said one or more selected scenario parameters.
Dependent claims: 22, 23, 24, 25
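The selecting step names seven scenario parameters: language, dialect, accent, phonetic reduction, domain, context, and speaker number. One way to carry them through a program is a record with one optional field per parameter, since the claim requires only "one or more" to be selected. The sketch below is an assumption throughout: the field types, the `select_scenario` function, and in particular the heuristic that reads "speaker number" as a count of distinct `Name:` labels at the start of lines are illustrative and are not taken from the patent.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScenarioParameters:
    # One optional field per parameter named in the claim; None means
    # the parameter was not selected for this text.
    language: Optional[str] = None
    dialect: Optional[str] = None
    accent: Optional[str] = None
    phonetic_reduction: Optional[str] = None
    domain: Optional[str] = None
    context: Optional[str] = None
    speaker_number: Optional[int] = None


# Hypothetical convention: a capitalized name followed by a colon at the
# start of a line marks a speaker turn, as in a script or chat log.
SPEAKER_LABEL = re.compile(r"^([A-Z][A-Za-z]*):", re.MULTILINE)


def select_scenario(text):
    """Select scenario parameters from the text (speaker count only)."""
    speakers = set(SPEAKER_LABEL.findall(text))
    # Text without labels is treated as a single speaker.
    return ScenarioParameters(speaker_number=max(1, len(speakers)))


if __name__ == "__main__":
    dialogue = "Alice: Hi.\nBob: Hello.\nAlice: Bye."
    print(select_scenario(dialogue))
```

Only one parameter is filled in here, which still satisfies "one or more"; the remaining fields stay unset rather than being guessed.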
26. A non-transitory computer-readable storage medium containing program code, comprising:
- program code for receiving text from which speech will be synthesized;
- program code for selecting, based on the received text, one or more scenario parameters, wherein the one or more scenario parameters are selected from the group consisting of language, dialect, accent, phonetic reduction, domain, context, and speaker number;
- program code for identifying text metadata within the received text, wherein the text metadata comprises elements other than words within the text;
- program code for parsing the received text, other than the identified text metadata, into a plurality of target phonetic components;
- program code for analyzing a single speaker's signal feature candidates, said single speaker's signal feature candidates stored in a database, to determine whether a corresponding single speaker signal feature candidate of sufficient quality exists for each said target phonetic component;
- program code for retrieving from a second set of signal feature candidates, said second set of signal feature candidates stored in database, a replacement signal feature candidate for any target phonetic component that does not have a corresponding single speaker signal feature candidate of sufficient quality;
- program code for synthesizing speech from at least one of said corresponding single speaker signal feature candidates and said replacement signal feature candidates, the speech comprising prosody contour data and stitched filter data from said one or more selected scenario parameters.
Dependent claims: 27, 28, 29, 30, 31
Specification