Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis
Abstract
A method, apparatus and computer program product generate an audible speech word that corresponds to text. The method includes providing a text word and, in response to the text word, processing pre-recorded speech segments derived from a plurality of speakers to selectively concatenate speech segments, based on at least one cost function, to form audio data for generating an audible speech word that corresponds to the text word. A data structure is also provided for use in a concatenative text-to-speech system. The data structure includes a plurality of speech segments derived from a plurality of speakers, where each speech segment has an associated attribute vector comprising at least one attribute vector element that identifies the speaker from which the speech segment was derived.
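The data structure described in the abstract can be illustrated with a minimal sketch. The class and field names below (`AttributeVector`, `SpeechSegment`, `speaker_id`, `extras`, `unit`, `samples`) are hypothetical and chosen only for illustration; the source states only that each segment carries an attribute vector with at least one element identifying the speaker.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class AttributeVector:
    # The one element the abstract requires: which speaker the segment came from.
    speaker_id: str
    # Hypothetical placeholder for any further attribute vector elements.
    extras: Tuple[float, ...] = ()


@dataclass(frozen=True)
class SpeechSegment:
    unit: str                    # hypothetical label for the speech unit
    samples: Tuple[int, ...]     # pre-recorded audio samples (illustrative)
    attributes: AttributeVector  # associated attribute vector


def speakers_in(database: Iterable[SpeechSegment]) -> Set[str]:
    """Return the set of speakers represented in a multi-speaker database."""
    return {segment.attributes.speaker_id for segment in database}


database = [
    SpeechSegment("h", (1, 2), AttributeVector("speaker_A")),
    SpeechSegment("i", (3, 4), AttributeVector("speaker_B")),
]
```

Keeping the speaker identity inside the attribute vector, as assumed here, lets a selection routine read it alongside any other attributes when it scores candidate segments.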
17 Claims
1. A method comprising:
receiving a text word; and
in response to receiving the text word, concatenating, by a data processor coupled to a memory, pre-recorded speech segments that are derived from a plurality of speakers to form audio data configured to generate an audible speech word that corresponds to the text word,
wherein concatenating the pre-recorded speech segments comprises selecting speech segments for concatenation based on at least one cost function,
where the at least one cost function comprises a first cost function where a cost of a speech segment from a particular speaker of the plurality of speakers is based at least in part on a size of a dataset comprising pre-recorded speech segments from the particular speaker as compared to sizes of other datasets each comprising pre-recorded speech segments from other speakers in the plurality of speakers,
where the first cost function assigns a first cost for a first speech segment from a first speaker of the plurality of speakers that is lower than a second cost for a second speech segment from a second speaker of the plurality of speakers,
where a first size of pre-recorded speech segments in a first dataset from the first speaker is greater than a second size of pre-recorded speech segments in a second dataset from the second speaker.
10. An apparatus comprising:
a memory configured to store pre-recorded speech segments that are derived from a plurality of speakers; and
a data processor configured to, in response to receiving a text word, concatenate the pre-recorded speech segments to form audio data configured to generate an audible speech word that corresponds to the text word,
wherein concatenating the pre-recorded speech segments comprises selecting speech segments for concatenation based on at least one cost function,
where the at least one cost function comprises a first cost function where a cost of a speech segment from a particular speaker of the plurality of speakers is based at least in part on a size of a dataset comprising pre-recorded speech segments from the particular speaker as compared to sizes of other datasets each comprising pre-recorded speech segments from other speakers in the plurality of speakers,
where the first cost function assigns a first cost for a first speech segment from a first speaker of the plurality of speakers that is lower than a second cost for a second speech segment from a second speaker of the plurality of speakers,
where a first size of pre-recorded speech segments in a first dataset from the first speaker is greater than a second size of pre-recorded speech segments in a second dataset from the second speaker.
14. A computer readable medium tangibly embodying a program of instructions executable by a machine to perform operations, the operations comprising:
in response to receiving a text word, concatenating pre-recorded speech segments that are derived from a plurality of speakers to form audio data configured to generate an audible speech word that corresponds to the text word,
wherein concatenating the pre-recorded speech segments comprises selecting speech segments for concatenation based on at least one cost function,
where the at least one cost function comprises a first cost function where a cost of a speech segment from a particular speaker of the plurality of speakers is based at least in part on a size of a dataset comprising pre-recorded speech segments from the particular speaker as compared to sizes of other datasets each comprising pre-recorded speech segments from other speakers in the plurality of speakers,
where the first cost function assigns a first cost for a first speech segment from a first speaker of the plurality of speakers that is lower than a second cost for a second speech segment from a second speaker of the plurality of speakers,
where a first size of pre-recorded speech segments in a first dataset from the first speaker is greater than a second size of pre-recorded speech segments in a second dataset from the second speaker.
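The first cost function recited in the independent claims can be illustrated with a small sketch. Everything specific here is an assumption made for illustration, not the patent's stated formula: dataset size is measured as a count of segments, the cost is one minus the speaker's share of all segments, and selection simply takes the cheapest candidate for each unit. A practical system would likely combine this with other cost functions, which the claims allow ("at least one cost function") but which are omitted here.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# A segment is (speaker_id, unit_label, audio_samples); all names are hypothetical.
Segment = Tuple[str, str, List[int]]


def speaker_costs(segments: Sequence[Segment]) -> Dict[str, float]:
    """Assign each speaker a cost that falls as that speaker's dataset grows
    relative to the other speakers' datasets (assumed form: 1 - share)."""
    sizes = Counter(speaker for speaker, _, _ in segments)
    total = sum(sizes.values())
    return {speaker: 1.0 - count / total for speaker, count in sizes.items()}


def select_and_concatenate(units: Sequence[str],
                           segments: Sequence[Segment]) -> List[int]:
    """For each requested unit, pick the candidate segment with the lowest
    speaker cost, then concatenate the chosen audio in order."""
    costs = speaker_costs(segments)
    audio: List[int] = []
    for unit in units:
        candidates = [s for s in segments if s[1] == unit]
        if not candidates:
            raise KeyError(f"no pre-recorded segment for unit {unit!r}")
        best = min(candidates, key=lambda s: costs[s[0]])
        audio.extend(best[2])
    return audio


segments: List[Segment] = [
    ("A", "h", [1]), ("A", "i", [2]), ("A", "h", [3]),  # speaker A: 3 segments
    ("B", "h", [9]), ("B", "i", [8]),                    # speaker B: 2 segments
]
costs = speaker_costs(segments)
audio = select_and_concatenate(["h", "i"], segments)
```

With these inputs speaker A has the larger dataset and therefore the lower cost, so both units are drawn from speaker A's segments.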
Specification