Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis

US 7,716,052 B2
Filed: 04/07/2005
Issued: 05/11/2010
Est. Priority Date: 04/07/2005
Status: Expired due to Fees

Abstract

A method, apparatus and a computer program product to generate an audible speech word that corresponds to text. The method includes providing a text word and, in response to the text word, processing pre-recorded speech segments that are derived from a plurality of speakers to selectively concatenate together speech segments based on at least one cost function to form audio data for generating an audible speech word that corresponds to the text word. A data structure is also provided for use in a concatenative text-to-speech system that includes a plurality of speech segments derived from a plurality of speakers, where each speech segment includes an associated attribute vector each of which is comprised of at least one attribute vector element that identifies the speaker from which the speech segment was derived.
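The data structure described in the abstract can be pictured with a minimal sketch. It assumes a Python representation in which each segment carries an attribute vector whose first element names the source speaker; the class names `AttributeVector` and `SpeechSegment`, the helper `dataset_sizes`, the `style` default, and the sample data are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AttributeVector:
    # Element identifying the speaker the segment was derived from.
    speaker_id: str
    # Optional further element; the "neutral" default is an assumption.
    style: str = "neutral"


@dataclass(frozen=True)
class SpeechSegment:
    unit: str                   # label of the unit, e.g. a phoneme
    samples: Tuple[float, ...]  # placeholder for the recorded audio
    attributes: AttributeVector


def dataset_sizes(database: List[SpeechSegment]) -> Dict[str, int]:
    """Count how many segments each speaker contributed."""
    sizes: Dict[str, int] = {}
    for seg in database:
        spk = seg.attributes.speaker_id
        sizes[spk] = sizes.get(spk, 0) + 1
    return sizes


# Toy multi-speaker database (made-up values).
database = [
    SpeechSegment("HH", (0.0, 0.1), AttributeVector("speaker_a")),
    SpeechSegment("AY", (0.2, 0.1), AttributeVector("speaker_a", style="news")),
    SpeechSegment("HH", (0.0, 0.3), AttributeVector("speaker_b")),
]
print(dataset_sizes(database))
```

Keeping the speaker identity on the segment itself, rather than in a separate per-speaker table, lets a selection routine mix segments from several speakers while still knowing where each one came from.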

17 Claims

1. A method comprising:
- receiving a text word; and
- in response to receiving the text word, concatenating, by a data processor coupled to a memory, pre-recorded speech segments that are derived from a plurality of speakers to form audio data configured to generate an audible speech word that corresponds to the text word,
  - wherein concatenating the pre-recorded speech segments comprises selecting speech segments for concatenation based on at least one cost function,
  - where the at least one cost function comprises a first cost function where a cost of a speech segment from a particular speaker of the plurality of speakers is based at least in part on a size of a dataset comprising pre-recorded speech segments from the particular speaker as compared to sizes of other datasets each comprising pre-recorded speech segments from other speakers in the plurality of speakers,
  - where the first cost function assigns a first cost for a first speech segment from a first speaker of the plurality of speakers that is lower than a second cost for a second speech segment from a second speaker of the plurality of speakers,
  - where a first size of pre-recorded speech segments in a first dataset from the first speaker is greater than a second size of pre-recorded speech segments in a second dataset from the second speaker.
- Dependent claims (2 to 9):
  - 2. The method as in claim 1, where each pre-recorded speech segment comprises an attribute vector, and each attribute vector comprises a vector element that identifies the speaker from which the speech segment was derived.
  - 3. The method as in claim 2, where each attribute vector further comprises another vector element that identifies a style of speech from which the speech segment was derived.
  - 4. The method of claim 2, where the attribute vector of a pre-recorded speech segment comprises a cost matrix which specifies a cost of using the pre-recorded speech segment for each potential target speaker.
  - 5. The method as in claim 1, where the pre-recorded speech segments are pre-recorded by a process that comprises designating one speaker as a target speaker, examining an input speech segment to determine if it is similar to a corresponding speech segment of the target speaker and, if it is not, modifying at least one characteristic of the input speech segment to make it more similar to the corresponding speech segment of the target speaker.
  - 6. The method as in claim 5, where modifying comprises altering at least one of a temporal or a spectral characteristic of the input speech segment.
  - 7. The method as in claim 1, where a speech segment comprises at least one of a phoneme, a syllable, and a word.
  - 8. The method as in claim 1, where at least some of the pre-recorded speech segments are derived from a speaker by sampling, digitizing and partitioning spoken words into word units.
  - 9. The method of claim 1, where the audible speech word is an audible speech word that sounds as though spoken by a target speaker.
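The dataset-size cost in claim 1 can be illustrated with a short sketch. The claim only requires that a segment from a speaker with a larger dataset receives a lower cost than one from a speaker with a smaller dataset; the specific formula `1 - size / total`, the speaker-switch penalty, the dynamic-programming search, and every name below are assumptions chosen for illustration, not the patented method.

```python
from collections import Counter


def speaker_costs(speaker_of_each_segment):
    """Assumed formula: cost falls as a speaker's share of the database grows."""
    sizes = Counter(speaker_of_each_segment)
    total = sum(sizes.values())
    return {spk: 1.0 - n / total for spk, n in sizes.items()}


def select_segments(candidates, costs, switch_penalty=0.3):
    """Pick one (speaker, segment_id) per position by dynamic programming.

    candidates: one list of (speaker, segment_id) options per target unit.
    switch_penalty: hypothetical join cost for changing speaker mid-word.
    """
    # best[i][j] = (cumulative cost, index of best predecessor)
    best = [[(costs[spk], None) for spk, _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for spk, _ in candidates[i]:
            options = []
            for k, (prev_spk, _) in enumerate(candidates[i - 1]):
                join = 0.0 if prev_spk == spk else switch_penalty
                options.append((best[i - 1][k][0] + join + costs[spk], k))
            row.append(min(options))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


# Speaker A contributed 6 segments and speaker B only 2 (made-up counts).
costs = speaker_costs(["A"] * 6 + ["B"] * 2)
candidates = [
    [("A", "a1"), ("B", "b1")],
    [("B", "b2")],              # only speaker B recorded this unit
    [("A", "a3"), ("B", "b3")],
]
print(costs)
print(select_segments(candidates, costs))
```

In this toy run the search prefers speaker A wherever A has a candidate and falls back to speaker B only for the unit A never recorded, which is the behavior the size-based cost is meant to encourage.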

10. An apparatus comprising:
- a memory configured to store pre-recorded speech segments that are derived from a plurality of speakers; and
- a data processor configured to, in response to receiving a text word, concatenate the pre-recorded speech segments to form audio data configured to generate an audible speech word that corresponds to the text word,
  - wherein concatenating the pre-recorded speech segments comprises selecting speech segments for concatenation based on at least one cost function,
  - where the at least one cost function comprises a first cost function where a cost of a speech segment from a particular speaker of the plurality of speakers is based at least in part on a size of a dataset comprising pre-recorded speech segments from the particular speaker as compared to sizes of other datasets each comprising pre-recorded speech segments from other speakers in the plurality of speakers,
  - where the first cost function assigns a first cost for a first speech segment from a first speaker of the plurality of speakers that is lower than a second cost for a second speech segment from a second speaker of the plurality of speakers,
  - where a first size of pre-recorded speech segments in a first dataset from the first speaker is greater than a second size of pre-recorded speech segments in a second dataset from the second speaker.
- Dependent claims (11 to 13):
  - 11. The apparatus of claim 10, where each pre-recorded speech segment comprises an attribute vector, and each attribute vector comprises a vector element that identifies the speaker from which the pre-recorded speech segment was derived.
  - 12. The apparatus of claim 11, where the attribute vector of a pre-recorded speech segment comprises a cost matrix which specifies a cost of using the pre-recorded speech segment for each potential target speaker.
  - 13. The apparatus of claim 11, where each attribute vector further comprises another vector element that identifies a style of speech from which the pre-recorded speech segment was derived.
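The per-target-speaker cost matrix mentioned in claims 4, 12 and 16 might be stored alongside each segment roughly as follows. This is a sketch under stated assumptions: the dictionary layout, the field names `unit`, `speaker`, `id` and `target_cost`, the helper `cheapest_for_target`, and all cost values are hypothetical.

```python
def cheapest_for_target(segments, unit, target_speaker):
    """Return the segment of the given unit that is cheapest for this target.

    Each segment carries a 'target_cost' mapping, standing in for one row of
    the cost matrix: target speaker -> cost of using the segment for them.
    """
    pool = [
        seg for seg in segments
        if seg["unit"] == unit and target_speaker in seg["target_cost"]
    ]
    if not pool:
        return None
    return min(pool, key=lambda seg: seg["target_cost"][target_speaker])


# Made-up segments: using a speaker's own recording is cheapest for that target.
segments = [
    {"id": "a1", "unit": "HH", "speaker": "A",
     "target_cost": {"A": 0.0, "B": 0.6}},
    {"id": "b1", "unit": "HH", "speaker": "B",
     "target_cost": {"A": 0.4, "B": 0.0}},
    {"id": "c1", "unit": "HH", "speaker": "C",
     "target_cost": {"A": 0.2, "B": 0.9}},
]
print(cheapest_for_target(segments, "HH", "B")["id"])
```

Storing the costs per segment means the same database can serve several target voices; only the lookup key changes between them.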

14. A computer readable medium tangibly embodying a program of instructions executable by a machine to perform operations, the operations comprising:
- in response to receiving a text word, concatenating pre-recorded speech segments that are derived from a plurality of speakers to form audio data configured to generate an audible speech word that corresponds to the text word,
  - wherein concatenating the pre-recorded speech segments comprises selecting speech segments for concatenation based on at least one cost function,
  - where the at least one cost function comprises a first cost function where a cost of a speech segment from a particular speaker of the plurality of speakers is based at least in part on a size of a dataset comprising pre-recorded speech segments from the particular speaker as compared to sizes of other datasets each comprising pre-recorded speech segments from other speakers in the plurality of speakers,
  - where the first cost function assigns a first cost for a first speech segment from a first speaker of the plurality of speakers that is lower than a second cost for a second speech segment from a second speaker of the plurality of speakers,
  - where a first size of pre-recorded speech segments in a first dataset from the first speaker is greater than a second size of pre-recorded speech segments in a second dataset from the second speaker.
- Dependent claims (15 to 17):
  - 15. The computer readable medium of claim 14, where each pre-recorded speech segment comprises an attribute vector, and each attribute vector comprises a vector element that identifies the speaker from which the pre-recorded speech segment was derived.
  - 16. The computer readable medium of claim 15 where the attribute vector of a pre-recorded speech segment comprises a cost matrix which specifies a cost of using the pre-recorded speech segment for each potential target speaker.
  - 17. The computer readable medium of claim 15, where each attribute vector further comprises another vector element that identifies a style of speech from which the pre-recorded speech segment was derived.

Current Assignee: Cerence Inc., Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Eide, Ellen M.; Aaron, Andrew S.; Shuang, Zhi Wei; Picheny, Michael A.; Hamza, Wael M.; Smith, Maria E.; Rutherfoord, Charles T.
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): He, Jialong
Application Number: US11/101,223
Publication Number: US 20060229876A1
Time in Patent Office: 1,860 Days
Field of Search: 704/258-269
US Class Current: 704/258
CPC Class Codes:
G10L 13/07 Concatenation rules
G10L 2021/0135 Voice conversion or morphing
